Building a distributed software environment for CDF within the ESLEA framework

V. Bartsch, M. Lancaster

University College London

CDF experiment

Located at Fermilab, near Chicago: proton/anti-proton collisions at the Tevatron at a centre-of-mass energy of 1.96 TeV

CDF

Multipurpose detector with discovery potential for the Higgs boson, used for studies of b physics and measurements of Standard Model parameters

Integrated luminosity of about 1 fb⁻¹ per year

Principle of data analysis

raw data (40 MB/s, 2 TB/day) → reconstruction (assign particle momenta, tracks, etc.) → reco data → user selection → user data → user analysis

MC (Monte Carlo): simulation of the events

User analysis is performed by ~800 physicists in ~60 institutes
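The two raw-data rates in the diagram can be related with a quick calculation. This is a minimal sketch assuming decimal units (1 MB = 1e6 bytes, 1 TB = 1e12 bytes); reading 40 MB/s as a logging rate that is not sustained around the clock is an interpretation, not something the slide states.

```python
# Relating the two raw-data rates quoted above (40 MB/s and 2 TB/day).
# Assumption: decimal units, 1 MB = 1e6 bytes and 1 TB = 1e12 bytes.
SECONDS_PER_DAY = 24 * 60 * 60


def tb_per_day(mb_per_s: float) -> float:
    """Volume written in one day if the rate were sustained around the clock."""
    return mb_per_s * 1e6 * SECONDS_PER_DAY / 1e12


def implied_live_fraction(mb_per_s: float, tb_day: float) -> float:
    """Fraction of the day the rate must be sustained to reach tb_day."""
    return tb_day / tb_per_day(mb_per_s)


if __name__ == "__main__":
    print(f"40 MB/s non-stop = {tb_per_day(40):.3f} TB/day")
    print(f"2 TB/day needs 40 MB/s for {implied_live_fraction(40, 2):.0%} of the day")
```

Under these assumptions 40 MB/s sustained all day would give about 3.5 TB, so 2 TB/day corresponds to the rate being reached for roughly 58% of the day.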

CDF – data handling requirements

The experiment has ~800 physicists, of whom ~50 are in the UK.

The experiment produces large amounts of data, which are stored in the US:

• ~1000 TB per year
• ~2000 TB of data stored to date, expected to rise to 10,000 TB by 2008

UK physicists need to be able to:
• copy datasets (~0.5-10 TB) quickly to the UK
• create MC data within the UK
• share data with other UK physicists and other CDF physicists worldwide
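What "quickly" means for a 0.5-10 TB dataset depends directly on the link speed. The sketch below assumes decimal units and a single effective throughput; the 100 Mbit/s and 1 Gbit/s link speeds are illustrative values, not figures from the slides.

```python
# Time to copy a dataset over a wide-area link of a given speed.
# Assumptions: decimal units (1 TB = 1e12 bytes) and a single
# effective throughput; the link speeds below are illustrative only.


def transfer_hours(size_tb: float, link_gbit_s: float, efficiency: float = 1.0) -> float:
    """Hours to move size_tb terabytes at link_gbit_s, scaled by efficiency (0-1]."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = size_tb * 1e12 * 8
    return bits / (link_gbit_s * 1e9 * efficiency) / 3600


if __name__ == "__main__":
    for size in (0.5, 10):            # dataset range quoted above
        for link in (0.1, 1.0):       # 100 Mbit/s vs 1 Gbit/s (hypothetical links)
            print(f"{size:>4} TB over {link} Gbit/s: {transfer_hours(size, link):7.1f} h")
```

At full line rate a 10 TB dataset takes about 22 hours at 1 Gbit/s but more than 9 days at 100 Mbit/s, which is why high-bandwidth paths matter for this requirement.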

data handling numbers

• CDF has acquired:
    590 TB raw data
    660 TB reconstructed data
    280 TB MC
    1530 TB total
• produces 1 PB/year nowadays, expected to rise to 10 PB by 2008
• Fermilab alone serves about 18 TB/day
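A short check confirms that the table adds up and puts the 18 TB/day serving rate in proportion to the stored total. Comparing served volume with the stored total is only for scale; in practice popular files are read many times, so this is not a statement about how often each file is read.

```python
# Consistency check of the acquired-data table and a rough feel for the
# quoted 18 TB/day serving rate. Assumption: the served volume is compared
# with the stored total only for scale; real reads repeat popular files.
ACQUIRED_TB = {"raw": 590, "reconstructed": 660, "mc": 280}
SERVED_TB_PER_DAY = 18


def total_tb(volumes: dict) -> int:
    """Sum of the per-category volumes in TB."""
    return sum(volumes.values())


def days_to_serve(total: float, per_day: float) -> float:
    """Days of serving at per_day needed to move a volume equal to total."""
    return total / per_day


if __name__ == "__main__":
    total = total_tb(ACQUIRED_TB)
    print(f"total = {total} TB")
    print(f"18 TB/day moves the equivalent of the whole sample every "
          f"{days_to_serve(total, SERVED_TB_PER_DAY):.0f} days")
```

The three categories sum to the quoted 1530 TB, and at 18 TB/day Fermilab serves a volume equal to the entire acquired sample about every 85 days.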

[Plot: bytes read, in TB]

CDF batch computing

• 2 types of activities:
  – organized processing
      • raw data reconstruction
      • data reduction for different physics groups
      • MC production
  – user analysis
      • need to be able to copy datasets (0.5-10 TB)

Both use large amounts of CPU; the same tools are used for all

CDF Grid Philosophy

CDF adopted Grid concepts quite late in data taking, when it already had mature software

The look & feel of the old data handling system was maintained; reliability is the main issue

Use the existing infrastructure as a portal and change the software underneath

CDF Analysis Farm (CAF)

Submit and forget until an e-mail arrives

Takes care of all job handling and of the negotiation with the data handling system, without the user having to know about it

• a CDF batch job contains a tarball with all the needed scripts, binaries and shared libraries, and the job's output is sent back as a tarball to the output location
• users need to authenticate with their Kerberos ticket
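The self-contained job idea can be sketched with Python's standard `tarfile` module. This is a hypothetical illustration of packing a job, not the CAF's own submission code: the file names are made up and Kerberos authentication is left out.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def make_job_tarball(files: dict, out_path: Path) -> list:
    """Pack scripts, binaries and shared libraries into one gzipped tarball.

    `files` maps the path inside the tarball to the file contents, so the
    job can be unpacked and run on a worker node with nothing pre-installed.
    Returns the member names as read back from the finished tarball.
    """
    with tarfile.open(out_path, "w:gz") as tar:
        for arcname, payload in sorted(files.items()):
            info = tarfile.TarInfo(name=arcname)
            info.size = len(payload)
            info.mode = 0o755  # keep scripts and binaries executable after unpacking
            tar.addfile(info, io.BytesIO(payload))
    with tarfile.open(out_path, "r:gz") as tar:
        return tar.getnames()


if __name__ == "__main__":
    # Hypothetical job contents; the real CAF tools build their own tarball.
    job = {
        "run_job.sh": b"#!/bin/sh\n./bin/myAnalysis \"$@\"\n",
        "bin/myAnalysis": b"placeholder for a compiled binary",
        "lib/libMyAnalysis.so": b"placeholder for a shared library",
    }
    out = Path(tempfile.mkdtemp()) / "job.tgz"
    print(make_job_tarball(job, out))
```

Bundling everything into one archive is what makes the job independent of any software pre-installed on the worker node.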

CAF: evolution over time

CDF used several batch systems and distribution mechanisms:
• FBSNG
• Condor
• Condor with Globus
• gLite WMS

The CAF could be distributed and run on non-dedicated resources; gLite WMS helps it run on EGEE sites

Grid based

Used as production systems

Condor-based GRID CAF

[Figure: pull model. User jobs wait in the Schedd; Globus assigns Grid nodes to VOs; glide-ins (with Starters) on those Grid nodes run the user jobs; the Negotiator, using the Collector and the user priorities, assigns nodes to jobs.]
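The pull model can be illustrated with a toy negotiator. This is a simplified sketch, not Condor's actual matchmaking: one static priority per user stands in for the globally managed CDF priorities, and all class, node and user names are hypothetical.

```python
from __future__ import annotations

import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class QueuedJob:
    priority: float            # lower value is served first (toy user priority)
    seq: int                   # submission order breaks ties
    user: str = field(compare=False)
    name: str = field(compare=False)


class ToyNegotiator:
    """Pull model: a glide-in asks for work only after it holds a batch slot."""

    def __init__(self, user_priority: dict[str, float]):
        self.user_priority = user_priority   # one CDF-wide table for all sites
        self.queue: list[QueuedJob] = []
        self._seq = count()

    def submit(self, user: str, name: str) -> None:
        job = QueuedJob(self.user_priority[user], next(self._seq), user, name)
        heapq.heappush(self.queue, job)

    def match(self, glidein_node: str) -> tuple[str, str] | None:
        # Called when a glide-in on a Grid node reports a free slot. A broken
        # node fails before calling match(), so no user job is handed to it.
        if not self.queue:
            return None
        return glidein_node, heapq.heappop(self.queue).name


def demo() -> list:
    neg = ToyNegotiator({"alice": 0.5, "bob": 2.0})   # hypothetical users
    neg.submit("bob", "bob-1")
    neg.submit("alice", "alice-1")
    neg.submit("bob", "bob-2")
    return [neg.match(n) for n in ("node1", "node2", "node3", "node4")]


if __name__ == "__main__":
    print(demo())
```

Because jobs are chosen only once a slot already exists, the priority decision is made centrally and late, which is the property listed among the pros of the Condor-based Grid CAF.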

gLite WMS-based GRID CAF

[Figure: push model. User jobs go from the Schedd to the Resource Broker, which pushes them through Globus to the Grid nodes.]
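By contrast, in a push model a broker commits each job to a site up front. The toy broker below is only loosely inspired by the idea of requirements plus ranking in gLite WMS matchmaking; it does not reproduce the real JDL syntax or algorithm, and the site names and VOs are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Site:
    name: str
    free_cpus: int                 # as published to an information system (toy value)
    supported_vos: frozenset[str]


def broker_push(job_vo: str, sites: list[Site]) -> str | None:
    """Pick a site for one job and commit the job to it (push model)."""
    # Requirement: the site must accept the job's VO and have a free CPU.
    candidates = [s for s in sites if job_vo in s.supported_vos and s.free_cpus > 0]
    if not candidates:
        return None
    # Rank: prefer the site advertising the most free CPUs (ties keep list order).
    best = max(candidates, key=lambda s: s.free_cpus)
    best.free_cpus -= 1            # the job now waits in that site's batch queue
    return best.name


def demo() -> list:
    sites = [
        Site("site-a", 2, frozenset({"cdf"})),
        Site("site-b", 5, frozenset({"atlas"})),
        Site("site-c", 1, frozenset({"cdf", "atlas"})),
    ]
    return [broker_push("cdf", sites) for _ in range(4)]


if __name__ == "__main__":
    print(demo())
```

Once pushed, a job is subject to the receiving site's local scheduling policy rather than a CDF-wide one.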

Pros & Cons

Condor-based Grid CAF
Pros:
• Globally managed user and job priorities within CDF
• Broken nodes kill Condor daemons, not user jobs
• Resource selection done after a batch slot is secured
Cons:
• Uses a single service proxy for all jobs to enter Grid sites
• Requires outgoing connectivity

gLite WMS-based Grid CAF
Pros:
• LCG-backed tools
• No need for external connectivity
• Grid sites can manage users
Cons:
• No global fair share for CDF
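What a global fair share provides can be illustrated with a toy priority calculation in which each user's past CPU use fades with a half-life and the user with the least recent use is served next. Condor's user priorities also decay over time, but the exponential formula, the units (CPU-hours, hours) and the 24-hour half-life here are assumptions chosen for illustration.

```python
# Toy "global fair share": each user's recent CPU use decays with a half-life,
# and the user with the lowest decayed usage is served next.
# Assumptions: usage samples in CPU-hours, times in hours, 24 h half-life.


def decayed_usage(samples: list, now: float, half_life: float) -> float:
    """Sum of CPU-hours, each weighted by 0.5 ** (age / half_life)."""
    return sum(cpu_h * 0.5 ** ((now - t) / half_life) for t, cpu_h in samples)


def next_user(usage: dict, now: float, half_life: float = 24.0) -> str:
    """User with the smallest decayed usage, i.e. the one owed the most share."""
    return min(usage, key=lambda u: decayed_usage(usage[u], now, half_life))


if __name__ == "__main__":
    usage = {
        "alice": [(0.0, 100.0)],   # 100 CPU-h two half-lives ago -> counts as 25
        "bob": [(48.0, 40.0)],     # 40 CPU-h just now -> counts as 40
    }
    print(next_user(usage, now=48.0))
```

With a single CDF-wide table like this, usage on any site counts against the same user; when each Grid site keeps its own accounting, no such global view exists.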

gLite WMS-based GRID CAF

• at FNAL, CAF worker nodes used to have the CDF software distribution NFS-mounted; this is not an option in the Grid world

• all production jobs are now self-contained

• trying Parrot to distribute the CDF software over HTTP in analysis jobs
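Parrot lets unmodified programs read remote files, including over HTTP, by intercepting their system calls. The sketch below does not reproduce that mechanism; it only makes the underlying fetch-on-first-use-and-cache idea explicit. The server URL and library name are hypothetical, and a fake fetcher is used so the example runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def http_fetch(url: str) -> bytes:
    """Download one file over plain HTTP."""
    with urlopen(url) as resp:
        return resp.read()


class LazySoftwareCache:
    """Fetch software files over HTTP on first use, then serve them from local disk."""

    def __init__(self, base_url: str, cache_dir: Path,
                 fetch: Callable[[str], bytes] = http_fetch):
        self.base_url = base_url.rstrip("/")
        self.cache_dir = cache_dir
        self.fetch = fetch

    def path(self, relpath: str) -> Path:
        local = self.cache_dir / relpath
        if not local.exists():                   # only files actually used are downloaded
            local.parent.mkdir(parents=True, exist_ok=True)
            local.write_bytes(self.fetch(f"{self.base_url}/{relpath}"))
        return local


def demo(cache_dir: Path) -> tuple:
    calls = []

    def fake_fetch(url: str) -> bytes:           # stands in for the HTTP server
        calls.append(url)
        return b"shared library bytes"

    cache = LazySoftwareCache("http://software.example.invalid/cdf", cache_dir, fake_fetch)
    first = cache.path("lib/libExample.so").read_bytes()
    cache.path("lib/libExample.so")              # second use hits the local copy
    return len(calls), first


if __name__ == "__main__":
    print(demo(Path(tempfile.mkdtemp())))
```

Only the files a job actually touches cross the network, which is what makes HTTP distribution a workable replacement for an NFS-mounted release.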

Some numbers

[Chart: average usable VMs (virtual machines) and number of jobs on the CAF, for FNAL, remote dedicated resources, Condor-based Grid CAFs, LCGCaf and FermiGrid; values shown: 2610, 1335, 858, 800, 261]

Data handling system SAM

• SAM manages file storage
  – data files are stored in tape systems at FNAL and elsewhere (most use ENSTORE at FNAL)
  – files are cached around the world for fast access

• SAM manages file delivery
  – users at FNAL and at remote sites retrieve files transparently out of file storage; SAM handles caching for efficiency

• SAM manages file cataloging
  – the SAM DB holds metadata for each file, transparently to the user

• SAM manages analysis bookkeeping
  – SAM remembers which files you ran over, which files you processed, which applications you ran, and when and where you ran them
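The delivery, caching and bookkeeping roles listed above can be illustrated with a toy station. This is not SAM's interface: the class, method and project names are hypothetical, least-recently-used eviction is an assumed cache policy, and the metadata catalogue and timestamps are omitted.

```python
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass
class ToyStation:
    """Toy SAM-like station: an LRU disk cache in front of tape, plus bookkeeping."""
    cache_slots: int                                   # how many files fit in the cache
    cache: OrderedDict = field(default_factory=OrderedDict)
    tape_reads: int = 0
    history: list = field(default_factory=list)        # (project, app, file, source)

    def deliver(self, project: str, app: str, filename: str) -> str:
        if filename in self.cache:
            self.cache.move_to_end(filename)           # mark as most recently used
            source = "cache"
        else:
            self.tape_reads += 1                       # stage in from tape storage
            self.cache[filename] = True
            if len(self.cache) > self.cache_slots:
                self.cache.popitem(last=False)         # evict least recently used
            source = "tape"
        # Bookkeeping: which file went to which project and application, and from where.
        self.history.append((project, app, filename, source))
        return source

    def files_for(self, project: str) -> list:
        """Files delivered to a project, in delivery order."""
        return [h[2] for h in self.history if h[0] == project]


def demo() -> tuple:
    st = ToyStation(cache_slots=2)
    sources = [st.deliver("proj-1", "myAnalysis", f)
               for f in ("f1", "f2", "f1", "f3", "f2")]
    return sources, st.tape_reads, st.files_for("proj-1")


if __name__ == "__main__":
    print(demo())
```

The user only ever asks for a file by name; whether it comes from cache or tape, and the record of what was processed, are handled by the station.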

World-wide distribution of SAM stations

[Map: selected SAM stations, including FNAL]

CDF:
• 10k/20k files declared/day
• 15k files consumed/day
• 8 TB of files consumed/day

Main consumption of data still central; remote use on the rise
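The daily CDF figures (15k files and 8 TB consumed per day) imply an average file size and an average delivery rate. The sketch assumes decimal units and an even spread of the volume over all files and over 24 hours, so real peaks are higher than these averages.

```python
# Averages implied by the daily CDF SAM figures quoted above.
# Assumptions: decimal units (1 TB = 1e12 bytes, 1 MB = 1e6 bytes) and
# that the 8 TB/day is spread evenly over the 15k files and over 24 hours.
SECONDS_PER_DAY = 24 * 60 * 60


def avg_file_size_mb(tb_per_day: float, files_per_day: float) -> float:
    """Mean size of a consumed file in MB."""
    return tb_per_day * 1e12 / files_per_day / 1e6


def avg_rate_mb_s(tb_per_day: float) -> float:
    """Mean delivery rate in MB/s over a full day."""
    return tb_per_day * 1e12 / SECONDS_PER_DAY / 1e6


if __name__ == "__main__":
    print(f"average file size ~ {avg_file_size_mb(8, 15_000):.0f} MB")
    print(f"average delivery rate ~ {avg_rate_mb_s(8):.1f} MB/s")
```

This gives files of roughly 530 MB on average and a sustained delivery rate of about 93 MB/s.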

[Plot: total CDF files to user, scale up to 300 TB, with the test deployment marked]

summary & outlook

• UCL-HEP cluster deployed
• UCL-CCC cluster still to come
• SAM and the CAF need to be better integrated
• user feedback needs to be collated

Documents
