The ALICE Grid
ALICE-FAIR Computing meeting, 29 April 2008
Latchezar Betev


Page 1: The ALICE Grid

The ALICE Grid

ALICE-FAIR Computing meeting, 29 April 2008

Latchezar Betev

Page 2: The ALICE Grid

The ALICE Grid basics

• Single user interface – AliEn
  • Catalogue
  • Workload management
  • Storage management
• The AliEn components – see the presentation of P.Saiz here
• Interfaces hide all of the bare (and rather ugly) Grid plumbing from the user
  • And that includes the various Grid implementations and standards around the world

ALICE-FAIR meeting

Page 3: The ALICE Grid

The ALICE Grid in numbers

• 65 participating sites
  • 1 T0 (CERN/Switzerland)
  • 6 T1s (France, Germany, Italy, The Netherlands, Nordic DataGrid Facility, UK)
  • 58 T2s spread over 4 continents
    • T2s in Germany – GSI and Muenster
• As of today the ALICE share is some 7000 (out of ~30000 total) CPUs and 1.5 PB of distributed storage
  • In ½ year: ~15K CPUs, x2 storage


Page 4: The ALICE Grid

The ALICE Grid history

• First AliEn prototype in 2002
  • Vertical (full) Grid implementation
  • Some 15 sites, MC production, storage at a single site
• 2003-2005 – development of various Grid interfaces for AliEn
  • Horizontal (central services + site services) implementation
  • Some 30 sites, MC production, storage (still) at a single site
  • There were interfaces to many (raw) storage systems, but no single client library support


Page 5: The ALICE Grid

The ALICE Grid history (2)

• 2006-2008 – refinement of services, Grid sites buildup, AliEn Catalogue updates, xrootd as a single supported I/O protocol, user analysis
  • Majority of sites integrated (4-6 more expected)
  • Standard high-volume MC production
  • Central services in full production regime
  • Rapid deployment (and use) of storage with xrootd support
    • Standard LCG SEs (DPM, CASTOR2, dCache) and xrootd as pool manager
  • User analysis on the Grid
    • Not as bad as everyone expected


Page 6: The ALICE Grid

The ALICE Grid Map

[Map of participating sites across Europe, Asia, North America and Africa – the live picture is linked from the slide]

Page 7: The ALICE Grid

Operation

• All sites provide resources through the WLCG gateways … or directly
• And software: gLite (EGEE), OSG (US), ARC (NDGF), local batch systems
• A (very short) hint of the existing complexity (only for workload management):
  • gLite: edg-job-submit, glite-wms-job-submit (…RB, CE, cancel, check, etc…)
  • ARC: ngsub (…cluster, type, etc…)
  • Local: bsub, qsub (well known to many)
  • OSG: globus-job-run (…cluster, proxy type, etc…)
• All of the above is replaced by AliEn ‘submit’
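The idea of one ‘submit’ hiding several backend submission commands can be sketched as a small dispatcher. This is a hypothetical illustration, not AliEn's actual implementation; only the command names come from the slide, and the dispatcher, its table, and the function name are made up for the example.

```python
# Hypothetical sketch: one uniform 'submit' front end mapping to the
# backend-specific workload-management commands listed on the slide.
BACKEND_SUBMIT = {
    "glite": "glite-wms-job-submit",  # EGEE sites
    "arc":   "ngsub",                 # NDGF sites
    "osg":   "globus-job-run",        # US sites
    "lsf":   "bsub",                  # local batch (LSF)
    "pbs":   "qsub",                  # local batch (PBS)
}

def build_submit_command(backend, job_description):
    """Translate a uniform submit request into a site-specific command line."""
    if backend not in BACKEND_SUBMIT:
        raise ValueError("unknown backend: " + backend)
    # Real tools take many more options; a single positional argument
    # is enough to show the translation step.
    return [BACKEND_SUBMIT[backend], job_description]
```

The user-facing interface stays constant while the table absorbs the per-middleware differences, which is the point the slide makes about hiding the "Grid plumbing".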


Page 8: The ALICE Grid

Operation (2)

Services schema:
• Central AliEn services
• Site VO-boxes (one per site)
• WMS (gLite/ARC/OSG/Local)
• SM (dCache/DPM/CASTOR/xrootd)
• Monitoring, package management

The VO-box system (very controversial in the beginning):
• Has been extensively tested
• Allows for site services scaling
• Is a simple isolation layer for the VO in case of troubles


Page 9: The ALICE Grid

Operation – central/site support

• Central services support (2 FTE equivalent)
  • There are no experts who do exclusively support – there are 7 highly-qualified experts doing development/support
• Site services support – handled by ‘regional experts’ (one per country) in collaboration with local cluster administrators
  • Extremely important part of the system
  • In normal operation ~0.2 FTE/region
• Regular discussions with everybody and active all-activities mailing lists

Page 10: The ALICE Grid

Operation – critical elements (2)

• Central services, VO-boxes, storage servers – capable of running 24/7, maintenance free
  • Get the best hardware money can buy
• Multiple service instances in failover configuration
  • AliEn services are enabled for this
• Use of ‘load balanced’ DNS aliases for service endpoints
  • Load-balancer is external to the system
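The failover pattern behind a load-balanced alias can be sketched as follows. This is an illustrative model, not the production setup: the class, the alias name, and the health-check callback are all hypothetical, and a fixed rotating list stands in for the rotating A records a real DNS alias would return.

```python
# Illustrative sketch: one logical endpoint backed by several service
# instances; the client fails over until a healthy instance answers.
import itertools

class LoadBalancedAlias:
    """Round-robin over the addresses registered behind one alias."""
    def __init__(self, alias, addresses):
        self.alias = alias
        self._cycle = itertools.cycle(addresses)

    def resolve(self):
        # A real load-balanced DNS alias rotates its A records;
        # here we simply cycle through a fixed list.
        return next(self._cycle)

def connect(alias, is_healthy, attempts=3):
    """Try successive instances; raise only if every attempt fails."""
    for _ in range(attempts):
        addr = alias.resolve()
        if is_healthy(addr):
            return addr
    raise ConnectionError("no healthy instance behind " + alias.alias)
```

Keeping the balancing outside the services themselves (as the slide notes) means an instance can be taken down for maintenance without clients ever seeing the individual host names.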

Page 11: The ALICE Grid

Operation – critical elements (3)

• Monitoring
  • Fast and detailed – AliEn command line interface
  • History – MonALISA
  • There is never enough monitoring, but if not careful, it can saturate the system (and the expert)
• Automatic tools for production and data management
  • Lightweight Production Manager – for MC and RAW data production
  • Lightweight Transfer Manager – for data replication (sparse storage resources)
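The caution that monitoring can saturate the system suggests client-side throttling. The sketch below is a generic illustration of that idea, assuming nothing about MonALISA's actual API; the class name, parameters, and metric names are all invented for the example.

```python
# Hedged sketch: a reporter that drops metric updates arriving faster
# than a minimum interval, so monitoring traffic stays bounded.
import time

class ThrottledReporter:
    """Suppress reports sent less than min_interval seconds apart."""
    def __init__(self, min_interval, clock=time.monotonic):
        self.min_interval = min_interval
        self._clock = clock       # injectable clock, handy for testing
        self._last = float("-inf")
        self.sent = []            # stand-in for shipping to a collector

    def report(self, metric, value):
        now = self._clock()
        if now - self._last < self.min_interval:
            return False          # dropped: too soon after the last report
        self._last = now
        self.sent.append((metric, value))
        return True
```

The trade-off is the one the slide hints at: a longer interval protects the collector (and the expert reading it) at the cost of temporal resolution.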

Page 12: The ALICE Grid

Status of PDC’07

Central services setup

All started with a bunch of boxes…
• Linux 32- and 64-bit build servers
• MacOS build server
• MonALISA repository
• 3 TB xrootd disk servers: application software, conditions data, user files
• AliEn services: Proxy, Authen, JobBroker, JobOptimizer, TransferOptimizer, etc.
• MySQL DB (replicated) for AliEn Catalogue and Task Queue
• API servers
• Alien.cern.ch

Page 13: The ALICE Grid

Status of PDC’07

Central services upgrade

All started with a bunch of boxes…

Versatile expertise required!

Page 14: The ALICE Grid

Running profile – all

[Plot: number of running jobs over time; annotations mark the last ‘intrusive’ AliEn update and the gLite 3.1 migration]

Page 15: The ALICE Grid

Running profile – users

[Plot: number of user jobs over time – upward slope, but slow]

Page 16: The ALICE Grid

Sites contribution

• 50% resources contribution from T2s!
• Harnessing the power of small computing centres is a must

Page 17: The ALICE Grid

GRID waiting times

[Plot: standard user distribution of waiting time in the queue vs running time]

Page 18: The ALICE Grid

Current activity in ALICE

• A substantial part of the resources is used to produce Monte-Carlo data for physics and detector studies
  • An increasing number of physicists are using the Grid in their daily work
• Since December 2007 the ALICE detector is being commissioned with cosmic ray trigger
  • Reconstruction and analysis of this data is ongoing
• Ramping up of CPU and storage capacity in preparation for the LHC startup – expected in summer 2008


Page 19: The ALICE Grid

Summary

• The ALICE Grid is getting ready for the LHC data production and analysis
  • It took 6 ‘short’ years to get there
• The main system components have been battle-hardened
  • Development and simultaneous heavy use
• The ‘single interface to the Grid’ is a must
  • Otherwise the Grid will be limited to a ‘selected few’
• Integration of computing resources into a coherent working system takes a lot of time and effort
  • This is a combined effort between the site and Grid experts, which has to be repeated n times (n = number of sites)
  • A trust relation depends not only on the high quality of the Grid software


Page 20: The ALICE Grid

Summary (2)

• It is never too early to start user analysis on the Grid
  • It takes time to ‘convert’ users from local to global reality
  • And the conversion is not without pain
• The experts should have time to learn how to do Grid user support
