The ALICE Grid
ALICE-FAIR Computing meeting
29 April 2008
Latchezar Betev
The ALICE Grid basics
• Single user interface – AliEn
  • Catalogue
  • Workload management
  • Storage management
• The AliEn components – see presentation of P.Saiz here
• Interfaces hide all of the bare (and rather ugly) Grid plumbing from the user
  • And that includes the various Grid implementations and standards around the world
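The "single user interface" idea above can be sketched as a thin client façade over the catalogue and storage backends. This is a minimal illustration, not the actual AliEn API: `GridClient`, `Catalogue`, the `cp` method, and the host names are all hypothetical.

```python
# Minimal sketch of a "single user interface" facade over catalogue and
# storage backends. All class, method and host names are hypothetical
# illustrations, not the real AliEn API.

class Catalogue:
    """Maps logical file names (LFNs) to physical replicas (PFNs)."""
    def __init__(self):
        self._replicas = {}

    def register(self, lfn, pfn):
        self._replicas.setdefault(lfn, []).append(pfn)

    def lookup(self, lfn):
        return list(self._replicas.get(lfn, []))

class GridClient:
    """One entry point that hides the underlying Grid plumbing."""
    def __init__(self):
        self.catalogue = Catalogue()

    def cp(self, lfn, storage_element):
        # Storage management: place the file on the chosen SE and
        # record the new replica in the catalogue.
        pfn = f"root://{storage_element}//{lfn.lstrip('/')}"
        self.catalogue.register(lfn, pfn)
        return pfn

client = GridClient()
pfn = client.cp("/alice/sim/2008/run1/galice.root", "se01.example.org:1094")
print(pfn)  # root://se01.example.org:1094//alice/sim/2008/run1/galice.root
```

The point of the sketch is that the user only ever sees logical names and one client object; which SE, protocol, or middleware sits underneath is invisible.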
ALICE-FAIR meeting
The ALICE Grid in numbers
• 65 participating sites
  • 1 T0 (CERN/Switzerland)
  • 6 T1s (France, Germany, Italy, The Netherlands, Nordic DataGrid Facility, UK)
  • 58 T2s spread over 4 continents
    • T2s in Germany – GSI and Muenster
• As of today the ALICE share is some 7000 (out of ~30000 total) CPUs and 1.5 PB of distributed storage
  • In ½ year: ~15K CPUs, x2 storage
The ALICE Grid history
• First AliEn prototype in 2002
  • Vertical (full) Grid implementation
  • Some 15 sites, MC production, storage at a single site
• 2003-2005 – development of various Grid interfaces for AliEn
  • Horizontal (central services + site services) implementation
  • Some 30 sites, MC production, storage (still) at a single site
  • There were interfaces to many (raw) storage systems, but no single client library support
The ALICE Grid history (2)
• 2006-2008 – refinement of services, Grid sites buildup, AliEn Catalogue updates, xrootd as a single supported I/O protocol, user analysis
  • Majority of sites integrated (4-6 more expected)
  • Standard high-volume MC production
  • Central services in full production regime
  • Rapid deployment (and use) of storage with xrootd support
    • Standard LCG SEs (DPM, CASTOR2, dCache) and xrootd as pool manager
  • User analysis on the Grid
    • Not as bad as everyone expected
The ALICE Grid Map
(world map of sites across Europe, Asia, North America and Africa)
Here is the live picture
Operation
• All sites provide resources through the WLCG gateways …or directly
• And software: gLite (EGEE), OSG (US), ARC (NDGF), local batch systems
• A (very short) hint of the existing complexity (only for workload management)
  • gLite: edg-job-submit, glite-wms-job-submit (…RB, CE, cancel, check, etc…)
  • ARC: ngsub (…cluster, type, etc…)
  • Local: bsub, qsub (well known to many)
  • OSG: globus-job-run (…cluster, proxy type, etc…)
• All of the above is replaced by AliEn ‘submit’
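The replacement of all those backend-specific commands by a single ‘submit’ verb boils down to a dispatch table. A minimal sketch: the middleware command names are the ones quoted above, but the `BACKENDS` table and `submit` function are an invented illustration, not AliEn code.

```python
# Illustrative dispatch from one "submit" verb to the backend-specific
# submission commands. The CLI names are the real middleware tools
# mentioned in the slide; the dispatcher itself is hypothetical.

BACKENDS = {
    "glite": ["glite-wms-job-submit", "-a"],  # -a: automatic proxy delegation
    "arc":   ["ngsub"],
    "osg":   ["globus-job-run"],
    "lsf":   ["bsub"],
    "pbs":   ["qsub"],
}

def submit(jdl_file, backend):
    """Return the command line a site service would invoke for this backend."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend}")
    return BACKENDS[backend] + [jdl_file]

print(submit("job.jdl", "glite"))  # ['glite-wms-job-submit', '-a', 'job.jdl']
print(submit("job.jdl", "pbs"))    # ['qsub', 'job.jdl']
```

The user types one command; the site service picks the right row of the table, which is why new middleware flavours can be added without touching the user interface.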
Operation (2)
• Services schema (diagram): central AliEn services connect to one VO-box per site; each site VO-box drives the local WMS (gLite/ARC/OSG/local) and SM (dCache/DPM/CASTOR/xrootd), plus monitoring and package management
• The VO-box system (very controversial in the beginning)
  • Has been extensively tested
  • Allows for site services scaling
  • Is a simple isolation layer for the VO in case of troubles
Operation – central/site support
• Central services support (2 FTE equivalent)
  • There are no experts who do support exclusively – there are 7 highly-qualified experts doing development/support
• Site services support – handled by ‘regional experts’ (one per country) in collaboration with local cluster administrators
  • Extremely important part of the system
  • In normal operation ~0.2 FTE/region
• Regular discussions with everybody and active all-activities mailing lists
Operation – critical elements (2)
• Central services, VO-boxes, storage servers – capable of running 24/7, maintenance free
  • Get the best hardware money can buy
• Multiple service instances in failover configuration
  • AliEn services are enabled for this
• Use of ‘load balanced’ DNS aliases for service endpoints
  • Load-balancer is external to the system
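From the client's point of view, failover across multiple service instances behind an alias can be sketched as "try endpoints until one answers". A minimal sketch: the endpoint names and the health probe are invented, and a real deployment resolves a round-robin DNS alias instead of carrying an explicit list.

```python
# Sketch of client-side failover across redundant service instances.
# Endpoint names and the probe are invented for the example; in the
# real setup a load-balanced DNS alias hides the instance list.

def first_alive(endpoints, is_alive):
    """Return the first endpoint that answers a health probe."""
    for ep in endpoints:
        if is_alive(ep):
            return ep
    raise RuntimeError("no service instance reachable")

endpoints = ["api01.example.org", "api02.example.org", "api03.example.org"]
down = {"api01.example.org"}           # pretend the first instance is offline
probe = lambda ep: ep not in down

print(first_alive(endpoints, probe))   # api02.example.org
```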
Operation – critical elements (3)
• Monitoring
  • Fast and detailed – AliEn command line interface
  • History – MonALISA
  • There is never enough monitoring, but if not careful, it can saturate the system (and the expert)
• Automatic tools for production and data management
  • Lightweight Production Manager – for MC and RAW data production
  • Lightweight Transfer Manager – for data replication (sparse storage resources)
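The core decision of such a lightweight transfer manager can be sketched as: for each file, pick a storage element that does not yet hold a replica, preferring the one with the most free space. This is an invented illustration of the idea, not the actual tool; the SE names and free-space figures are made up.

```python
# Illustrative replica-placement decision for a lightweight transfer
# manager. All names and numbers are invented for the example.

def pick_target(replica_ses, se_free_space):
    """Choose the SE with the most free space among those lacking a replica.

    replica_ses   -- set of SEs already holding the file
    se_free_space -- mapping of SE name -> free space
    """
    candidates = {se: free for se, free in se_free_space.items()
                  if se not in replica_ses}
    if not candidates:
        return None  # file is already everywhere we know about
    return max(candidates, key=candidates.get)

se_free = {"CERN": 120, "GSI": 300, "CNAF": 80}   # free TB, invented
print(pick_target({"CERN"}, se_free))             # GSI
```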
Status of PDC’07
Central services setup
All started with a bunch of boxes…
(diagram of the central machines at Alien.cern.ch:)
• Linux 32- and 64-bit build servers, MacOS build server
• MonALISA repository
• 3 TB xrootd disk servers – application software, conditions data, user files
• AliEn services – Proxy, Authen, JobBroker, JobOptimizer, TransferOptimizer, etc.
• MySQL DB (replicated) for AliEn Catalogue and Task Queue
• API servers
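The JobBroker's central task is matching waiting jobs in the Task Queue against what sites offer. A minimal sketch of that matching step, with field names invented for illustration (real AliEn JDLs are far richer than this):

```python
# Hypothetical sketch of broker-style matchmaking between a job's
# requirements and a site's capabilities. Field names are invented.

def matches(job_req, site_caps):
    """A site can run the job if it offers every required software
    package and at least the requested memory."""
    return (set(job_req["packages"]) <= set(site_caps["packages"])
            and site_caps["memory_mb"] >= job_req["memory_mb"])

job    = {"packages": ["AliRoot-v4"], "memory_mb": 2000}
site_a = {"packages": ["AliRoot-v4", "ROOT-5"], "memory_mb": 4000}
site_b = {"packages": ["ROOT-5"], "memory_mb": 8000}

print(matches(job, site_a))  # True
print(matches(job, site_b))  # False
```

In the real system the matching is done centrally, so a site only ever pulls work it is actually able to run.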
Central services upgrade
Versatile expertise required!
Running profile – all
(plot of concurrently running jobs over time; annotations mark the last ‘intrusive’ AliEn update and gLite 3.1)
Running profile – users
(plot of running user jobs over time)
• Upward slope, but slow
Sites contribution
50% resources contribution from T2s!
Harnessing the power of small computing centres is a must
GRID waiting times
(plot of a standard user job distribution: time waiting in the queue vs running time)
Current activity in ALICE
• A substantial part of the resources is used to produce Monte-Carlo data for physics and detector studies
  • An increasing number of physicists use the Grid in their daily work
• Since December 2007 the ALICE detector is being commissioned with a cosmic ray trigger
  • Reconstruction and analysis of these data is ongoing
• Ramping up of CPU and storage capacity in preparation for the LHC startup – expected in summer 2008
Summary
• The ALICE Grid is getting ready for LHC data production and analysis
  • It took 6 ‘short’ years to get there
• The main system components have been battle-hardened
  • Development and simultaneous heavy use
• The ‘single interface to the Grid’ is a must
  • Otherwise the Grid will be limited to a ‘selected few’
• Integration of computing resources into a coherent working system takes a lot of time and effort
  • This is a combined effort between the site and Grid experts, which has to be repeated n times (n = number of sites)
  • A trust relation depends not only on the high quality of the Grid software
Summary (2)
• It is never too early to start user analysis on the Grid
  • It takes time to ‘convert’ users from local to global reality
  • And the conversion is not without pain
• The experts should have time to learn how to do Grid user support