Tony Doyle
“Status Report”, GridPP6 Collab. Meeting, Cosener’s House, 31 January 2003
Tony Doyle - University of Glasgow
Outline
• Historical Perspective?
• “Prototype to Production”
The case so far…
• Timelines…
• Presentation to PPARC Science Committee (26/11/02):
– Questions from Science Committee
– Motivation
– Overview
– How Does the Grid Work?
– Middleware Development
– Testbed Status
– Is the Middleware Robust?
– LHC Computing Grid
– Applications
– Tier-1 and -2 Centre Resources
– Dissemination
– Achievements
– Timeline
– UK Grid Future Priorities
– Roadmap: Not a Bid (Preliminary Input)
Questions from Science Committee
1. What has been achieved in the first year?
• See Oversight Committee (3): Executive Summary
2. How are the finances being spent?
• Broken down into 5 areas
3. Development in the medium-long term and resources required?
• GridPP I (09/01-08/04): Prototype, short term
• GridPP II (09/04-08/07): Production, medium term
• (09/07-): Exploitation, long term
Rare Phenomena – Huge Background
[Figure: rates of all interactions compared with the Higgs, 9 orders of magnitude apart]
Executive Summary
• Introduction: the Grid is starting to work
• Project Management: under control via Project Map
• Resources: Small modifications
• CERN: Manpower OK. Hardware
• DataGrid: Making impact
• Applications: Engaged (value added)
• Tier-1/A: Tier-A production mode
• Tier-2: Latent resources
• Dissemination: UK flagship project
• Future Funding: Preliminary planning
Ref: PMB-13-EXEC
Overview
£17m 3-year project funded by PPARC through the e-Science Programme
• EDG - UK Contributions: Architecture, Testbed-1, Network Monitoring, Certificates & Security, Storage Element, R-GMA, LCFG, MDS deployment, GridSite, SlashGrid, Spitfire…
• Applications (start-up phase): BaBar, CDF/D0 (SAM), ATLAS/LHCb, CMS, (ALICE), UKQCD
• CERN - LCG (start-up phase): funding for staff and hardware...
[Chart: project components CERN, DataGrid, Tier-1/A, Applications, Operations]
http://www.gridpp.ac.uk
£17m++ 3-Year Project
• Five components
– Tier-1/A = Hardware + CLRC e-Science Staff
– DataGrid = 15 DataGrid Posts + CLRC PPD Staff
– Applications = 13 Experiments Posts (to interface middleware)
– Operations = Travel (~100 people)+ Management + Early Investment
– CERN = 25 LCG posts + Tier-0 + LTA
[Pie chart, 6/Oct/2002. Slice values: £3.79m, £5.67m, £3.67m, £2.08m, £1.79m. Slice labels: CERN, DataGrid, Tier-1/A, Applications, Operations]
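As a quick consistency check on the budget figures quoted on this slide, the five slice values can be summed and compared with the £17m headline. This is a minimal sketch that works in integer units of £0.01m to avoid floating-point rounding. The slide text does not say which value belongs to which component, so no mapping is assumed.

```python
# Slice values from the 6/Oct/2002 pie chart, in units of £0.01m.
# The value-to-component mapping is not given in the slide text,
# so the values are kept as an unlabelled list.
slices_hundredths = [379, 567, 367, 208, 179]

# Sum in integer units, then convert back to millions for display.
total_hundredths = sum(slices_hundredths)
total_millions = total_hundredths / 100

print(f"Total: £{total_millions:.2f}m")
```

The slices add up to the £17m project total stated on the Overview slide.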
Project Management - 7 Elements
Year 0, Year 1
The Project has now completed one year….
How Does the Grid Work?
0. Web User Interface…
1. Authentication: grid-proxy-init
2. Job submission: dg-job-submit
3. Monitoring and control: dg-job-status, dg-job-cancel, dg-job-get-output
4. Data publication and replication: globus-url-copy, GDMP
5. Resource scheduling – use of Mass Storage Systems: JDL, sandboxes, storage elements
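The command sequence on this slide can be sketched as a small script that writes a job description and lists the commands a user would run in order. This is an illustrative sketch only. The JDL attribute names (Executable, StdOutput, OutputSandbox) and the argument conventions (a JDL file path passed to dg-job-submit, a job identifier passed to the status and output commands) are assumptions not taken from the slides. The file name and job identifier are hypothetical placeholders, and nothing is executed.

```python
from typing import Dict, List


def render_jdl(attributes: Dict[str, object]) -> str:
    """Render a flat attribute dict as 'Name = value;' JDL lines.

    Lists become brace-delimited lists of quoted strings; everything
    else is emitted as a single quoted string.
    """
    lines = []
    for name, value in attributes.items():
        if isinstance(value, (list, tuple)):
            rendered = "{" + ", ".join(f'"{item}"' for item in value) + "}"
        else:
            rendered = f'"{value}"'
        lines.append(f"{name} = {rendered};")
    return "\n".join(lines)


def workflow_commands(jdl_path: str, job_id: str) -> List[List[str]]:
    """Return the slide's command sequence as argv lists (not executed)."""
    return [
        ["grid-proxy-init"],              # 1. Authentication
        ["dg-job-submit", jdl_path],      # 2. Job submission
        ["dg-job-status", job_id],        # 3. Monitoring
        ["dg-job-get-output", job_id],    # 3. Retrieve the output sandbox
    ]


# Hypothetical job: run /bin/hostname and bring back its standard output.
jdl_text = render_jdl({
    "Executable": "/bin/hostname",
    "StdOutput": "hostname.out",
    "OutputSandbox": ["hostname.out"],
})
print(jdl_text)

# "<job-id>" stands in for the identifier returned at submission time.
for argv in workflow_commands("hostname.jdl", "<job-id>"):
    print(" ".join(argv))
```

Keeping the commands as argv lists rather than shell strings makes it straightforward to hand them to a process runner later without quoting problems.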
Middleware Development
Testbed Status
Is the Middleware Robust?
1. Code Base
2. Software Evaluation Process
3. Testbed Infrastructure: WP-specific, Development, Certification, Application.
4. Code Development Platforms
LHC Computing Grid
• Grid deployment project
• Not grid development
• Establishes the global computing infrastructure
• Allows all participating physicists to exploit LHC data
• Fosters and develops the required collaboration between
– LHC experiments
– peer computing centres
– middleware providers
• Based at CERN
• Started March 2002
LHC Computing Grid
• Solid programme established
• Agreed between and supported by all LHC experiments
• LHC experiments contributing to development of common software (e.g. new persistency solution)
• First LHC global grid service for deployment in July 2003
• Basis for a generic Science Grid Infrastructure
POOL
[Diagram: POOL Persistency Framework (C++). Services: PersistencyMgr, StorageMgr, CacheMgr, StreamerSvc, DictionarySvc, PlacementSvc, FileCatalog. Interfaces: IReflection, IPReflection, ICnv, IReadWrite, IPers, IFCatalog, IPlacement]
Application Interfaces
• Client Applications: Web, Python codes, Java codes, Command line, D0 Framework C++ codes; Request Formulator and Planner
• Collective Services: Catalog protocols, Significant Event Logger, Naming Service, Database Manager, Catalog Manager, SAM Resource Management, Batch Systems - LSF, FBS, PBS, Condor, Data Mover, Job Services, Storage Manager, Job Manager, Cache Manager, Request Manager, “Dataset Editor”, “File Storage Server”, “Project Master”, “Station Master”, “Stager”, “Optimiser”
• Connectivity and Resource: CORBA, UDP, File transfer protocols - ftp, bbftp, rcp, GridFTP; Mass Storage systems protocols e.g. encp, hpss
• Authentication and Security: GSI, SAM-specific user, group, node, station registration, Bbftp ‘cookie’
• Fabric: Tape Storage Elements, Disk Storage Elements, Compute Elements, LANs and WANs, Resource and Services Catalog, Replica Catalog, Meta-data Catalog, Code Repository
Name in “quotes” is SAM-given software component name
Indicates component that will be replaced, added or enhanced using PPDG and Grid tools
Tier-1/A: Year of Growth
[Chart: CSF Linux Weekly CPU Use, in normalised P450 CPU hours (axis 0 to 140000), weekly from 01/01/2001 to 23/09/2002]
CSF Linux Accounts: Since April 2002
delphiuk 4%
lhcb 1%
atlas 13%
antaresw 5%
theory 14%
cms 2%
h1 8%
bfactory 46%
sno 3%
zeus 4%
[Chart: GridPP Certificates, Personal and Server, Oct-01 to Aug-02 (axis 0 to 140)]
BaBar Use
Tier-2 Web-Based Monitoring
ScotGrid reached its 100,000th processing hour on Wednesday 13th November 2002.
Tier-1 and -2 Centre Resources
Estimated resources at end of 2003 (from Institute returns)
Region      Number of CPUs   Amount of Disk (TB)   Amount of Tape (TB)
London      1252             76                    8
Northern    2015             113                   33
Southern    392              29                    20
ScotGrid    154              35                    0
Experiment  Number of CPUs (share)   Amount of Disk in TB (share)
BaBar       1108 (29%)               76 (30%)
CDF         453 (12%)                39 (15%)
D0          320 (8%)                 8 (3%)
ATLAS       947 (25%)                72 (28%)
CMS         304 (8%)                 9 (4%)
LHCb        543 (14%)                40 (16%)
[Chart: Shared Distributed Resources 2003. Legend: BaBar, CDF, D0, ATLAS, CMS, LHCb, Tier-1]
• Tier-1: 600 CPUs + 150 TB
• Tier-2: (4000 CPUs + 200 TB)
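The percentage shares quoted per experiment can be reproduced from the regional figures. The sketch below assumes that each share is the experiment's allocation divided by the sum over the four regions (London, Northern, Southern, ScotGrid). That denominator is an inference, not something the slide states. Under that assumption, rounding to the nearest percent matches the quoted values.

```python
# Estimated (CPUs, disk in TB) per region at end of 2003, from the slide.
regions = {
    "London": (1252, 76),
    "Northern": (2015, 113),
    "Southern": (392, 29),
    "ScotGrid": (154, 35),
}

# (CPUs, disk in TB) per experiment, from the slide.
experiments = {
    "BaBar": (1108, 76),
    "CDF": (453, 39),
    "D0": (320, 8),
    "ATLAS": (947, 72),
    "CMS": (304, 9),
    "LHCb": (543, 40),
}

# Assumed denominators: totals over the four regions.
total_cpus = sum(cpus for cpus, _ in regions.values())
total_disk = sum(disk for _, disk in regions.values())

# Share of each resource per experiment, rounded to whole percent.
shares = {
    name: (round(100 * cpus / total_cpus), round(100 * disk / total_disk))
    for name, (cpus, disk) in experiments.items()
}

for name, (cpu_pct, disk_pct) in shares.items():
    print(f"{name:6s} CPUs {cpu_pct:3d}%  disk {disk_pct:3d}%")
```

The six experiment allocations do not sum to the regional totals, which suggests that some resources are assigned to other users not listed on the slide.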
Dissemination: e-Science and Web
e-Science ‘All Hands’ Meeting held at Sheffield, 2-4 September 2002
– ~ 300 people in total
– ~ 19 GridPP People attended
– ~ 13 GridPP ‘Abstracts’ accepted (total ~100)
– ~ 10 GridPP Posters displayed
– 4 GridPP Invited talks
– 3 GridPP Demonstrations
GridPP Web Page Requests:
25,000 per month
~ ½ non-UK
.com (commercial)
Dissemination: Posters
ATLAS, SAM, OptorSim, GridPP, Tier-1/A, ScotGrid, BaBar, LHCb, CMS, Storage
10 Posters for NeSC Opening and e-Science All Hands Meeting
Dissemination: Demonstrations
• Super Computing 2002 Baltimore, US
• Major event last week
• GridPP participated in three successful Worldwide demos
– WorldGrid
– Replica Location Service
– SAMGrid
• These and other Web-based demos available online from http://www.gridpp.ac.uk/demos/
Achievements I
1. Dedicated people actively developing a Grid
2. All with personal certificates
3. Using the largest UK grid testbed (16 sites and more than 100 servers)
4. Deployed within EU-wide programme
5. Linked to Worldwide Grid testbeds
Achievements II
6. Grid Deployment Programme Defined: The Basis for LHC Computing
7. Active Tier-1/A Production Centre meeting International Requirements
8. Latent Tier-2 resources being monitored
9. Significant middleware development programme
10. First simple applications using the Grid testbed (open approach)
From Prototype (hundreds) to Production (thousands) Grid..
[Chart: GridPP Certs., Personal and Server, Oct-01 to Aug-02 (axis 0 to 140), set on a 2001 to 2005 timeline. Annotation: General deployment of e-Science methods…]
Timeline
[Gantt chart, 2001 to 2005 by quarter (Q1 to Q4). Rows: DataGrid; GridPP - Procure, Install, Compute, Data; Develop, Test, Refine; Initial Grid Tests; Worldwide Grid Demonstrations; Middleware and Hardware upgrades; LHC Computing Grid; Grid Service; EGEE (DataGrid II?); GridPP II; Transition and Planning Phase…. Phases: Prototypes, then Production]
UK Grid Future Priorities
1. Tier-1/A staff
2. Tier-1/A hardware
3. Tier-2 staff
4. Applications staff
5. Middleware development
6. Tier-2 hardware
7. CERN staff
8. CERN hardware
• Preliminary assessment phase…
ALL of these are required to address the LHC Computing Challenge
Roadmap: Not a Bid
Preliminary Input
[Pie chart, GridPP II era. Slice values as extracted: 8%, 13%, 8%, 6%, 3%, 1%, 5%, 7%, 30%, 8%, 6%, 3%, 2%. Legend as extracted: Tier-1 Hardware, Tier-1 Staff, Tier-2 Hardware, Tier-2 Staff, Tier-2 Staff (Inst.), CERN Hardware, CERN Staff, App. Integration, App. Development, Middleware, Middleware (EGEE), Managers, Travel. Groupings: Applications, CERN, Tier-2, Tier-1, Ops., Middleware]
[Timeline, 2003 to 2008: GridPP I, GridPP II, Exploitation]
UK Grid Future Priorities
1. Tier-1/A staff – National Grid Centre
2. Tier-1/A hardware – International Role
3. Tier-2 staff – UK e-Science Grid
4. Applications – (1) Grid-enablement, (2) Applications development
5. Middleware – EU-wide development
6. Tier-2 hardware – non-PPARC funding
7. CERN staff – UK pro-rata contribution
8. CERN hardware – pro-rata contribution
• Preliminary assessment phase…
ALL of these are required to address the LHC Computing Challenge
History
• Teaches us nothing?
• Review latest status
• There was ~nothing 18 months ago…
Application Testbed Resources
• Since Last Year:
– Improved software (EDG 1.4.3).
– Doubled sites. More waiting…
   • Australia, Taiwan, USA (U. Wisc.), UK Sites, CrossGrid, …
– Significantly more CPU/Storage.
• Hidden Infrastructure
– MDS Hierarchy
– Resource Brokers
– User Interfaces
– VO Replica Catalogs
– VO Membership Servers
– Certification Authorities
Site Country CPUs Storage
CC-IN2P3* FR 620 192 GB
CERN* CH 138 1321 GB
CNAF* IT 48 1300 GB
Ecole Poly. FR 6 220 GB
Imperial Coll. UK 92 450 GB
Liverpool UK 2 10 GB
Manchester UK 9 15 GB
NIKHEF* NL 142 433 GB
Oxford UK 1 30 GB
Padova IT 11 666 GB
RAL* UK 6 332 GB
SARA NL 0 254 GB
TOTAL 5 1075 5223 GB
History
Version Date
1.1.2 27 Feb 2002
1.1.3 02 Apr 2002
1.1.4 04 Apr 2002
1.2.a1 11 Apr 2002
1.2.b1 31 May 2002
1.2.0 12 Aug 2002
1.2.1 04 Sep 2002
1.2.2 09 Sep 2002
1.2.3 25 Oct 2002
1.3.0 08 Nov 2002
1.3.1 19 Nov 2002
1.3.2 20 Nov 2002
1.3.3 21 Nov 2002
1.3.4 25 Nov 2002
1.4.0 06 Dec 2002
1.4.1 07 Jan 2003
1.4.2 09 Jan 2003
1.4.3 14 Jan 2003
RC Changes
Mixed Globus 2.0/2.2
RB/JSS Upgrade
Known Problems:
• GASS Cache Coherency
• Race Conditions in Gatekeeper
• Unstable MDS
Successes
• Improved MDS Stability
• FTP Transfers OK
Known Problems:
• Interactions with RC
Intense Use by Applications!
Limitations:
• Resource Exhaustion
• Size of Logical Collections
Successes
• Matchmaking/Job Mgt.
• Basic Data Mgt.
Known Problems:
• High Rate Submissions
• Long FTP Transfers
History - relating applications work to TB versions
Real Use by Applications!
Limitations:
• Resource Exhaustion
• Size of Logical Collections
Successes
• Matchmaking/Job Mgt.
• Basic Data Mgt.
Known Problems:
• High Rate Submissions
• Long FTP Transfers
ATLAS commence phase1 tests
CMS start stress tests Nov 30, which continue till Dec 20
• Problems with long jobs
• Instability in MDS
• Long file transfers unreliable
CMS and ATLAS evaluate 1.4.3
EDG reasons of failure (categories): Preliminary analysis
CMKIN jobs (Short jobs)
Status Totals
Crashed jobs 818
Reasons of Failure for Crashed jobs
No matching resource found 509
Generic Failure: MyProxyServer not found in JDL expr. 102
Failure while executing job wrapper 37
Other failures 33
CMSIM jobs (Long jobs)
Status Totals
Crashed jobs 2662
Reasons of Failure for Crashed jobs
Failure while executing job wrapper 1476
No matching resource found 722
Globus Failure: Globus down/Submit to globus failed 144
Running 116
Globus Failure 90
Other failures 114
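The failure categories can be turned into fractions of the crashed total to see which reason dominates. This is a minimal sketch using the CMSIM figures, whose categories sum exactly to the 2662 crashed jobs. The CMKIN categories listed on the slide account for only 681 of its 818 crashed jobs, so they are left out here. Variable names are illustrative.

```python
# CMSIM crashed-job totals and reasons, as listed on the slide.
cmsim_crashed_total = 2662
cmsim_reasons = {
    "Failure while executing job wrapper": 1476,
    "No matching resource found": 722,
    "Globus Failure: Globus down/Submit to globus failed": 144,
    "Running": 116,
    "Globus Failure": 90,
    "Other failures": 114,
}

# Check that the listed categories cover every crashed job.
accounted = sum(cmsim_reasons.values())

# Fraction of crashed jobs attributed to each reason.
fractions = {
    reason: count / cmsim_crashed_total
    for reason, count in cmsim_reasons.items()
}

# Print the reasons from most to least frequent.
for reason, fraction in sorted(
    fractions.items(), key=lambda item: item[1], reverse=True
):
    print(f"{fraction:6.1%}  {reason}")
```

Job-wrapper failures make up more than half of the CMSIM crashes, with resource matchmaking the second-largest category.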
EDG Release Plan on Application Testbed
• Port to RH 7.3 & LCFGng January
• Upgrade to globus 2.2.4 & condor 6.4.6 February
• R-GMA
• RLS
• Storage Element
• NetworkCost Suite
• Replica Mgmt Services March & April
• LCAS with dynamic plug-in modules
• New Resource Broker + GLUE schema
• GridFTP access to CASTOR
• VOMS
https://edms.cern.ch/document/333297
LCG defines components for first release (not just EDG)
Deployed May 2003
The Road Ahead?
• Plan for success in middleware development…
• http://www.roadahead.com/
• First two links are broken..
• Define the Production Phase as GridPP II (Sep 04…)
GridPP Timeline…
• 29/10/02 CB ~outline approval for GridPP2 (“Prototype to Production”) and EGEE (“Grid = Network++”)
• Dec 02 EU Call for FP6 (EGEE)
• Jan 03 GridPP Tier-2 tenders
• Four ½ posts to be assigned to regional management (Testbed: Grid enablement of Hardware)
• Feb 03 PPARC Call for e-Science proposals
• 19/02/03 CB review of GridPP II plans
• Apr 03 EGEE Proposal submission
• ** Tier-1 and Tier-2 Centres not yet defined (work with prototypes)**
• End-May 03 GridPP II Proposal
• Sep 03 Approval? (Limited) Tier-2 funding starts
• Dec 03 DataGrid funding ends
• ~Jan 04 DataGrid’ as part of EGEE?
• Sep 04 Start of GridPP II Production Phase…