GridPP Deployment Status, Steve Traylen (s.traylen@rl.ac.uk), 28th October 2004, GOSC Face to Face, NESC, UK


Page 1

GridPP Deployment Status

Steve Traylen (s.traylen@rl.ac.uk)

28th October 2004, GOSC Face to Face, NESC, UK

Page 2

Contents

• Middleware components of the GridPP Production System

• Status of the current operational Grid

• Future plans and challenges

• Summary

• GridPP 2 – From Prototype to Production

Page 3

The four LHC experiments: CMS, LHCb, ATLAS, ALICE

1 Megabyte (1 MB): a digital photo

1 Gigabyte (1 GB) = 1000 MB: a DVD movie

1 Terabyte (1 TB) = 1000 GB: world annual book production

1 Petabyte (1 PB) = 1000 TB: annual data production of one LHC experiment

1 Exabyte (1 EB) = 1000 PB: world annual information production


The physics driver

• 40 million collisions per second

• After filtering, 100-200 collisions of interest per second

• 1-10 Megabytes of data digitised for each collision = a recording rate of 0.1-1 Gigabytes/sec

• 10^10 collisions recorded each year = ~10 Petabytes/year of data

The LHC
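The quoted rates can be sanity-checked with a line of arithmetic. The sketch below simply recomputes them from the slide's own inputs, taking the per-collision size at the low end of the 1-10 MB range so the yearly total matches the ~10 PB figure:

```python
# Back-of-the-envelope check of the LHC data rates quoted above.

kept_per_sec = 150          # ~100-200 collisions of interest per second
mb_per_collision = 1        # low end of the 1-10 MB range

rate_gb_per_sec = kept_per_sec * mb_per_collision / 1000
print(f"recording rate ~ {rate_gb_per_sec:.2f} GB/s")   # within 0.1-1 GB/s

collisions_per_year = 1e10
pb_per_year = collisions_per_year * mb_per_collision / 1e9
print(f"yearly volume ~ {pb_per_year:.0f} PB")          # ~10 PB/year
```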

Page 4

The UK response

GridPP – A UK Computing Grid for Particle Physics

19 UK Universities, CCLRC (RAL & Daresbury) and CERN

Funded by the Particle Physics and Astronomy Research Council (PPARC)

GridPP1 – Sept. 2001-2004, £17m, "From Web to Grid"

GridPP2 – Sept. 2004-2007, £16(+1)m, "From Prototype to Production"

Page 5

Current context of GridPP

UK Core e-Science Programme

Institutes

Tier-2 Centres

CERN LCG

EGEE

GridPP Tier-1/A

Middleware, Security, Networking

Experiments

Grid Support Centre

Not to scale!

Apps Dev

Apps Int

GridPP

Page 6

Our grid is working …

NorthGrid: Daresbury, Lancaster, Liverpool, Manchester, Sheffield

SouthGrid: Birmingham, Bristol, Cambridge, Oxford, RAL PPD, Warwick

ScotGrid: Durham, Edinburgh, Glasgow

LondonGrid: Brunel, Imperial, QMUL, RHUL, UCL

Page 7

… and is part of LCG

Resources are being used for data challenges

• Within the UK we have some VO/experiment Memoranda of Understanding in place

• Tier-2 structure is working well

Page 8

Scale

• GridPP prototype Grid: > 1,000 CPUs

– 500 CPUs at the Tier-1 at RAL

– > 500 CPUs at 11 sites across the UK, organised in 4 Regional Tier-2s

• > 500 TB of storage

• > 800 simultaneous jobs

• Integrated with the international LHC Computing Grid (LCG):

– > 5,000 CPUs

– > 4,000 TB of storage

– > 85 sites around the world

– > 4,000 simultaneous jobs

– monitored via the Grid Operations Centre (RAL)

|       | CPUs | Free CPUs | Run Jobs | Wait Jobs | Avail TB | Used TB | Max CPU | Ave. CPU |
|-------|------|-----------|----------|-----------|----------|---------|---------|----------|
| Total | 8049 | 1439      | 6610     | 8733      | 3558.47  | 1273.86 | 9148    | 6198     |

http://goc.grid.sinica.edu.tw/gstat/

(hyperthreading enabled at some sites)

Page 9

Operational status (October)

| #  | Site               | Version Sept | Version Oct | Total CPU Sept | Total CPU Oct | SE avail TB Sept | SE avail TB Oct | Max CPU Sept | Max CPU Oct | Avg CPU Sept | Avg CPU Oct |
|----|--------------------|--------------|-------------|----------------|---------------|------------------|-----------------|--------------|-------------|--------------|-------------|
| 1  | CAVENDISH-LCG2     | LCG-2_1_1    | LCG-2_1_1   | 11             | 11            | 1.59             | 0.01            | 11           | 11          | 10           | 10          |
| 2  | IC-LCG2            | LCG-2_2_0    | LCG-2_2_0   | 66             | 66            | 0.4              | 0.4             | 66           | 66          | 63           | 65          |
| 3  | RAL-LCG2           | LCG-2_2_0    | LCG-2_2_0   | 494            | 482           | 934.13           | 937.76          | 494          | 494         | 482          | 469         |
| 4  | BHAM-LCG2          | LCG-2_2_0    | LCG-2_2_0   | 20             | 22            | 0.03             | 0.03            | 20           | 22          | 18           | 20          |
| 5  | BitLab-LCG2        | LCG-2_2_0    | na          | 1              | 0             | 0.09             | 0               | 1            | 1           | 0            | 0           |
| 6  | Lancs-LCG2         | LCG-2_1_1    | LCG-2_1_1   | 24             | 26            | 1.85             | 1.85            | 27           | 26          | 23           | 22          |
| 7  | LivHEP-LCG2        | LCG-2_1_1    | LCG-2_1_1   | 94             | 113           | 0.09             | 0.09            | 94           | 113         | 51           | 96          |
| 8  | ManHEP-LCG2        | LCG-2_1_1    | LCG-2_2_0   | 98             | 98            | 0.09             | 0.09            | 98           | 98          | 97           | 93          |
| 9  | OXFORD-01-LCG2     | LCG-2_1_1    | LCG-2_1_1   | 72             | 32            | 1.49             | 1.48            | 72           | 72          | 69           | 39          |
| 10 | QMUL-eScience      | LCG-2_1_0    | LCG-2_1_0   | 576            | 566           | 0.09             | 0.1             | 576          | 576         | 563          | 565         |
| 11 | RALPP-LCG          | LCG-2_2_0    | LCG-2_2_0   | 88             | 88            | 0.13             | 0.26            | 88           | 88          | 52           | 82          |
| 12 | RHUL-LCG2          | LCG-2_1_1    | LCG-2_1_1   | 144            | 144           | 5.43             | 8.16            | 148          | 148         | 90           | 114         |
| 13 | ScotGRID-Edinburgh | LCG-2_0_0    | LCG-2_0_0   | 1              | 1             | 4.76             | 0.87            | 1            | 1           | 0            | 0           |
| 14 | scotgrid-gla       | LCG-2_2_0    | LCG-2_2_0   | 2              | 3             | 0                | 0.09            | 2            | 3           | 1            | 2           |
| 15 | SHEFFIELD-LCG2     | LCG-2_1_1    | LCG-2_1_1   | 39             | 0             | 0.42             | 0.42            | 43           | 39          | 36           | 22          |
| 16 | UCL-CCC            | LCG-2_2_0    | LCG-2_2_0   | 340            | 356           | 0.49             | 0.49            | 356          | 356         | 299          | 352         |
| 17 | UCL-HEP            | LCG-2_1_1    | LCG-2_1_1   | 76             | 76            | 1.85             | 1.85            | 76           | 76          | 64           | 54          |

Page 10

VOs active

VOs active per site (from the set alice, atlas, babar, cms, dteam, dzero, hone, lhcb, sixt, zeus):

| #  | Site               | VOs active |
|----|--------------------|------------|
| 1  | CAVENDISH-LCG2     | 5          |
| 2  | IC-LCG2            | 6          |
| 3  | RAL-LCG2           | 9          |
| 4  | BHAM-LCG2          | 6          |
| 5  | BitLab-LCG2        | 5          |
| 6  | Lancs-LCG2         | 5          |
| 7  | LivHEP-LCG2        | 4          |
| 8  | ManHEP-LCG2        | 7          |
| 9  | OXFORD-01-LCG2     | 5          |
| 10 | QMUL-eScience      | 7          |
| 11 | RALPP-LCG          | 9          |
| 12 | RHUL-LCG2          | 6          |
| 13 | ScotGRID-Edinburgh | 3          |
| 14 | scotgrid-gla       | 4          |
| 15 | SHEFFIELD-LCG2     | 6          |
| 16 | UCL-CCC            | 6          |
| 17 | UCL-HEP            | 2          |

Sites per VO: alice 13, atlas 17, babar 5, cms 13, dteam 17, dzero 3, hone 2, lhcb 16, sixt 6, zeus 3

Page 11

Who is directly involved?

General

| Number | Position                 | Status                                  |
|--------|--------------------------|-----------------------------------------|
| 1      | Production manager       | In place and engaged                    |
| 1      | Applications expert      | Identified but not formally engaged     |
| 2      | Tier-1/deployment expert | In place and fully engaged              |
| 4      | Tier-2 coordinators      | In place and functioning well           |
| 0.5    | VO management            | Will be part time but not yet in place  |
| 9.0    | Hardware support         | Posts allocated but not yet filled      |

Specialist

| Number | Position                    | Status                 |
|--------|-----------------------------|------------------------|
| 1      | Data and storage management | Existing expert        |
| 1      | Workload management         | Existing expert        |
| 1      | Security officer            | Not yet recruited      |
| 1      | Networking                  | Starting in September  |

Page 12

Past upgrade experience at RAL

[Chart: CSF Linux CPU use 2001-02 – monthly CPU usage from Jan 2001 to Aug 2002]

Previously utilisation of new resources grew steadily over weeks or months.

Page 13

Tier-1 update 27-28th July 2004

Hardware Upgrade

With the Grid we see a much more rapid utilisation of newly deployed resources.

Page 14

The infrastructure developed in EDG/GridPP1

Job submission: Python (default), Java GUI, APIs (C++, Java, Python)

User Interface (UI): JDL

Resource Broker (C++ Condor MM libraries, Condor-G for submission)

Berkeley Database Information Index

Replica catalogue per VO (or equivalent)

AA server (VOMS)

Logging & Bookkeeping: MySQL DB stores job state info

Computing Element: Gatekeeper (PBS scheduler), batch workers

Storage Element: GridFTP server; NFS, tape, Castor
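The JDL mentioned above is a Condor ClassAd-style description of the job. A minimal illustrative sketch (file names and values invented) of the kind of JDL an EDG/LCG user would submit from the UI:

```
Executable    = "analyse.sh";
Arguments     = "run1234";
StdOutput     = "std.out";
StdError      = "std.err";
InputSandbox  = {"analyse.sh", "cuts.cfg"};
OutputSandbox = {"std.out", "std.err", "histos.root"};
Requirements  = other.GlueCEPolicyMaxWallClockTime > 720;
Rank          = -other.GlueCEStateEstimatedResponseTime;
```

The Requirements and Rank expressions are evaluated by the Resource Broker's Condor matchmaking libraries against the Glue attributes each site publishes.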

Page 15

Common Grid Components

• LCG uses middleware common to other Grid projects:

– VDT (v1.1.14)

• Globus Gatekeeper

• Globus MDS

• Glue CE Information Provider

• Used by NGS, Grid3 and NorduGrid

• Preserving this core increases the chances of inter-grid interoperability.

Page 16

Extra Grid Components

• LCG extends VDT with fixes and the deployment of other grid services.

• This is only done when there is a shortfall or performance issue with the existing middleware.

• Most are grid-wide services for LCG rather than extra components for sites to install.

– Minimises conflicts between grids.

– Not always true – see later.

Page 17

LCG PBSJobManager

• Motivation

– The standard Globus JobManager starts one perl process per job, queued or running.

• One user can easily overwhelm a Gatekeeper.

– It also assumes a shared /home file system is present.

• Not scalable to 1000s of nodes.

• NFS is a single point of failure.

– The Resource Broker must poll jobs individually.

Page 18

LCG PBSJobManager

• Solution

– The LCG jobmanager stages files to the batch worker with scp and GridFTP.

• This creates new problems, though.

– It is even harder to debug and there is more to go wrong.

– MPI jobs are more difficult, though an rsync workaround exists.

Page 19

LCG PBSJobManager

• Solution

– The JobManager starts up a "GridMonitor" on the gatekeeper.

– Currently one GridMonitor is started per Resource Broker.

– The Resource Broker communicates with the monitor instead of polling jobs individually.

– Moving this to one GridMonitor per user is possible.

• Currently deployed at almost all GridPP sites.
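The effect of the GridMonitor can be sketched in a few lines. Everything below is invented toy code, not the real LCG implementation, but it shows why one summary query per broker scales better than one poll per job:

```python
# Toy sketch of the GridMonitor idea: job states are collected locally
# from the batch system, and the Resource Broker retrieves one summary
# per monitor instead of polling each job through the gatekeeper.

from collections import Counter

class ToyGridMonitor:
    def __init__(self):
        self.jobs = {}                    # job id -> batch-system state

    def update(self, job_id, state):      # fed from the local batch system
        self.jobs[job_id] = state

    def summary(self):                    # one call serves the whole broker
        return dict(Counter(self.jobs.values()))

mon = ToyGridMonitor()
for jid, state in [(1, "queued"), (2, "running"), (3, "running"), (4, "done")]:
    mon.update(jid, state)

print(mon.summary())   # {'queued': 1, 'running': 2, 'done': 1}
```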

Page 20

Storage in LCG

• Currently there are three active solutions:

– GridFTP servers, the so-called ClassicSE

– SRM interfaces at CERN, IHEP (Russia), DESY and RAL (this week)

– edg-se – only one, as a front end to the Atlas datastore tape system at RAL

• The edg-rm and lcg-* commands abstract the end user from these interfaces.
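The shape of that abstraction can be sketched as follows. This is a toy dispatcher with invented handler descriptions, not the real edg-rm/lcg-* code, but it illustrates the idea of picking the transfer mechanism from the storage URL's scheme so the user never has to:

```python
# Toy sketch: dispatch a copy request to whichever interface the
# storage element actually speaks, based on the URL scheme.

def copy_out(surl: str) -> str:
    handlers = {
        "gsiftp": "plain GridFTP transfer (ClassicSE)",
        "srm":    "negotiate a transfer URL via SRM, then GridFTP",
        "edg-se": "go through the edg-se front end",
    }
    scheme = surl.split("://", 1)[0]
    try:
        return handlers[scheme]
    except KeyError:
        raise ValueError(f"no handler for scheme {scheme!r}") from None

print(copy_out("srm://dcache.example.ac.uk/pnfs/data/file1"))
```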

Page 21

Storage - SRM

• SRM = Storage Resource Manager

• Motivation

– Sites need to move files around and reorganise data dynamically.

– End users want/require a consistent name space for their files.

– End users want to be able to reserve space as well.

• SRM will in time be the preferred solution supported within LCG.

Page 22

SRM Deployment

• The current storage solution for LCG is dCache with an SRM interface, produced by DESY and FNAL.

• It is currently deployed at RAL in a test state and is slipping into production, initially for the CMS experiment.

• The expectation is that dCache with SRM will provide a solution for many sites.

– Edinburgh, Manchester and Oxford are all keen to deploy.

Page 23

SRM/dCache at RAL

Page 24

Resource Broker

• Allows selection of and submission to sites based on what they publish into the information system.

• Queues are published with

– queue lengths

– software available

– authorised VOs or individual DNs

• The RB can query the replica catalogue to run at a site with a particular file.

• Three RBs are deployed in the UK.
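A toy version of that matchmaking step (site data and tag names invented) might look like this: keep only sites whose published information authorises the job's VO and carries the required software tag, then rank the survivors:

```python
# Toy matchmaking sketch: filter published site info on VO authorisation
# and software tag, then rank the surviving sites by queue length.

sites = [
    {"name": "siteA", "vos": {"atlas", "lhcb"}, "tags": {"VO-atlas-8.0.5"}, "waiting": 12},
    {"name": "siteB", "vos": {"atlas"},         "tags": set(),              "waiting": 0},
    {"name": "siteC", "vos": {"cms"},           "tags": {"VO-atlas-8.0.5"}, "waiting": 3},
]

def match(vo, tag):
    ok = [s for s in sites if vo in s["vos"] and tag in s["tags"]]
    return sorted(ok, key=lambda s: s["waiting"])   # shortest queue first

print([s["name"] for s in match("atlas", "VO-atlas-8.0.5")])   # ['siteA']
```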

Page 25

L&B

• L&B = Logging and Bookkeeping Service

• Jobs publish their Grid state to L&B

– either by calling commands installed on the batch worker,

– or by GridFTP'ing the job wrapper back.

• The second requires no software on the batch workers, but the first gives better feedback.

Page 26

Application Installation with LCG

• Currently a sub-VO of software managers owns an NFS-mounted space.

– The software area is managed by jobs.

– Software is validated in the process.

– They drop a status file into the area, which is then published by the site.

• With the RB:

– End users match jobs to tagged sites.

– Software managers install software at non-tagged sites.

• This is being extended to allow DTEAM to install grid client software on the WNs.

Page 27

R-GMA

• Developed by GridPP within both EDG and now EGEE.

• Takes the role of a grid-enabled SQL database.

• Example applications include CMS and DØ publishing their job bookkeeping.

• Can also be used to transport the Glue values, allowing SQL lookups of Glue.

• R-GMA is deployed at most UK HEP sites.

• RAL currently runs the single instance of the R-GMA registry.
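As a local stand-in for that grid-enabled SQL view (table layout and values invented, using sqlite3 in place of R-GMA), this sketch shows the kind of SQL lookup of Glue attributes the text describes:

```python
# Toy stand-in for R-GMA: monitoring data exposed as a relational table
# that can be queried with ordinary SQL.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE GlueCE (site TEXT, freeCPUs INTEGER, waitingJobs INTEGER)")
db.executemany("INSERT INTO GlueCE VALUES (?, ?, ?)",
               [("RAL-LCG2", 120, 4), ("IC-LCG2", 0, 66), ("UCL-CCC", 40, 0)])

# Which sites have free CPUs, shortest queue first?
rows = db.execute(
    "SELECT site FROM GlueCE WHERE freeCPUs > 0 ORDER BY waitingJobs"
).fetchall()
print([r[0] for r in rows])   # ['UCL-CCC', 'RAL-LCG2']
```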

Page 28

Next LCG Release

• LCG 2_3_0 is due now.

– Built entirely on SL3 (a RHEL3 clone).

• RH73 is still an option.

– Many stability improvements.

– Addition of an accounting solution.

– Easier addition of VOs.

– Addition of dCache/SRM.

• and lots more…

• This release will last into next year.

• Potentially the last release before gLite components appear.

Page 29

There are still challenges

• Middleware validation

• Meeting experiment requirements with the Grid

• Distributed file (and sub-file) management

• Experiment software distribution

• Production accounting

• Encouraging an open sharing of resources

• Security

• Smoothing deployment and service upgrades.

Page 30

Middleware validation

[Diagram: middleware release flow – development & integration produces a dev tag (JRA1); certification & testing integrates it and runs basic functionality tests, C&T suites, site suites and the certification matrix, yielding a release candidate tag and then a certified release tag; application integration covers the HEP experiments, bio-med and others (apps software installation); deployment preparation produces a deployment release tag for deployment to SA1 services; the release then passes through pre-production to production with a production tag.]

Middleware validation is starting to be addressed through a Certification and Testing testbed. RAL is involved with both the JRA1 and Pre-Production systems.

Page 31

• ATLAS Data Challenge to validate the world-wide computing model

• Packaging, distribution and installation. Scale: one release build takes 10 hours and produces 2.5 GB of files.

• Complexity: 500 packages, millions of lines of code, 100s of developers and 1000s of users.

– The ATLAS collaboration is widely distributed: 140 institutes, all wanting to use the software.

– It needs 'push-button' easy installation.

[Diagram: ATLAS data flow. Step 1, Monte Carlo data challenges: physics models → detector simulation → Monte Carlo truth data and MC raw data → reconstruction → MC event summary data and MC event tags. Step 2, real data: trigger system → data acquisition and level-3 trigger → raw data (with trigger tags, calibration data and run conditions) → reconstruction → event summary data (ESD) and event tags.]

Software distribution

Page 32

Summary

• The Large Hadron Collider data volumes make Grid computing a necessity

• GridPP1 with EDG developed a successful Grid prototype

• GridPP members have played a critical role in most areas – security, work load management, information systems, monitoring & operations.

• GridPP involvement continues with the Enabling Grids for e-SciencE (EGEE) project – driving the federation of Grids

• As we move towards a full production service we face many challenges in areas such as deployment, accounting and true open sharing of resources

Page 33

Useful links

GridPP and LCG:

• GridPP collaboration: http://www.gridpp.ac.uk/

• Grid Operations Centre (inc. maps): http://goc.grid-support.ac.uk/

• The LHC Computing Grid: http://lcg.web.cern.ch/LCG/

Others:

• PPARC: http://www.pparc.ac.uk/Rs/Fs/Es/intro.asp

• The EGEE project: http://cern.ch/egee/

• The European DataGrid final review: http://eu-datagrid.web.cern.ch/eu-datagrid/