EGEE-III INFSO-RI-222667 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks Maite Barroso SA1 activity leader CERN EGEE-III First Review, 24-25 June, 2009 Grid Operations SA1 Status Report

Grid Operations SA1 Status Report



Page 1: Grid Operations SA1 Status Report

EGEE-III INFSO-RI-222667

Enabling Grids for E-sciencE

www.eu-egee.org

EGEE and gLite are registered trademarks

Maite Barroso, SA1 activity leader, CERN

EGEE-III First Review, 24-25 June, 2009

Grid Operations: SA1 Status Report

Page 2: Grid Operations SA1 Status Report


SA1 Activity Overview

Country | Total PM planned at M24 (1) | Total FTE
Austria | 37 | 1.5
Belgium | |
Bulgaria | 60 | 2.5
CERN | 420 | 17.5
Croatia | 47 | 2.0
Cyprus | 47 | 2.0
Czech Republic | 58 | 2.4
Finland | 24 | 1.0
France | 450 | 18.8
Germany | 392 | 16.3
Greece | 131 | 5.5
Hungary | 38 | 1.6
Ireland | 36 | 1.5
Israel | 52 | 2.2
Italy | 468 | 19.5
Netherlands | 204 | 8.5
Norway | |
Poland | 152 | 6.3
Portugal | 100 | 4.2
Romania | 57 | 2.4
Russia | 424 | 17.7
Serbia | 55 | 2.3
Slovakia | 33 | 1.4
Slovenia | 16 | 0.7
Spain | 317 | 13.2
Sweden | 120 | 5.0
Switzerland | 24 | 1.0
Turkey | 66 | 2.8
UK | 372 | 15.5
Total | 4200 | 175.0

28 countries, 175 FTE
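The FTE column is consistent with dividing the planned person-months by a 24-month horizon (for example, CERN: 420 PM gives 17.5 FTE). The sketch below reproduces that arithmetic; the 24-month divisor is an assumption inferred from the "M24" label and the table values, and the function name is hypothetical.

```python
# Assumption: FTE = planned person-months / 24 (the "M24" planning horizon).
PROJECT_MONTHS = 24


def pm_to_fte(person_months: float, months: int = PROJECT_MONTHS) -> float:
    """Convert planned person-months into full-time equivalents."""
    return person_months / months


# A few rows copied from the table, plus the project total.
planned_pm = {"CERN": 420, "France": 450, "Italy": 468, "Slovenia": 16, "Total": 4200}

for partner, pm in planned_pm.items():
    print(f"{partner}: {pm} PM = {pm_to_fte(pm):.1f} FTE")
```

Rounded to one decimal, these values match the table rows they were taken from.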

Pie chart, share per activity: NA1 2%, NA2 5%, NA3 8%, NA4 19%, NA5 1%, SA1 49%, SA2 2%, SA3 9%, JRA1 6%

Page 3: Grid Operations SA1 Status Report


Grid Operations
• Reliable, multi-VO, large-scale production infrastructure
• Uninterrupted service
• Operational processes, tools and documentation
• Worldwide collaboration between ROCs and sites

Page 4: Grid Operations SA1 Status Report


Size of the infrastructure

Number of EGEE-III certified sites

Number of EGEE-III certified sites per region

Computing resources:
• 155 MSI2k at the end of January 2009
• already more than the 124 MSI2k planned for the end of the project!

Storage resources:
• Currently deployed information providers have known issues, unreliable data
• Ongoing initiative, started by WLCG, to review and fix them
• Foreseen for Y2

Page 5: Grid Operations SA1 Status Report


Usage of the infrastructure (I)

Monthly production normalized CPU time by VO

Number of EGEE-III certified sites per region

• Remarkable increase in the usage of the grid resources

Monthly production normalized CPU time by ROC

• Steady increase in the usage of the grid resources by most VOs

• Some of the larger VOs show considerable fluctuations, due to specific challenges

• Substantial increase for some VOs: ATLAS, LHCb and CMS

Page 6: Grid Operations SA1 Status Report


Usage of the infrastructure (II)

Number of jobs

• Steadily increasing until October 2008, stable since then
• 10 million jobs per month
• 370,000 jobs/day (188,000 last year; doubled since then!)

Page 7: Grid Operations SA1 Status Report


Usage of the infrastructure (III)

Data transfers

• The bulk of the data transported can be credited to the four LHC VOs

• Peaks of data transfer activity in spring and summer 2008 correspond to WLCG service challenges and stress tests in preparation for the start of the operational phase of the LHC

• Slowly increasing in the last months

• Sustained data rates of more than 0.9 GB/s with peaks up to 1 GB/s
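To put the quoted rates in perspective, the sketch below converts a sustained transfer rate into a daily volume. It assumes decimal units (1 TB = 1000 GB) and that the rate holds for a full 24 hours; both are illustrative assumptions, not figures from the slide, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
GB_PER_TB = 1000  # assumption: decimal units


def daily_volume_tb(rate_gb_per_s: float) -> float:
    """Data volume in TB moved in one day at a constant rate given in GB/s."""
    return rate_gb_per_s * SECONDS_PER_DAY / GB_PER_TB


for rate in (0.9, 1.0):  # sustained and peak rates quoted on the slide
    print(f"{rate} GB/s sustained for a day is about {daily_volume_tb(rate):.1f} TB")
```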

Page 8: Grid Operations SA1 Status Report


Seed resources
• Pool of compute and storage resources made available to new VOs to ease the process of becoming a user of the EGEE e-Infrastructure (with dedicated funding)
• Resources (257 cores and 27 TB of disk space) allocated to 4 sites, with well-defined usage policies, up and running since January 2009

Metric | Value | VOs
Number of VOs allocated to seed resources | 2 | na4.vo.eu-egee.org, eticsproject.eu
Number of requests for seed-resource allocation | 1 | Climate-G VO
Number of jobs submitted from seed-resource VOs | 30150 | na4.vo.eu-egee.org, eticsproject.eu
Computing power consumed within seed resources pool (KSI2K) | 61420 | na4.vo.eu-egee.org, eticsproject.eu
Disk storage used within seed resources (GB) | 350 | na4.vo.eu-egee.org, eticsproject.eu
Services organized by the VOs on their own | WMS = 1, LFC = 1, CE = 2, SE = 2 | na4.vo.eu-egee.org
Services organized by the VOs on their own | WMS = 0, LFC = 1, CE = 0, SE = 0 | eticsproject.eu

Page 9: Grid Operations SA1 Status Report


SLA Roll-out
• SLAs facilitate the establishment of a partnership between infrastructure management structures and resource centres (sites) to provide a defined quality of service to the users of resources.
• Slow but steady progress in all regions
• 127 sites out of 264 (48%) have signed the SLA:
– Some ROCs sign with the national grid organizations (UKI, Italy)
– Others consider the signature of the WLCG MoU to be equivalent (France)
• Complete set of metrics defined
– Site availability/reliability is gathered automatically every month
– All others are gathered quarterly, from different sources, some of them not automated
– Ongoing work at CESGA to provide an operations metrics portal collecting all metric results

Page 10: Grid Operations SA1 Status Report


Site Availability / Reliability
• Availability and reliability targets are defined in the EGEE ROC-Site SLA (70% Availability, 75% Reliability)
• Results published monthly as the EGEE League Table
– https://edms.cern.ch/document/963325/
• Systematic review of results by ROCs and SA1 management
• Since May 2008, steady, albeit irregular, improvement of overall site availability.
• Discovering limitations of weighting by CPU count due to server consolidation
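The following is a minimal sketch of how monthly site figures could be checked against the SLA targets and aggregated with CPU-count weighting. The slide does not define the two metrics, so the definitions here are assumptions: availability as uptime over total time, and reliability as uptime over total time excluding scheduled downtime. The class, function names and sample numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class SiteMonth:
    name: str
    cpu_count: int
    hours_total: float
    hours_up: float
    hours_scheduled_down: float = 0.0


def availability(site: SiteMonth) -> float:
    # Assumed definition: fraction of the whole period the site was up.
    return site.hours_up / site.hours_total


def reliability(site: SiteMonth) -> float:
    # Assumed definition: scheduled downtime is not counted against the site.
    return site.hours_up / (site.hours_total - site.hours_scheduled_down)


def meets_sla(site: SiteMonth, min_avail: float = 0.70, min_rel: float = 0.75) -> bool:
    # Targets taken from the slide: 70% availability, 75% reliability.
    return availability(site) >= min_avail and reliability(site) >= min_rel


def cpu_weighted_mean(sites: Sequence[SiteMonth], metric: Callable[[SiteMonth], float]) -> float:
    total_cpus = sum(s.cpu_count for s in sites)
    return sum(metric(s) * s.cpu_count for s in sites) / total_cpus


def plain_mean(sites: Sequence[SiteMonth], metric: Callable[[SiteMonth], float]) -> float:
    return sum(metric(s) for s in sites) / len(sites)


# Made-up sample data: one large healthy site and one small struggling site.
sites = [
    SiteMonth("large-site", cpu_count=1000, hours_total=720, hours_up=684),
    SiteMonth("small-site", cpu_count=10, hours_total=720, hours_up=360,
              hours_scheduled_down=72),
]

for s in sites:
    print(s.name, f"A={availability(s):.2f}", f"R={reliability(s):.2f}", meets_sla(s))
print("CPU-weighted availability:", round(cpu_weighted_mean(sites, availability), 3))
print("Unweighted availability:", round(plain_mean(sites, availability), 3))
```

With these sample numbers the weighted average is dominated by the large site, which illustrates one way a CPU-weighted figure can hide problems at small sites. This is an illustration only and not necessarily the limitation the slide refers to.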

Page 11: Grid Operations SA1 Status Report


Site Availability Improvements

Figure panels: May 2008, April 2009

The figures show that regular monitoring of the SAM test results, together with the associated follow-up activity, contributed to improving both the overall and the regional availability and reliability.

Page 12: Grid Operations SA1 Status Report


Site Availability evolution

Figure: EGEE Regional Availability Figures, monthly from May 2008 to April 2009 (vertical axis 50% to 100%). Series: AP, CERN, CE, France, DECH, Italy, NE, Russia, SEE, SWE, UKI, Average.

Page 13: Grid Operations SA1 Status Report


Release and deployment management

• Releases of new middleware must not disrupt the operational state of the production infrastructure:
– incremental updates of the middleware have proved to be effective
– there were nevertheless a few incidents affecting the production system during the deployment of some updates:
    – post-mortems carried out with SA3 for these incidents
    – standard mechanism to roll back a middleware upgrade
    – staged roll-out at selected sites, to detect critical incidents as early as possible
• This goes in the direction of the future model that SA1 is putting in place, including staged roll-outs, fine-grained versioning of the grid services, and a reliable production repository
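The staged roll-out and roll-back ideas above can be sketched as a small control loop: deploy to a few selected sites first, stop and roll back if any of them reports a problem, and only then continue to the remaining sites. This is an illustrative sketch, not the SA1 procedure; every function and parameter name is hypothetical, and the deploy, health-check and roll-back actions are passed in as callbacks.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    early_sites: Sequence[str],
    remaining_sites: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    roll_back: Callable[[str], None],
) -> bool:
    """Deploy to early-adopter sites first; undo everything on the first failure.

    Returns True if the update reached all sites, False if it was rolled back.
    """
    upgraded: List[str] = []
    for site in early_sites:
        deploy(site)
        upgraded.append(site)
        if not is_healthy(site):
            # Critical incident detected early: revert every upgraded site.
            for done in reversed(upgraded):
                roll_back(done)
            return False
    # Early stage passed without incidents: continue with the other sites.
    for site in remaining_sites:
        deploy(site)
    return True


# Usage with in-memory stand-ins for the real actions (hypothetical site names).
log: List[str] = []
ok = staged_rollout(
    early_sites=["site-a", "site-b"],
    remaining_sites=["site-c"],
    deploy=lambda s: log.append(f"deploy {s}"),
    is_healthy=lambda s: s != "site-b",  # pretend site-b breaks after the update
    roll_back=lambda s: log.append(f"rollback {s}"),
)
print(ok, log)
```

Passing the actions as callbacks keeps the control flow separate from site-specific tooling, which also makes the logic easy to exercise without touching real services.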

Page 14: Grid Operations SA1 Status Report


Pre-Production Service
• Pilot services:
– New service: on-demand previews of new middleware functionalities to interested users
– 5 pilot services (WMS 3.1, Site Central Authorization Service (SCAS), CREAM CE, VOMS and SLC5 Worker Nodes)
– Very successful, valuable to the user and operations community
– Community effort based on common interests can work, with a thin layer for planning, coordination and tracking.
• Deployment testbed:
– Due to improvements in certification, focus is changing
– Many regions undertake their own rollout tests before wide-scale release
– Will evolve into a ‘staged rollout’ composed of representative sites from the regions that undertake the deployment of new certified software releases in a timely manner

Page 15: Grid Operations SA1 Status Report


Operational security
• Day-to-day operations focused on reported security incidents and vulnerabilities
– None involved the middleware as an infection vector
– No significant impact on the infrastructure
• Security "drills": the early 2009 campaign at the Tier1s showed a clear overall improvement from the sites
• Cooperation with the OAT for most of the security monitoring
• Collaboration with the NRENs identified as a priority by the ROCs
– Appropriate contact points identified and appointed on both sides
– Local and global cooperation being improved
• Security training and dissemination
– Full-scale security training event organised at EGEE 08
• Additional gLite-specific security recommendations published

Page 16: Grid Operations SA1 Status Report


Operational security
• Software vulnerabilities
– 28 new security vulnerabilities handled by the team
– Comprehensive vulnerability handling process published
• Joint Security Policy Group
– New mandate adopted
    – Clarified the stakeholders of the group
    – Confirmed the aim of preparing general policies for use on many Grids
– Four policy documents were approved
    – Approval of Certification Authorities
    – Grid Security Traceability and Logging Policy
    – VO Operations Policy
    – Policy on Grid Multi-User Pilot Jobs
• International Grid Trust Federation
– Significant progress was made on policies for operation of authorization services

Page 17: Grid Operations SA1 Status Report


Global Grid User Support
• Regional support with central coordination
• GGUS is the central integration platform, connected to other support structures (regional helpdesk, VO support infrastructures, etc.)

• Users can choose to submit a support request to the central GGUS, to their Regional Operations Centre (ROC), or to their Virtual Organisation (VO) support service

• Support procedures are continuously updated and improved.

• Best practices are shared between supporters, and documented in a knowledge base for all grid-related problems and their solutions.

Page 18: Grid Operations SA1 Status Report


Global Grid User Support

• Number of trouble tickets has been almost constant over time

• Not particularly affected by the increasing size of the EGEE e-Infrastructure and the number of users

• Most tickets belong to ENOC and CIC Support Units

Page 19: Grid Operations SA1 Status Report


Grid Operator on duty
• Role of oversight and first-level support for the grid production infrastructure
– Critical activity in maintaining usability and stability of sites
– First-line support model based on a central group of operators on duty (COD) opening tickets to sites in case of grid monitoring alarms
• Work in EGEE-III to define a new model, based on devolution to the regions
– First-line support done by each region, plus a common layer for procedures, tools, escalation
– New procedures and organizational scheme have been identified according to the requirements from existing COD teams, ROCs and sites, together with a migration work plan
– Four pilot federations have been identified: Central Europe, Northern Europe, Asia-Pacific and South West Europe.
• Expected advantages:
– Improvement in terms of number of tickets handled and response time
– Preparation for a sustainable infrastructure based on the distribution of responsibilities to federations.

Page 20: Grid Operations SA1 Status Report


Grid Operations Automation
• Aims
– Improve reliability and availability of sites via improved operational tools
– Increase automation of operations infrastructure
– Prepare operational tools for use in an EGI/NGI structure
• Operations Automation Team (OAT) with representatives from ROCs, sites, all operation tools, and related infrastructure projects
– Strategy document at PM1 outlining technical architecture to achieve these aims
– New regional operation monitoring and ticketing flows defined by COD team, and implemented by OAT tools: Nagios, Regional Dashboard, GGUS

Page 21: Grid Operations SA1 Status Report


Operations Automation Team

Focus:
• Site monitoring via Nagios, a commodity open-source monitoring framework
• Integration of operational tools via ActiveMQ, an open-source enterprise messaging system

Achievements:
• Providing sites with a ready-to-deploy Nagios monitoring solution, which configures itself automatically and includes a reference set of grid probes
• Nagios couples grid service monitoring with local fabric monitoring
• 120 sites monitored at site
• 174 sites monitored at ROCs

Next Steps:
• Phased release of updated operational tools to meet the issues of a regional deployment
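For readers unfamiliar with Nagios probes, the sketch below shows the general shape of one: a check returns a status code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN, following the standard Nagios plugin convention) and a one-line message. It is not one of the grid probes mentioned above; the measured quantity, thresholds and function name are hypothetical.

```python
from typing import Optional, Tuple

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_response_time(
    seconds: Optional[float], warn: float = 5.0, crit: float = 15.0
) -> Tuple[int, str]:
    """Map a measured service response time onto a Nagios status and message.

    The thresholds are illustrative defaults, not values used by EGEE.
    """
    if seconds is None:
        return UNKNOWN, "UNKNOWN - no measurement available"
    # Performance data after the pipe: label=value[unit];warn;crit
    perfdata = f"time={seconds:.3f}s;{warn};{crit}"
    if seconds >= crit:
        return CRITICAL, f"CRITICAL - response took {seconds:.1f}s | {perfdata}"
    if seconds >= warn:
        return WARNING, f"WARNING - response took {seconds:.1f}s | {perfdata}"
    return OK, f"OK - response took {seconds:.1f}s | {perfdata}"


# A real plugin would print the message and exit with the status code, e.g.:
#   status, message = check_response_time(measure())   # measure() is hypothetical
#   print(message); sys.exit(status)
for sample in (1.2, 7.0, 20.0, None):
    print(check_response_time(sample))
```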


Page 22: Grid Operations SA1 Status Report


Regionalized operations tools
• Architecture and design phase now finished
• All tools have provided plans with functionality and milestones for delivery
• A set of milestone deliverables, each of which gives complete functionality
– 3-month intervals, starting April 2009
• If timescales slip, we can stop at any of the milestones and have a functional solution
– Sacrificing functionality or distribution


Page 23: Grid Operations SA1 Status Report


Plans for Y2
• Main goal is to transition to the operating model and infrastructure proposed by the EGI Blueprint, for all SA1 tasks, with no disturbance to the reliable EGEE production infrastructure
– Define which other tasks/roles will be regionalized, and make a plan to achieve it
– Finalize the regionalization for the tasks already identified (COD, user support)
– Finalize operation tool developments necessary to enable regionalization, and deploy them transparently in production
– Revise the software release and deployment procedure so that it uses a ‘staged rollout’ as opposed to the Deployment Testbed in the current PPS

Page 24: Grid Operations SA1 Status Report


Summary
• EGEE Infrastructure has continued to increase in size, scale, usage and reliability
• Distribution and automation are the driving forces
• Distribution: we are gradually evolving the operations model to move responsibility to the regions; this has an impact on effort, tools and procedures
– Intense program of work for Y2
– Preserving the collaboration is essential for this and for the future EGI/NGI model
• Automation: by devolving a complete solution for grid monitoring to sites/ROCs, and a complete operations toolkit integrated through well-defined interfaces and using messaging