EGEE-II INFSO-RI- 031688 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks Multi-level monitoring - an overview James Casey, OAT EGEE’08 Istanbul, Turkey

Multi-level monitoring - an overview


Page 1: Multi-level monitoring -  an overview

EGEE-II INFSO-RI-031688

Enabling Grids for E-sciencE

www.eu-egee.org

EGEE and gLite are registered trademarks

Multi-level monitoring - an overview

James Casey, OAT

EGEE’08

Istanbul, Turkey

Page 2: Multi-level monitoring -  an overview

Why are we here…

Page 3: Multi-level monitoring -  an overview

What is the Operations Automation Team (OAT)

• EGEE MSA1.1: Operations Automation Strategy
    – Due end of PM1
    – Delivered mid-June
    – In review – comments welcome

• https://edms.cern.ch/document/927171

• Abstract: In EGEE-III, within the SA1 activity, a group called the ‘Operations Automation Team’ was formed with the task of coordinating operational tools and their development, with the specific goal of advising on the strategic directions to take in terms of automating the operations effort. This will entail replacing manual processes with automated ones so that the overall staffing level of operations can be significantly reduced in a long-term, sustainable infrastructure.

This document outlines a strategy for achieving this automation using an integration architecture based on messaging. It describes how current tools and processes, such as operational alarming and ticketing, will evolve during the lifetime of EGEE-III and lays out a roadmap for this evolution.

Page 4: Multi-level monitoring -  an overview

Operational Tools in EGEE-III

[Figure: operational tools grouped by function.
Function labels: Repositories of Information, Accounting, Monitoring, Ticket Followup, Reporting, Alarms, User Support, Enforcement.
Tool labels: GOCDB, Operations Portal, APEL, Accounting Portal, SAM, GStat, Operations Dashboard, GridView, Site Fabric Monitoring, GGUS.]

Page 5: Multi-level monitoring -  an overview

Current Operational Model

• Several teams involved
    – Operations Management (OCC)
    – Monitoring system operators (SAM)
    – Grid operators (COD)
    – Regional Operations Centres (ROC)
    – First line support teams (ROC)
    – Resource Centres/sites (RC)
    – User support team (GGUS)

[Figure: hierarchy of operational teams, in four layers.
Management: OCC.
Central Operational Teams: COD, SAM, GGUS.
Regional: several ROCs, each with 1st line support.
Site-level: several RCs under each ROC.]

Page 6: Multi-level monitoring -  an overview

Current operational model (s)

[Figure: two alarm-handling flows.
Panel 1, "Alarms handled by the COD operator": Site, ROC, SAM, Alarm, Operations Dashboard, Operations Team (COD), GGUS Ticket.
Panel 2, "Alarms handled directly by the 1st line support": Site, ROC 1st Line Support Team, SAM, Alarm, Regional Dashboard, GGUS Ticket; "After 24 Hours": Alarm, Operations Dashboard, Operations Team (COD), GGUS Ticket.]

Page 7: Multi-level monitoring -  an overview

Future operational model

[Figure: future alarm-handling flow.
Elements: Site; Alarm from site or regional monitoring; Regional Dashboard; r-COD Team; 1st Line Support; Local Ticket; Escalation; c-COD Team; Central Dashboard; GGUS Ticket; Provide support to fix problem.]

Page 8: Multi-level monitoring -  an overview

Multi-level monitoring

• Based on existing work in CE ROC
    – Replace central SAM with Nagios at ROC and site
    – Tie together with the messaging system (see later)
    – Regional operations dashboard and alarms DB
    – Link into regional ticketing
        E.g., via GGUS

• Follow new operational model
    – Raise alarms immediately at the site
    – 1st level support sees them and can respond if needed
    – Central COD only involved after 2-3 weeks, e.g. site banning

• Data is aggregated at the ROC for availability calculation
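The escalation rule in the new operational model can be sketched in a few lines of Python. This is only an illustration, not code from the EGEE tools: the function name `visible_to`, the level labels, and the 14-day threshold are all assumptions (the slide says only "2-3 weeks", and gives no threshold for when 1st level support steps in).

```python
from datetime import timedelta

# Assumed value: the slide says "2-3 weeks"; two weeks is an illustrative choice.
CENTRAL_ESCALATION_AFTER = timedelta(weeks=2)


def visible_to(alarm_age, central_after=CENTRAL_ESCALATION_AFTER):
    """Return the operational levels that see an unresolved alarm of this age.

    Hypothetical helper: the site and the ROC's 1st line support see the
    alarm immediately; the central COD (c-COD) is added only once the alarm
    has stayed open for at least `central_after`.
    """
    if alarm_age < timedelta(0):
        raise ValueError("alarm age cannot be negative")
    levels = ["site", "1st line support (ROC)"]
    if alarm_age >= central_after:
        levels.append("c-COD")
    return levels


# A fresh alarm stays at site and regional level.
fresh = visible_to(timedelta(hours=1))
# An alarm left open for three weeks also reaches the central COD.
stale = visible_to(timedelta(weeks=3))
```

Keeping the threshold as a parameter reflects that the slide gives a range rather than a fixed value.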

Page 9: Multi-level monitoring -  an overview

Multi level monitoring framework

Page 10: Multi-level monitoring -  an overview

Messaging for integration

• Use commodity messaging middleware (Apache ActiveMQ) to integrate systems
    – Reliable, scalable, industry standard, open protocols

• Broker already in production
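To make "open protocols" concrete, here is a minimal sketch of how a client might encode a monitoring result for a broker. It assumes STOMP, a text protocol that ActiveMQ supports; the slides do not say which protocol the EGEE tools use. The topic name, the payload fields, and the helper `stomp_send_frame` are hypothetical, and the function only builds the frame bytes without opening a connection.

```python
import json


def stomp_send_frame(destination, payload):
    """Build a STOMP SEND frame carrying `payload` as a JSON body.

    Hypothetical helper for illustration. A frame is the command line,
    header lines, a blank line, the body, and a terminating NUL byte.
    Header values containing ':' or newlines would need escaping, which
    this sketch does not handle.
    """
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    headers = [
        "destination:" + destination,
        "content-type:application/json",
        "content-length:" + str(len(body)),
    ]
    head = "SEND\n" + "\n".join(headers) + "\n\n"
    return head.encode("utf-8") + body + b"\x00"


# Assumed topic name and message fields, not taken from the EGEE deployment.
frame = stomp_send_frame(
    "/topic/example.monitoring.results",
    {"site": "EXAMPLE-SITE", "service": "CE", "status": "OK"},
)
```

Because every producer (for example Nagios at a site) and consumer (for example a dashboard) only needs to agree on the destination and the message format, new clients can be added without changing existing ones.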

[Figure: clients connected through the message broker: Accounting Database, SAM/Gridview, Dashboards, Nagios @ ROC, Nagios @ Site, (… more clients…)]

Page 11: Multi-level monitoring -  an overview

Roadmap for tools

• Milestone ‘Messaging 1’: August 2008
    – Production-level messaging broker in production. This should have internal failover capabilities, but will not have the WAN failover capabilities of a network of brokers

• Milestone ‘Messaging 2’: December 2008
    – A scalable and reliable network of brokers, consisting of a deployment over at least 3 sites, is in place

• Milestone ‘Site Monitoring 1’: September 2008
    – A release of the site components for the multi-level monitoring, including packaging and configuration as part of an EGEE middleware release, exists and is ready for deployment to the sites

• Milestone ‘ROC Monitoring 1’: December 2008
    – The ROC components for the multi-site monitoring are ready for deployment to sites

• Milestone ‘ROC Monitoring 2’: February 2009
    – The alarm component has been integrated with the regionalized dashboard

• Milestone ‘ROC Monitoring 3’: July 2009
    – The regional dashboard is now available to be deployed at the ROCs

Page 12: Multi-level monitoring -  an overview

Roadmap for distributed COD

• Milestone ‘rCOD 1’: September 2008
    – 4 ROCs carry out r-COD and 1st line support roles directly. This will be done with a ‘regionalized’ version of the current operations dashboard, and with SAM as the alarm generation system

• Milestone ‘rCOD 2’: April 2009
    – 4 additional ROCs carry out r-COD and 1st line support roles using the regionalized dashboard

• Milestone ‘rCOD 3’: April 2009
    – 2 additional ROCs carry out r-COD and 1st line support roles directly using the new multi-level monitoring framework

• Milestone ‘rCOD 4’: September 2009
    – All 11 ROCs carry out r-COD and 1st line support roles directly. The c-COD is fully established

• Milestone ‘rCOD 5’: December 2009
    – All 11 ROCs carry out r-COD and 1st line support roles using the new multi-level monitoring framework

Page 13: Multi-level monitoring -  an overview

Summary

• EGEE-III is moving to a new monitoring model

• Key concept is that sites:
    – are responsible for the reliability of their sites, with the help of their ROC as 1st line support
    – are provided with the tools to allow them to run reliable services
        Site monitoring component is provided, based on Nagios

• Part of an overall strategy
    https://edms.cern.ch/document/927171

• Since Nagios will become a core component within SA1 for administrators, we need to provide training…

• Now onto the Nagios-specific bits from the experts…
