IT-SDC: Support for Distributed Computing. WLCG infrastructure monitoring proposal. Pablo Saiz, IT/SDC/MI, 16th August 2013

WLCG infrastructure monitoring proposal



Page 1: WLCG infrastructure monitoring  proposal

IT-SDC : Support for Distributed Computing

WLCG infrastructure monitoring proposal

Pablo Saiz, IT/SDC/MI

16th August 2013

Page 2: WLCG infrastructure monitoring  proposal

Infrastructure monitoring, P. Saiz, IT-SDC, 16 August 2013

Table of contents

I. Summary of the progress
II. Desired structure of applications
III. Proposal for infrastructure monitoring

Page 3: WLCG infrastructure monitoring  proposal

I. Summary

Page 4: WLCG infrastructure monitoring  proposal

Motivation

• Reduction in the number of people
• Redefining the scope of applications
• Combining expertise
• Step back and evaluate other alternatives

Goal: offer (at least) the same QoS with fewer resources

Page 5: WLCG infrastructure monitoring  proposal

Status so far

• WLCG monitoring consolidation group created
• Applications supported by the section
• Applications used
• … so now we know what to provide

Page 6: WLCG infrastructure monitoring  proposal

How to provide it

• Input from our experience
• Input from other groups
• What is available out there
• Split in different areas of work:
    • Source of information
    • Transport
    • Storage
    • Aggregation
    • Visualization
    • Documentation
    • Deployment
    • Recurrent tasks
• Review of the areas

Page 7: WLCG infrastructure monitoring  proposal

II. Structure of applications

Page 8: WLCG infrastructure monitoring  proposal

Different layers of applications

[Diagram of the layers: Collect information, Transport, Storage, Aggregate, Visualize, Recurrent Tasks, Documentation, Deployment]

Page 9: WLCG infrastructure monitoring  proposal

Deployment

• Using OpenStack, Puppet, Hiera, Foreman
• Quota of 100 nodes, 240 cores
• Multiple templates already created:
    • Development machine (7 nodes)
    • Web servers (SSB, SUM, WLCG transfers, Job: 16 nodes)
    • Elasticsearch (6 nodes), Hadoop (4 nodes)
• Currently working on the Nagios installation
• Migrating machines from Quattor to AI
• Koji and Bamboo for the build system and continuous integration

Page 10: WLCG infrastructure monitoring  proposal

Source of information

• Gather information from external and internal sources
• Publish it in the transport layer
• Sources shown in the diagram: Nagios, GOCDB, REBUS, OIM, Savannah, other applications
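A minimal sketch of this layer, assuming each source is wrapped in a small collector that normalizes what it reads and hands it to the transport layer. The `Transport` interface, the `LocalFileTransport` class, the record field names, and the sample probe results are all hypothetical and are not taken from the actual system.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Protocol


class Transport(Protocol):
    """Anything that can carry a message towards storage (hypothetical interface)."""

    def publish(self, message: dict) -> None: ...


class LocalFileTransport:
    """A possible transport: append messages to a local spool file."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def publish(self, message: dict) -> None:
        # One JSON document per line, so a consumer can read the file incrementally.
        with self.path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(message, sort_keys=True) + "\n")


def collect(source: str, raw_results: Iterable[dict], transport: Transport) -> int:
    """Normalize raw results from one source and publish them; returns the count."""
    count = 0
    for raw in raw_results:
        transport.publish({
            "source": source,
            "instance": raw["host"],
            "metric": raw["check"],
            "time": raw["timestamp"],
            "value": raw["status"],
        })
        count += 1
    return count


# Made-up sample data standing in for the output of a Nagios probe.
sample = [
    {"host": "ce01.example.org", "check": "job-submit", "timestamp": 1376640000, "status": "OK"},
    {"host": "se01.example.org", "check": "srm-put", "timestamp": 1376640060, "status": "CRITICAL"},
]
spool = Path(tempfile.mkdtemp()) / "spool.jsonl"
published = collect("nagios", sample, LocalFileTransport(spool))
lines = spool.read_text(encoding="utf-8").splitlines()
```

Because the collector only depends on `publish`, the same code could in principle feed a message broker or an HTTP endpoint instead of a file.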

Page 11: WLCG infrastructure monitoring  proposal

Transport

• Message broker
• Local files
• HTTP PUT/GET
• UDP
• (Table in DB)?

Page 12: WLCG infrastructure monitoring  proposal

Storage

• Metric storage
    • Accepts any data
    • #jobs, status of a service, downtime, pledges, channel status
    • Metric, Instance, Time Range, Value
• Archival
    • Long-term data
    • (Same format as Metric Storage)?
• Current Metrics
    • Most common views
• Metadata
    • Profiles
    • Topology
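The (Metric, Instance, Time Range, Value) format can be sketched as a single record type, with "Current Metrics" derived from it as the latest record per metric and instance. This is only an illustration: the class name, the field names, the use of epoch seconds, and the sample values are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple, Union

Value = Union[int, float, str]


@dataclass(frozen=True)
class MetricRecord:
    metric: str
    instance: str
    start: int  # beginning of the time range (epoch seconds, an assumption)
    end: int    # end of the time range
    value: Value


def current_metrics(records: Iterable[MetricRecord]) -> Dict[Tuple[str, str], MetricRecord]:
    """Keep only the most recent record for each (metric, instance) pair."""
    latest: Dict[Tuple[str, str], MetricRecord] = {}
    for record in records:
        key = (record.metric, record.instance)
        if key not in latest or record.start > latest[key].start:
            latest[key] = record
    return latest


# Made-up history mixing the kinds of data listed above.
history = [
    MetricRecord("running_jobs", "SITE_A", 0, 600, 120),
    MetricRecord("running_jobs", "SITE_A", 600, 1200, 95),
    MetricRecord("service_status", "SITE_A", 0, 1200, "OK"),
    MetricRecord("downtime", "SITE_B", 300, 900, "scheduled"),
]
current = current_metrics(history)
```

Since the value is untyped, job counts, service statuses, and downtimes all fit the same record without schema changes.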

Page 13: WLCG infrastructure monitoring  proposal

Aggregation

• Treated as another metric
• Might collect input from previous metrics
• Current schema of ‘CMS Site readiness’

[Diagram: schema of ‘CMS Site readiness’, with boxes Summary, Site readiness, Availability]

Page 14: WLCG infrastructure monitoring  proposal

Visualization

• Server:
    • HTML skeleton
    • REST API with JSON data
    • Cache: memcache, varnish
• Client:
    • Common library + plugin
        • jQuery
    • Common MVC
        • No obvious choice…
    • Plots (Interactive, Exportable, Embeddable)
        • Highcharts
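As a rough illustration of the server side, the sketch below renders a JSON body for a hypothetical `/api/status` endpoint and puts a small time-based cache in front of it. The `TTLCache` class is an in-process stand-in for memcache or varnish, and the endpoint path, the payload layout, and the 60-second lifetime are assumptions.

```python
import json
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """In-process stand-in for memcache or varnish: entries expire after `ttl` seconds."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self.clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], str]) -> str:
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: serve the cached body
        body = compute()
        self._entries[key] = (now, body)
        return body


# Made-up data that the storage layer might return.
STATUS = {"SITE_A": "OK", "SITE_B": "CRITICAL"}
calls = {"count": 0}


def render_status_json() -> str:
    """Body of a hypothetical GET /api/status response."""
    calls["count"] += 1
    return json.dumps({"metric": "service_status", "values": STATUS}, sort_keys=True)


fake_time = [0.0]  # controllable clock so the example does not need to sleep
cache = TTLCache(ttl=60, clock=lambda: fake_time[0])
first = cache.get_or_compute("/api/status", render_status_json)
second = cache.get_or_compute("/api/status", render_status_json)  # served from cache
fake_time[0] = 61.0
third = cache.get_or_compute("/api/status", render_status_json)   # expired, recomputed
```

The client-side library would then only need to fetch this JSON and hand it to the plotting plugin.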

Page 15: WLCG infrastructure monitoring  proposal

III. Infrastructure monitoring

Page 16: WLCG infrastructure monitoring  proposal

Current situation

• Big system, difficult to maintain/evolve
• Many internal dependencies
• Multiple schemas, aggregations: SSB, MRS, ACE
• Scope much bigger than what we need
    • Limit to WLCG
• Usage of probes
    • Does not test what the experiments are doing!
• Non-trivial deployment of new tests
• Based on technologies available at the time of the design
• New requests from experiments:
    • Test whatever they want
    • Availability vs Usability
• Combine Dashboard/SAM apps

Page 17: WLCG infrastructure monitoring  proposal

Infrastructure monitoring

[Diagram: current components placed on the application layers: Nagios, Pledge, Down, Pilot, HC, VO feed, MyWLCG, SSB, SUM, Trend, Report, ACE, POEM, Archival, Metrics]

Page 18: WLCG infrastructure monitoring  proposal

And for the prototype…

[Diagram: prototype components placed on the application layers: Nagios, Pledge, Down, Direct, HC, VO feed, MyWLCG, SSB, SUM, Trend, Report, ACE, POEM, Archival, Metrics; additional labels: New Data, Processed Data, consume2db]

• Metrics
    • Simplified MRS
    • Accepts any data
    • No foreign keys!
    • No status calculation
    • 300K messages per day
• SSB Storage
    • SSB format
    • Records status changes
    • Same procedure as any other metric
• All the data in storage have the same format: Instance, Metric, Time range, Value
• Source could be Nagios, pilot framework, VO-defined metrics, availabilities
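One way to read "records status changes" is that a consumer stores a new row only when the value of a metric changes, and closes the time range of the previous value. The sketch below does this with SQLite and a single flat table without foreign keys. The slides do not describe how consume2db works, so the table layout, the `consume` function, and the sample messages are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Single flat table, no foreign keys: every row is (instance, metric, time range, value).
conn.execute(
    "CREATE TABLE metric_data ("
    " instance TEXT NOT NULL, metric TEXT NOT NULL,"
    " start_time INTEGER NOT NULL, end_time INTEGER, value TEXT NOT NULL)"
)


def consume(instance: str, metric: str, timestamp: int, value: str) -> bool:
    """Store a message only if it changes the current value; returns True on change."""
    row = conn.execute(
        "SELECT rowid, value FROM metric_data"
        " WHERE instance = ? AND metric = ? AND end_time IS NULL",
        (instance, metric),
    ).fetchone()
    if row is not None and row[1] == value:
        return False  # same status as before: nothing to record
    if row is not None:
        # Close the open time range of the previous value.
        conn.execute(
            "UPDATE metric_data SET end_time = ? WHERE rowid = ?", (timestamp, row[0])
        )
    # Open a new time range for the new value (end_time stays NULL until it changes).
    conn.execute(
        "INSERT INTO metric_data VALUES (?, ?, ?, NULL, ?)",
        (instance, metric, timestamp, value),
    )
    return True


# Made-up stream of messages for one instance and one metric.
messages = [
    ("SITE_A", "srm-put", 100, "OK"),
    ("SITE_A", "srm-put", 200, "OK"),
    ("SITE_A", "srm-put", 300, "CRITICAL"),
    ("SITE_A", "srm-put", 400, "CRITICAL"),
    ("SITE_A", "srm-put", 500, "OK"),
]
changes = [consume(*message) for message in messages]
rows = conn.execute(
    "SELECT start_time, end_time, value FROM metric_data ORDER BY start_time"
).fetchall()
```

Five messages end up as three rows, which suggests why storing only changes keeps the table small even at a few hundred thousand messages per day.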

Page 19: WLCG infrastructure monitoring  proposal

And now we can see metrics…

[Screenshot of the metrics display; embedded slide dated 14 August 2013]

Page 20: WLCG infrastructure monitoring  proposal

Aggregation

• Combination of ACE + SSB Virtual Columns
• Two types:
    • Horizontal: Ins1 (M1…Mn) → Ins1 (Mp)
    • Vertical: M1 (Ins1…Insn) → Insp (M2)
• Initial options for “and”, “or” of current status
• Later on, might be extended to ‘sliding window’
• Full description
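The two aggregation types can be sketched as functions that read existing metrics and write the result back as another metric. This is an interpretation of the slide, not the actual ACE or SSB logic: the function names, the boolean "OK" encoding, and the sample instances and metrics are assumptions. Python's built-in `all` and `any` play the role of "and" and "or".

```python
from typing import Callable, Dict, Iterable, Sequence, Tuple

# (instance, metric) -> True if the current status is OK
Status = Dict[Tuple[str, str], bool]
Combine = Callable[[Iterable[bool]], bool]


def horizontal(status: Status, instance: str, metrics: Sequence[str],
               new_metric: str, combine: Combine) -> None:
    """Ins1 (M1…Mn) -> Ins1 (Mp): combine several metrics of one instance."""
    status[(instance, new_metric)] = combine(status[(instance, m)] for m in metrics)


def vertical(status: Status, metric: str, instances: Sequence[str],
             new_instance: str, new_metric: str, combine: Combine) -> None:
    """M1 (Ins1…Insn) -> Insp (M2): combine one metric over several instances."""
    status[(new_instance, new_metric)] = combine(status[(i, metric)] for i in instances)


# Made-up current status of two services at one site.
status: Status = {
    ("ce01", "job-submit"): True,
    ("ce01", "ca-cert"): True,
    ("ce02", "job-submit"): False,
    ("ce02", "ca-cert"): True,
}

# Horizontal "and": a service is OK only if all of its tests are OK.
for service in ("ce01", "ce02"):
    horizontal(status, service, ["job-submit", "ca-cert"], "critical", all)

# Vertical "or": the site is usable if at least one service is OK.
vertical(status, "critical", ["ce01", "ce02"], "SITE_A", "ce_usable", any)
# Vertical "and": stricter variant requiring every service to be OK.
vertical(status, "critical", ["ce01", "ce02"], "SITE_A", "ce_all_ok", all)
```

Because each result is stored under its own (instance, metric) key, an aggregate can itself be the input of a later aggregation, as the vertical step above does with `critical`.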

Page 21: WLCG infrastructure monitoring  proposal

Examples of aggregation

[Screenshot: aggregation example; visible labels: ATLAS_CRITICALWN, Site; annotation: (expand this column)]

Page 22: WLCG infrastructure monitoring  proposal

Summary

• Lots of progress towards a unified schema
    • Data can be published from different sources: Nagios, VO-defined metrics, ACE, (HC, Job Pilots)
    • Single schema for storage
    • Components talk to each other through API
• Getting close to a “proof of concept”
    • Aggregation needs some work
    • Visualization might need adjusting
• Other tasks can go in parallel
    • NoSQL evaluation
    • Nagios configuration
        • Only active metrics