IT-SDC : Support for Distributed Computing
WLCG infrastructure monitoring proposal
Pablo Saiz, IT/SDC/MI
16th August 2013
Infrastructure monitoring P. Saiz 2IT-SDC 16 August 2013
Table of contents
I. Summary of the progress
II. Desired structure of applications
III. Proposal for infrastructure monitoring
I. Summary
Motivation
• Reduction in the number of people
• Redefining the scope of applications
• Combining expertise
• Step out and evaluate other alternatives
• Goal: offer (at least) the same QoS with fewer resources
Status so far
• WLCG monitoring consolidation group created
• Applications supported by the section
• Applications used
… so now we know what to provide
How to provide it
• Split into different areas of work: Source of information, Transport, Storage, Aggregation, Visualization, Documentation, Deployment, Recurrent tasks
• Review of the areas
• Input from our experience
• Input from other groups
• What is available out there
II. Structure of applications
Different layers of applications
[Diagram of the layers: Collect information, Transport, Storage, Aggregate, Visualize, supported by Recurrent Tasks, Documentation and Deployment]
Deployment
• Using OpenStack, Puppet, Hiera, Foreman
• Quota of 100 nodes, 240 cores
• Multiple templates already created
  • Development machine (7 nodes)
  • Web servers (SSB, SUM, WLCG transfers, Job: 16 nodes)
  • Elastic Search (6 nodes), Hadoop (4 nodes)
• Currently working on Nagios installation
• Migrating machines from Quattor to AI
• Koji and Bamboo for build system and continuous integration
Source of information
• Gather information from external and internal sources
• Publish it to the transport layer
[Diagram: Collect information, fed by Nagios, GOCDB, REBUS, OIM, Savannah and other applications]
Transport
• Message Broker
• Local files
• HTTP PUT/GET
• UDP (table in DB)?
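As an illustration of how a collector could stay independent of the transport option chosen, here is a minimal sketch. The `Transport` interface, the `LocalFileTransport` class and the `collect` helper are hypothetical names, not part of the proposal; only the "local files" option is shown, stored as one JSON document per line.

```python
import json
import os
import tempfile
from typing import Dict, Iterable, List, Protocol


class Transport(Protocol):
    """Anything a collector can hand a message to (hypothetical interface)."""

    def publish(self, message: Dict) -> None: ...


class LocalFileTransport:
    """'Local files' option: append one JSON document per line."""

    def __init__(self, path: str) -> None:
        self.path = path

    def publish(self, message: Dict) -> None:
        with open(self.path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(message, sort_keys=True) + "\n")

    def read_all(self) -> List[Dict]:
        with open(self.path, encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]


def collect(source: Iterable[Dict], transport: Transport) -> int:
    """Push every record from a source into the transport; return the count."""
    count = 0
    for record in source:
        transport.publish(record)
        count += 1
    return count


if __name__ == "__main__":
    spool = os.path.join(tempfile.mkdtemp(), "spool.jsonl")
    sent = collect([{"metric": "downtime", "value": "OK"}], LocalFileTransport(spool))
    print(sent, "message(s) written to", spool)
```

A message broker or HTTP variant would only need to provide the same `publish` method, so the collectors would not have to change.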
Storage
• Storage
  • Accepts any data
    • #jobs, status of a service, downtime, pledges, channel status
  • Metric, Instance, Time Range, Value
• Archival
  • Long term data
  • (Same format as Metric Storage)?
• Current Metrics
  • Most common views
• Metadata
  • Profiles
  • Topology
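The single record format (Metric, Instance, Time Range, Value) can be sketched as a small data structure. This is only an illustration: the class names, the half-open time range and the in-memory list are assumptions, and the instance names in the usage part are made up.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, List, Optional


@dataclass(frozen=True)
class MetricRecord:
    """One stored value: Metric, Instance, Time Range (start, end), Value."""

    metric: str
    instance: str
    start: datetime
    end: datetime
    value: Any  # "accepts any data": a job count, a status string, a pledge...

    def covers(self, moment: datetime) -> bool:
        # Assumption: ranges are half-open, [start, end).
        return self.start <= moment < self.end


class MetricStore:
    """Toy in-memory store; a real one would sit on a database."""

    def __init__(self) -> None:
        self._records: List[MetricRecord] = []

    def add(self, record: MetricRecord) -> None:
        self._records.append(record)

    def value_at(self, metric: str, instance: str, moment: datetime) -> Optional[Any]:
        """Return the value valid at 'moment', or None if nothing is stored."""
        for record in self._records:
            if (record.metric == metric and record.instance == instance
                    and record.covers(moment)):
                return record.value
        return None


if __name__ == "__main__":
    store = MetricStore()
    day = datetime(2013, 8, 16)
    store.add(MetricRecord("jobs", "SITE_A", day, day.replace(hour=12), 120))
    store.add(MetricRecord("downtime", "SITE_A", day, day.replace(hour=6), "SCHEDULED"))
    print(store.value_at("jobs", "SITE_A", day.replace(hour=3)))
```

Because numeric and status values share one shape, the same query works for job counts, service status, downtimes and pledges.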
Aggregation
• Treated as another metric
• Might collect input from previous metrics
• Current schema of ‘CMS Site readiness’
[Diagram labels: Summary, Site readiness, Availability]
Visualization
• Server:
  • HTML skeleton
  • REST API with JSON data
  • Cache: memcache, varnish
• Client
  • Common library + plugin
    • jQuery
  • Common MVC
    • No obvious choice…
  • Plots (Interactive, Exportable, Embeddable)
    • Highcharts
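To make the "REST API with JSON data" item concrete, here is a minimal sketch of a server side that returns metric records as JSON, written as a plain WSGI application from the Python standard library. The URL layout (`/metrics/<name>`), the payload shape and the sample records are assumptions for illustration only; caching (memcache, varnish) is left out.

```python
import json
from wsgiref.util import setup_testing_defaults

# Hypothetical in-memory data; a real server would query the storage layer.
RECORDS = [
    {"instance": "SITE_A", "metric": "availability", "start": 0, "end": 3600, "value": "OK"},
    {"instance": "SITE_B", "metric": "availability", "start": 0, "end": 3600, "value": "CRITICAL"},
    {"instance": "SITE_A", "metric": "jobs", "start": 0, "end": 3600, "value": 120},
]


def app(environ, start_response):
    """WSGI app: GET /metrics/<name> returns the matching records as JSON."""
    parts = environ.get("PATH_INFO", "").strip("/").split("/")
    if len(parts) == 2 and parts[0] == "metrics":
        rows = [r for r in RECORDS if r["metric"] == parts[1]]
        status, body = "200 OK", {"metric": parts[1], "records": rows}
    else:
        status, body = "404 Not Found", {"error": "unknown resource"}
    payload = json.dumps(body).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(payload)))])
    return [payload]


def call(path):
    """Invoke the app in-process, without opening a socket."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    seen = {}

    def start_response(status, headers):
        seen["status"] = status

    body = b"".join(app(environ, start_response))
    return seen["status"], json.loads(body)


if __name__ == "__main__":
    print(call("/metrics/availability"))
```

The client library would then only need to fetch this JSON and hand it to the plotting plugin, which keeps the HTML skeleton free of data.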
III. Infrastructure monitoring
Current situation
• Big system, difficult to maintain/evolve
• Many internal dependencies
• Multiple schemas, aggregations: SSB, MRS, ACE
• Scope much bigger than what we need
• Limit to WLCG
• Usage of probes
• Does not test what the experiments are doing!
• Non-trivial deployment of new tests
• Based on technologies available at the time of the design
• New requests from experiments:
  • Test whatever they want
  • Availability vs Usability
  • Combine Dashboard/SAM apps
Infrastructure monitoring
[Diagram: current components placed on the application layers. Labels: Nagios, Pledge, Down, Pilot, HC, VO feed, MyWLCG, SSB, SUM, Trend, Report, ACE, POEM, Archival, Metrics]
And for the prototype…
[Diagram: prototype components placed on the application layers. Labels: Nagios, Pledge, Down, Direct, HC, VO feed, MyWLCG, SSB, SUM, Trend, Report, ACE, POEM, Archival, Metrics]
• SSB Storage
  • Records status changes
  • Same procedure as any other metric
• Simplified MRS
  • Accepts any data
  • No foreign keys!
  • No status calculation
  • 300K messages per day
[Diagram labels: New Data, Processed Data, consume2db, SSB format]
• All the data in storage have the same format: Instance, Metric, Time range, Value
• Source could be Nagios, pilot framework, VO-defined metrics, availabilities
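"Records status changes" can be read as collapsing a stream of repeated samples into one time range per status. The sketch below shows that reading; it is an assumption about how the storage behaves, not a description of the actual SSB or consume2db code, and the function and type names are hypothetical.

```python
from typing import Any, List, Optional, Tuple

Sample = Tuple[int, Any]                 # (unix timestamp, value)
Range = Tuple[int, Optional[int], Any]   # (start, end or None while open, value)


def status_changes(samples: List[Sample]) -> List[Range]:
    """Keep one time range per run of identical values."""
    ranges: List[Range] = []
    for timestamp, value in sorted(samples, key=lambda sample: sample[0]):
        if ranges and ranges[-1][2] == value:
            continue  # no change: the open range is still valid
        if ranges:
            start, _, previous = ranges[-1]
            ranges[-1] = (start, timestamp, previous)  # close the previous range
        ranges.append((timestamp, None, value))        # open a new one
    return ranges


if __name__ == "__main__":
    stream = [(0, "OK"), (10, "OK"), (20, "CRITICAL"), (30, "CRITICAL"), (40, "OK")]
    for entry in status_changes(stream):
        print(entry)
```

With this approach five incoming messages become three stored rows, which is one way a feed of 300K messages per day could stay compact in storage.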
And now we can see metrics…
Aggregation
• Combination of ACE + SSB Virtual Columns
• Two types:
  • Horizontal: Ins1 (M1…Mn) → Ins1 (Mp)
  • Vertical: M1 (Ins1…Insn) → Insp (M2)
• Initial options for “and”, “or” of current status
• Later on, might be extended to ‘sliding window’
• Full description
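The two aggregation types can be sketched as follows, under one reading of the notation: horizontal combines several metrics of one instance into a new metric of the same instance, and vertical combines one metric over several instances into a metric of another instance. The status vocabulary ("OK"/"CRITICAL"), the function names and the sample instances are assumptions; only the "and"/"or" options on the current status are covered.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Assumed status vocabulary; the slides only mention the "current status".
OK = "OK"
CRITICAL = "CRITICAL"

Table = Dict[Tuple[str, str], str]  # (instance, metric) -> current status
Combine = Callable[[Iterable[str]], str]


def combine_and(statuses: Iterable[str]) -> str:
    """'and': OK only if every input is OK."""
    return OK if all(status == OK for status in statuses) else CRITICAL


def combine_or(statuses: Iterable[str]) -> str:
    """'or': OK if at least one input is OK."""
    return OK if any(status == OK for status in statuses) else CRITICAL


def horizontal(table: Table, instance: str, metrics: List[str],
               target_metric: str, combine: Combine) -> Table:
    """Ins1 (M1...Mn) -> Ins1 (Mp): several metrics, same instance."""
    result = dict(table)
    result[(instance, target_metric)] = combine(table[(instance, m)] for m in metrics)
    return result


def vertical(table: Table, metric: str, instances: List[str],
             target_instance: str, target_metric: str, combine: Combine) -> Table:
    """M1 (Ins1...Insn) -> Insp (M2): one metric, several instances."""
    result = dict(table)
    result[(target_instance, target_metric)] = combine(
        table[(i, metric)] for i in instances)
    return result


if __name__ == "__main__":
    table = {("ce01", "submit"): OK, ("ce01", "cert"): CRITICAL,
             ("ce02", "submit"): OK, ("ce02", "cert"): OK}
    for ce in ("ce01", "ce02"):
        table = horizontal(table, ce, ["submit", "cert"], "usable", combine_and)
    table = vertical(table, "usable", ["ce01", "ce02"], "SITE_A", "ce_ok", combine_or)
    print(table[("SITE_A", "ce_ok")])
```

The aggregated result is written back into the same table, in line with the idea that an aggregation is treated as another metric and can feed later aggregations.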
Examples of aggregation
[Screenshot with labels: ATLAS_CRITICALWN Site, (expand this column)]
Summary
• Lots of progress towards a unified schema
• Data can be published from different sources
  • Nagios, VO-defined metrics, ACE, (HC, Job Pilots)
• Single schema for storage
• Components talk to each other through APIs
• Getting close to a “proof of concept”
  • Aggregation needs some work
  • Visualization might need adjusting
• Other tasks can go in parallel
  • NoSQL evaluation
  • Nagios configuration
  • Only active metrics