IT Monitoring
IT/CF – IT/OIS – IT/PES
IT Technical Forum, 11th October 2013
CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/it
Outline

Part I: Introduction (Pedro A.)
Part II: Technical Solutions (Massimo P., Benjamin F.)
– Transport
– Long Term Repository
– Analytics
– Visualization
Part III: Experience by Services (Stefano Z., Spyros L.)
– OpenStack Monitoring
– Batch LSF Monitoring
History

Motivation
– Several independent monitoring activities in IT
  • similar overall approach, different tool-chains, similar limitations
– High-level services are interdependent
  • combining data from different groups is necessary, but difficult
– Understanding performance became more important
  • requires more combined data and complex analysis
– Move to a virtualized dynamic infrastructure
  • comes with complex new requirements on monitoring

Challenges
– Find a shared architecture and tool-chain components while preserving our investment in monitoring

(2012 ITTF slide)
Architecture

[Architecture diagram: Producers publish into the Transport layer, which feeds the Long Term Repository and the Analysis engine; these in turn serve Notifications, Visualization, and application Feeds.]
Strategy

Adopt open source tools
– For each architecture block, look outside for solutions
– Large adoption and strong community support
– Fast to adopt, test, and deliver
– Easily replaceable by other (better) future solutions

Integrate with the new CERN infrastructure
– AI project, OpenStack, Puppet, Roger, etc.

Focus on simple adoption (e.g. Puppet modules)
Part II: Technology
Community

The same technologies are being used by different teams
– HDFS: lemon, syslog, openstack, batch, security, castor
– ES: lemon, syslog, openstack, batch
– Kibana: lemon, syslog, openstack, batch

CF (lemon, syslog) is covered here in Part II; the other teams report their experience in Part III.
Part II: Transport
Motivation

Scalable transport needed
– Collect operations data
  • lemon metrics and syslog
  • 3rd party applications
– Easy integration with providers/consumers

Apache Flume
Flume

Distributed service for collecting large amounts of data
– Robust and fault tolerant
– Horizontally scalable
– Many ready-to-be-used input/output plugins
– Java based
– Apache license

Cloudera is the main contributor
– We use their releases
– Less frequent but more stable releases
Data Flow

Flume event
– Byte payload + set of string headers

Flume agent
– JVM process hosting one or more "source -> channel -> sink" flows
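A minimal sketch of such a flow as a Flume properties file; host names, ports, and capacities below are illustrative, not our production values:

# example flume.conf: one syslog source -> memory channel -> avro sink
agent.sources = syslog-in
agent.channels = mem
agent.sinks = avro-out
# receive syslog messages over TCP
agent.sources.syslog-in.type = syslogtcp
agent.sources.syslog-in.host = 0.0.0.0
agent.sources.syslog-in.port = 5140
agent.sources.syslog-in.channels = mem
# buffer events in memory between source and sink
agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000
# forward events to a (hypothetical) gateway agent
agent.sinks.avro-out.type = avro
agent.sinks.avro-out.hostname = flume-gateway.example.ch
agent.sinks.avro-out.port = 4545
agent.sinks.avro-out.channel = mem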
Sources and Sinks

Many ready-to-be-used plugins

Sources
– Avro, Thrift, JMS, Spool dir, Syslog, HTTP, …
– Custom sources can be easily implemented
  • we do have a dirq source for our use case

Interceptors
– Decorate events, filter events
Sources and Sinks

Many ready-to-be-used plugins

Channels
– Memory, File, JDBC
– Custom channels can be easily implemented

Sinks
– Avro, Thrift, ElasticSearch, Hadoop HDFS & HBase, Java Logger, IRC, File, Null (see the example below)
– Custom sinks can be easily implemented
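For illustration, the ElasticSearch and HDFS sinks could be configured along these lines; host names, paths, and channel names are invented, and the channels would be defined elsewhere in the same file:

# attach two sinks to an agent, writing events to ES and HDFS
agent.sinks = es-out hdfs-out
# index events in ElasticSearch (the sink appends the date to the index name)
agent.sinks.es-out.type = elasticsearch
agent.sinks.es-out.hostNames = es-node1.example.ch:9300,es-node2.example.ch:9300
agent.sinks.es-out.clusterName = itmon-es
agent.sinks.es-out.indexName = flume-syslog
agent.sinks.es-out.channel = ch-es
# archive raw events in HDFS
agent.sinks.hdfs-out.type = hdfs
agent.sinks.hdfs-out.hdfs.path = hdfs://namenode.example.ch/monitoring/%Y-%m-%d
agent.sinks.hdfs-out.hdfs.fileType = DataStream
agent.sinks.hdfs-out.channel = ch-hdfs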
Other Features

Fan-in and fan-out
– Enable load balancing

Contextual routing
– Based on logic implemented through selectors (see the sketch below)

Multi-hop flows
– Enable layered topologies
– Increase reliability and failure resistance
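A sketch of contextual routing with Flume's multiplexing channel selector; the header and channel names are invented for illustration:

# route events by the value of their "type" header
agent.sources.src.selector.type = multiplexing
agent.sources.src.selector.header = type
agent.sources.src.selector.mapping.lemon = ch-es
agent.sources.src.selector.mapping.syslog = ch-hdfs
# events with any other type go to the default channel
agent.sources.src.selector.default = ch-hdfs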
Limitations

Routing is static
– On-demand subscriptions are not possible
– Changes require reconfiguration and restart

No authentication/authorization features
– Secure transport is available, though

Java process on the client side
– A smaller memory footprint would be nicer
Our Deployment

[Deployment diagram]
Our Deployment

Producers
– All Puppet nodes
– Lemon, Syslog, 3rd party applications

Gateway routing layer
– 10 VMs behind a DNS load balancer

ElasticSearch sink layer
– 5 VMs behind a DNS load balancer, inserting into ElasticSearch

Hadoop HDFS sink layer
– 5 VMs behind a DNS load balancer, inserting into Hadoop HDFS
Feedback

Tuning is needed to correctly size the Flume layers

The available sources and sinks saved a lot of time
Part II: Long Term Repository
Motivation

Store operations raw data
– Long-term archival required

Allow future data replay to other tools
– Feed a real-time engine

Offline processing of collected data
– Security data? Syslog data?

Apache Hadoop/HDFS
Apache Hadoop

Framework that allows the distributed processing of large data sets across clusters

HDFS is a distributed filesystem designed to run on commodity hardware
– Suitable for applications with large data sets
– Designed for batch processing rather than interactive use
– High throughput preferred to low-latency access
Limitations

Small files are not welcome
– Blocks of 64 MB or 128 MB

Limit of tens of millions of files per cluster
– The NameNode holds the file map in memory

Transparent compression is not available
– Raw text could take much less space

Real-time data access is not possible
Our Usage

Cluster provided by IT/DSS
– ~500 TB, 13 data nodes

Data stored by hostgroup
– 1.8 TB in total since mid-July 2013

Daily jobs aggregate the data by month (a sketch follows)
– HDFS prefers large files to many small files
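One possible shape for such an aggregation, sketched with plain HDFS shell commands; the paths are invented and the actual jobs may differ:

# merge the daily files of one hostgroup/month into a single HDFS file
hadoop fs -cat '/monitoring/lxbatch/2013-09-*/part-*' \
  | hadoop fs -put - /monitoring/lxbatch/2013-09.merged
# remove the small daily files once the merge is verified
hadoop fs -rm -r '/monitoring/lxbatch/2013-09-*'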
Part II: Analytics
Motivation

Real-time queries, clear API

Limited data retention (long-term archival is in HDFS)

Multiple scopes and technologies

Horizontally scalable and easy to deploy

ElasticSearch
ElasticSearch

Distributed RESTful search and analytics engine
ElasticSearch

Real time
– Acquisition: data is indexed in real time
– Analytics: explore and understand your data
ElasticSearch

Schema free
– No prior data declaration required
  • but possible, to optimize
– Data is injected as-is
– Automatic data type discovery

Document oriented (JSON), as in the example below
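For instance, a JSON document can be indexed with no prior mapping: ElasticSearch creates the index and discovers the field types. Index and field names here are invented for illustration:

$ curl -XPOST 'http://localhost:9200/flume-lemon-2013-10-11/metric' -d '{
    "host": "lxplus0001",
    "metric": "LoadAvg",
    "value": 0.42,
    "timestamp": "2013-10-11T10:00:00"
}'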
ElasticSearch

Full-text search
– Apache Lucene is used to provide full-text search
  • see the Apache Lucene documentation
– But not only text
  • integer/long
  • float/double
  • boolean
  • date
  • binary
  • ...
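A full-text query can go directly in the URL using the Lucene query-string syntax (index and field names invented as above):

$ curl -XGET 'http://localhost:9200/flume-syslog-2013-10-11/_search?q=message:kernel&pretty=true'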
ElasticSearch

High availability
– Shards and replicas are auto-balanced

RESTful JSON API

[root@es-search-node ~]$ curl -XGET http://localhost:9200/_cluster/health?pretty=true
{
  "cluster_name" : "itmon-es",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 11,
  "number_of_data_nodes" : 8,
  "active_primary_shards" : 2990,
  "active_shards" : 8970,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}
ElasticSearch

Used by many large companies
– Soundcloud: "to provide immediate and relevant results for their online audio distribution platform reaching 180 million people"
– Github: "20 TB of data using ElasticSearch, including 1.3 billion files and 130 billion lines of code"
– Foursquare, Stackoverflow, Salesforce, ...

Distributed under the Apache license
Limitations

Requires a lot of RAM (Java)
– Especially on data nodes

IO intensive
– Take this into account when planning a deployment

Shard re-initialisation takes some time (~1h)
– Lots of shards and replicas per index, lots of indexes
– Not a frequent operation: only after a full cluster reboot

Authentication is not built-in ("bricolage" needed)
– Apache + Shibboleth on top of the Jetty plugin
Our Deployment

Fully puppetized

Production cluster
– 2 master nodes (no data)
  • 16 GB RAM, 8 CPU cores
– 1 search node (no data)
  • 16 GB RAM, 8 CPU cores
– 8 data nodes
  • 48 GB RAM, 24 CPU cores
  • 500 GB SSD

Development cluster
– Based on medium and large VMs
Our Deployment

Security: Jetty plugin
– Access control, SSL (also request logging, gzip)

Monitoring: many plugins
– ElasticHQ, BigDesk, Head, Paramedic, ...
Our Deployment

1 index per day
– flume-lemon-YYYY-MM-DD
– flume-syslog-YYYY-MM-DD

90 days TTL

10 shards per index

2 replicas per shard

(An index template enforcing these settings is sketched below.)
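Such settings can be enforced with an index template that is applied automatically to each new daily index. A sketch, with the template name assumed (the _ttl field existed in the ElasticSearch versions of this era):

$ curl -XPUT 'http://localhost:9200/_template/flume-lemon' -d '{
    "template": "flume-lemon-*",
    "settings": {
      "number_of_shards": 10,
      "number_of_replicas": 2
    },
    "mappings": {
      "_default_": {
        "_ttl": { "enabled": true, "default": "90d" }
      }
    }
}'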
Demo

Production cluster
– ElasticHQ
– HEAD
Feedback

Easy to deploy and manage

Robust and fast, with a rich API

Easy query language (DSL), as in the example below
– More features coming with the aggregation framework
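A small query DSL sketch (index and field names invented): all events from one host over the last hour.

$ curl -XGET 'http://localhost:9200/flume-lemon-2013-10-11/_search?pretty=true' -d '{
    "query": {
      "bool": {
        "must": [
          { "match": { "host": "lxplus0001" } },
          { "range": { "timestamp": { "gte": "now-1h" } } }
        ]
      }
    }
}'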
Part II: Visualisation
Motivation

Dedicated, dynamic, and user-friendly dashboards

Horizontally scalable and easy to deploy

Kibana
Kibana

Visualize time-stamped data from ElasticSearch
Kibana

"Make sense of a mountain of logs"
– Designed to analyze logs
– Perfectly fits timestamped data (e.g. metrics)

Profits from the power of ElasticSearch
– Exploits its search and analytics features
Kibana

No code required
– Simply point & click to build your own dashboard
Kibana

Open source, community driven
– Now fully integrated with and supported by ElasticSearch
– We have contributed code and features
Kibana

Built with AngularJS
– JavaScript MVC framework for rich client-side applications
– Developed and maintained by Google
– No backend: the web server delivers only static files
  • the JavaScript directly queries ElasticSearch
Kibana

Easy to install
– "git clone" OR "tar -xvzf" OR ElasticSearch plugin

Easy to configure
– A 1-line config file points to the ElasticSearch cluster
– Saves its own configuration in ElasticSearch itself
  • possible to export/import dashboard configurations

(The "git clone" route is sketched below.)
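A sketch of the "git clone" install route; the host name in the config value is an assumption:

$ git clone https://github.com/elasticsearch/kibana.git
$ cd kibana
# the one relevant line in config.js points at the cluster, e.g.
#   elasticsearch: "http://es-search.example.ch:9200",
$ $EDITOR config.js
# then serve the directory with any static web server; no backend is needed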
Our Deployment

Based on the ElasticSearch plugin
– To profit from the Jetty authentication
– Deployed together with the search node

Public (read-only) and private (read-write) endpoints
Demo

Production dashboards
– Syslog
– Lemon
– PDUs
Feedback

Easy to deploy and use

Cool user interface

Fits many use cases
– Text (syslog), metrics (lemon)

Still a limited feature set
– Under active development

Very active and growing community
Part III: OpenStack Monitoring
Experience with OpenStack
Operating Systems & Information Services (IT/OIS)
Cloud Infrastructure

CERN Cloud Infrastructure:
– Based on OpenStack
– Consists of several services (Keystone, Nova, Glance, Cinder, …)

Production deployment:
– 664 nodes
– 20224 cores
– ~1800 VMs
– 100 new servers per week
– 15000 nodes expected by end-2015
Cloud Infrastructure

Main requirements:
– Have a centralized copy of our logs to ease problem investigation
– Display OpenStack usage statistics
– Display the results of OpenStack functional tests
– Maintain a long-term history of the status of our infrastructure
– Monitor the usage and status of our databases
Architecture

[Architecture diagram]
Architecture

Elasticsearch configuration:
– 14-node cluster
  • 11 data nodes
  • 3 HTTP nodes
– Fully puppetized
– Fully running on virtual machines
– Different indices based on the source
– New index every day
  • 5 shards per index
  • 2 replicas per shard (1 primary + 2 replicas)
  • custom index templates automatically applied at creation time
– Elasticsearch nodes accept external connections only from Flume gateways and Kibana nodes
Architecture

Kibana configuration:
– 3 Kibana nodes
– Shibboleth authentication
– Reverse proxy for Kibana search queries
Example

OpenStack Nova API Dashboard:
– Based on logs
– Shows various information about the service status
Example

[Dashboard screenshot]
Conclusions

Overall a good experience:
– We have a dashboard for every service (~25)
– We have a long-term history of the status of our infrastructure
– In case of disaster we can recover all the data from HDFS

Some drawbacks:
– It took some effort to configure and fine-tune Flume and Elasticsearch on virtual machines
– Kibana needs improvements
Part III: Batch LSF Monitoring

Batch 2nd line support tool
Platform & Engineering Services (IT-PES-PS)
Technical student: Spyros Lalos – Supervisor: Jérôme Belleman
CERN Batch Service

• 4000 servers with over 50000 CPU cores
• 400000 jobs/day
• Running LSF
• Fairshare scheduling
• High query load on the system

We use various monitoring tools:
• SLS (Service Level Status)
• Lemon pages
• OpenTSDB
Batch monitoring

OpenTSDB
• Metric: batchjobs.running, tags: host=batchmon10
• Tags in the Java API are manipulated as hash maps
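For illustration, a data point for the metric above can be pushed over OpenTSDB's telnet-style interface; the host name and value here are invented:

$ echo "put batchjobs.running $(date +%s) 4200 host=batchmon10" | nc opentsdb.example.ch 4242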
GOAL:

• Tool enabling 2nd line support for end users
• More specific/relevant information displayed with Kibana
• Kibana dashboards opened to the Helpdesk and users
• No need for engineer-level personnel to reply to requests
Project Dataflow

[Dataflow diagram]
ElasticSearch/Kibana

• 1 index per day: test-YYYY-MM-DD
• 5 shards per index
• 1 replica per shard
Extended capabilities:

1. AuthN and AuthZ
2. URL configuration
• Query set via URL parameters (see the example below)
• Web frontpage for users, with access to custom dashboards
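A hypothetical example of such a URL, opening a scripted Kibana dashboard pre-filtered on one user (host, dashboard script, and field names are all assumptions):

https://batch-kibana.example.ch/#/dashboard/script/batch.js?query=user:jdoe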
Kibana

[Kibana dashboard screenshots]