
Page 1: Outline

CERN IT Department

CH-1211 Geneva 23

Switzerland

www.cern.ch/it

CF Computing Facilities

IT Monitoring

IT/CF

[email protected]

IT/OIS – IT/PES

IT Technical Forum

11th October 2013

Page 2: Outline

Part I: Introduction (Pedro A.)

Part II: Technical Solutions (Massimo P., Benjamin F.)
– Transport
– Long Term Repository
– Analytics
– Visualization

Part III: Experience by Services (Stefano Z., Spyros L.)
– OpenStack Monitoring
– Batch LSF Monitoring

Outline

Page 3: Outline

Motivation
– Several independent monitoring activities in IT
  • similar overall approach, different tool-chains, similar limitations
– High level services are interdependent
  • combination of data from different groups necessary, but difficult
– Understanding performance became more important
  • requires more combined data and complex analysis
– Move to a virtualized dynamic infrastructure
  • comes with complex new requirements on monitoring

Challenges
– Find a shared architecture and tool-chain components while preserving our investment in monitoring

History

2012 ITTF slide

Page 4: Outline

Architecture

[Architecture diagram: Producers publish into a shared Transport layer; feeds deliver the data to the Analysis engine and the Long Term Repository, and drive Notifications and Visualization consumed by applications.]

Page 5: Outline

Adopt open source tools
– For each architecture block look outside for solutions
– Large adoption and strong community support
– Fast to adopt, test, and deliver
– Easily replaceable by other (better) future solutions

Integrate with new CERN infrastructure
– AI project, OpenStack, Puppet, Roger, etc.

Focus on simple adoption (e.g. puppet modules)

Strategy

Page 6: Outline


Technology

Part II

Page 7: Outline

Community

Same technologies being used by different teams
– HDFS: lemon, syslog, openstack, batch, security, castor
– ES: lemon, syslog, openstack, batch
– Kibana: lemon, syslog, openstack, batch

CF (lemon, syslog)

Part III

Page 8: Outline

Part II Transport

Page 9: Outline

Scalable transport needed
– Collect operations data
  • lemon metrics and syslog
  • 3rd party applications

Easy integration with providers/consumers

Apache Flume

Motivation

Page 10: Outline

Distributed service for collecting large amounts of data
– Robust and fault tolerant
– Horizontally scalable
– Many ready-to-use input/output plugins
– Java based
– Apache license

Cloudera is the main contributor
– Using their releases
– Less frequent but more stable releases

Flume

Page 11: Outline

Flume event
– Byte payload + set of string headers

Flume agent
– JVM process hosting “source -> sink” flow(s)

Data Flow

Page 12: Outline

Many ready-to-use plugins

Sources
– Avro, Thrift, JMS, Spool dir, Syslog, HTTP, …
– Custom sources can be easily implemented
  • we do have a dirq source for our use case

Interceptors
– Decorate events, filter events

Sources and Sinks

Page 13: Outline

Many ready-to-use plugins

Channels
– Memory, File, JDBC
– Custom channels can be easily implemented

Sinks
– Avro, Thrift, ElasticSearch, Hadoop HDFS & HBase, Java Logger, IRC, File, Null
– Custom sinks can be easily implemented

Sources and Sinks
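To make the source, channel and sink pieces concrete, here is a minimal, illustrative agent configuration (a sketch: the port, directories and HDFS path are assumptions, not our production settings) wiring a syslog source through a file channel into an HDFS sink, with a timestamp interceptor attached to the source:

# flume.conf -- illustrative single-agent "source -> channel -> sink" flow
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: syslog messages arriving on UDP port 5140 (port is an assumption)
a1.sources.r1.type = syslogudp
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
# Interceptor: stamp each event so the sink can bucket it by date
a1.sources.r1.interceptors = ts
a1.sources.r1.interceptors.ts.type = timestamp

# Channel: file-backed buffer so events survive an agent restart
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/lib/flume/checkpoint
a1.channels.c1.dataDirs = /var/lib/flume/data

# Sink: write events to HDFS, one directory per day (path is hypothetical)
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode.example.ch/monitoring/syslog/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream

Such an agent would be started with something like: flume-ng agent --conf-file flume.conf --name a1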

Page 14: Outline

Fan-in and fan-out
– Enable load balancing

Contextual routing
– Based on logic implemented through selectors (see the sketch below)

Multi-hop flows
– Enable layered topologies
– Increase reliability and failure resistance

Other Features
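As a sketch of contextual routing, a multiplexing channel selector dispatches each event to a channel based on a header value; the header name and channel names below are assumptions for illustration, not our configuration:

# Illustrative contextual routing on source r1 (names are hypothetical)
a1.channels = lemon_ch syslog_ch other_ch
a1.sources.r1.channels = lemon_ch syslog_ch other_ch
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = producer
a1.sources.r1.selector.mapping.lemon  = lemon_ch
a1.sources.r1.selector.mapping.syslog = syslog_ch
a1.sources.r1.selector.default = other_ch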

Page 15: Outline

Routing is static
– On-demand subscriptions are not possible
– Requires reconfiguration and restart

No authentication/authorization features
– Secure transport available

Java process on client side
– A smaller memory footprint would be nicer

Limitations

Page 16: Outline


Our Deployment

Page 17: Outline

Producers
– All Puppet nodes
– Lemon, Syslog, 3rd party applications

Gateway routing layer
– 10 VMs behind a DNS load balancer

Elasticsearch sink
– 5 VMs behind a DNS load balancer
– Inserting into ElasticSearch

Hadoop HDFS sink
– 5 VMs behind a DNS load balancer
– Inserting into Hadoop HDFS

Our Deployment

Page 18: Outline

Needs tuning to correctly size the Flume layers

Available sources and sinks saved a lot of time

Feedback

Page 19: Outline

Part II Long Term Repository

Page 20: Outline

Store raw operations data
– Long term archival required

Allow future data replay to other tools
– Feed the real-time engine

Offline processing of collected data
– Security data? Syslog data?

Apache Hadoop/HDFS

Motivation

Page 21: Outline


Framework that allows the distributed processing of large data sets across clusters

HDFS is a distributed filesystem designed to run on commodity hardware

– Suitable for applications with large data sets
– Designed for batch processing rather than interactive use
– High throughput preferred to low-latency access

Apache Hadoop

Page 22: Outline

Small files not welcome
– Blocks of 64 MB or 128 MB

Limit of tens of millions of files per cluster
– The namenode holds the file map in memory

Transparent compression not available
– Raw text could take much less space

Real-time data access is not possible

Limitations

Page 23: Outline

Cluster provided by IT/DSS
– ~500 TB, 13 data nodes

Data stored by hostgroup
– 1.8 TB in total since mid-July 2013

Daily jobs to aggregate data by month (see the sketch below)
– HDFS prefers large files to many small files

Our Usage
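A minimal sketch of such a daily aggregation, assuming Hadoop 2 style fs commands and hypothetical paths (this is not the actual CERN job), merges one day's small files into a single file under a monthly archive directory:

#!/bin/bash
# Merge one day's many small HDFS files into a single file under a monthly
# archive directory, then drop the originals. Paths and names are made up.
DAY=2013-07-15
MONTH=2013-07
SRC=/monitoring/syslog/${DAY}        # many small files written by Flume
DST=/monitoring/archive/${MONTH}     # few large files, friendlier to HDFS

hadoop fs -getmerge ${SRC} /tmp/${DAY}.merged   # concatenate the day's files locally
hadoop fs -mkdir -p ${DST}
hadoop fs -put /tmp/${DAY}.merged ${DST}/${DAY}.log
hadoop fs -rm -r ${SRC}                         # remove the small files once archived
rm /tmp/${DAY}.merged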


Page 24: Outline

Part II Analytics

Page 25: Outline


Real-time queries, clear API

Limited data retention

Technologies for multiple scopes

Horizontally scalable and easy to deploy

ElasticSearch

Motivation


Page 26: Outline


Distributed RESTful search and analytics engine

ElasticSearch


Page 27: Outline

Real time
– Acquisition: data is indexed in real time
– Analytics: explore, understand your data

ElasticSearch

Page 28: Outline

Schema free
– No prior data declaration required
  • but possible, to optimize
– Data is injected as-is
– Automatic data type discovery

Document oriented (JSON) – see the indexing sketch below

ElasticSearch
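As a minimal illustration of the schema-free, document-oriented model (the document type and fields are assumptions, not our production mapping), a JSON document can be indexed as-is and searched back immediately:

# Index a JSON document without declaring a schema first;
# ElasticSearch discovers the field types automatically
curl -XPOST 'http://localhost:9200/flume-lemon-2013-10-11/metric' -d '{
  "host": "lxbatch0001.cern.ch",
  "metric": "cpu_load",
  "value": 3.2,
  "timestamp": "2013-10-11T10:15:00Z"
}'

# Retrieve it through the same RESTful JSON API
curl -XGET 'http://localhost:9200/flume-lemon-2013-10-11/metric/_search?q=metric:cpu_load&pretty=true'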

Page 29: Outline

Full text search
– Apache Lucene is used to provide full text search
  • see the Apache Lucene documentation
– But not only text
  • integer/long
  • float/double
  • boolean
  • date
  • binary
  • ...

ElasticSearch

Page 30: Outline

High availability
– Shards and replicas auto-balanced

RESTful JSON API

ElasticSearch

[root@es-search-node ~]$ curl -XGET http://localhost:9200/_cluster/health?pretty=true
{
  "cluster_name" : "itmon-es",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 11,
  "number_of_data_nodes" : 8,
  "active_primary_shards" : 2990,
  "active_shards" : 8970,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}

Page 31: Outline

Used by many large companies
– SoundCloud
  “To provide immediate and relevant results for their online audio distribution platform reaching 180 million people”
– GitHub
  “20 TB of data using ElasticSearch, including 1.3 billion files and 130 billion lines of code”

– Foursquare, Stackoverflow, Salesforce, ...

Distributed under Apache license

ElasticSearch

Page 32: Outline

Requires a lot of RAM (Java)
– Especially on data nodes

IO intensive
– Take into account when planning the deployment

Shard re-initialisation takes some time (~1h)
– Lots of shards and replicas per index, lots of indexes
– Not a frequent operation, only after a full cluster reboot

Authentication not built-in (“bricolage”)
– Apache + Shibboleth on top of the Jetty plugin

Limitations

Page 33: Outline

Our Deployment

Fully puppetized

Production cluster
– 2 master nodes (no data)
  • 16 GB RAM, 8 CPU cores
– 1 search node (no data)
  • 16 GB RAM, 8 CPU cores
– 8 data nodes
  • 48 GB RAM, 24 CPU cores
  • 500 GB SSD

Development cluster
– Based on medium and large VMs

Page 34: Outline

Our Deployment

Security: Jetty plugin
– Access control, SSL (also request logging, Gzip)

Monitoring: many plugins
– ElasticHQ, BigDesk, Head, Paramedic, ...

Page 35: Outline

1 index per day
– flume-lemon-YYYY-MM-DD
– flume-syslog-YYYY-MM-DD

90 days TTL

10 shards per index

2 replicas per shard

Our Deployment
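Settings like these could be captured in an index template so that every new daily index picks them up automatically; the sketch below is illustrative and follows the ElasticSearch 0.90-era template and TTL mapping API, assumed here rather than taken from the slides:

# Illustrative index template: 10 shards, 2 replicas and a 90-day TTL
# applied to every index matching flume-lemon-*
curl -XPUT 'http://localhost:9200/_template/flume_lemon' -d '{
  "template": "flume-lemon-*",
  "settings": {
    "number_of_shards": 10,
    "number_of_replicas": 2
  },
  "mappings": {
    "_default_": {
      "_ttl": { "enabled": true, "default": "90d" }
    }
  }
}'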

Page 36: Outline

Production Cluster
– ElasticHQ
– HEAD

Demo

Page 37: Outline

Easy to deploy and manage

Robust, fast, and rich API

Easy query language (DSL) – see the query sketch below
– More features coming with the aggregation framework

Feedback
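For illustration, a query DSL request against one of the daily syslog indices could look like the following; the field names and the time filter are assumptions, not our actual mappings:

# Search a daily syslog index for error messages from one host in the last hour
curl -XGET 'http://localhost:9200/flume-syslog-2013-10-11/_search?pretty=true' -d '{
  "query": {
    "filtered": {
      "query":  { "query_string": { "query": "message:error AND host:batchmon10" } },
      "filter": { "range": { "timestamp": { "gte": "now-1h" } } }
    }
  },
  "size": 20
}'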

Page 38: Outline

Part II Visualisation

Page 39: Outline


Dedicated, dynamic and user-friendly dashboards

Horizontally scalable and easy to deploy

Kibana

Motivation

Page 40: Outline


Visualize time-stamped data from ElasticSearch

Kibana

Page 41: Outline

“Make sense of a mountain of logs”
– Designed to analyze logs
– Perfectly fits timestamped data (e.g. metrics)

Profit from ElasticSearch’s power
– Search/analyze features exploited

Kibana

Page 42: Outline

No code required
– Simply point & click to build your own dashboard

Kibana

Page 43: Outline

Open source, community driven
– Now fully integrated and supported by ElasticSearch
– We provided code/feature contributions

Kibana

Page 44: Outline

Built with AngularJS
– JavaScript MVC framework for rich client-side applications
– Developed and maintained by Google
– No backend: the web server delivers only static files
  • JS directly queries ElasticSearch

Kibana

Page 45: Outline

Easy to install
– “git clone” OR “tar -xvzf” OR ElasticSearch plugin

Easy to configure
– 1-line config file to point to the ElasticSearch cluster
– Saves its own configuration in ElasticSearch itself
  • Possible to export/import dashboard configurations (see the sketch below)

Kibana
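For example, since Kibana keeps its dashboards as documents in ElasticSearch (by default in a kibana-int index), a saved dashboard can be pulled out with a plain curl call; the dashboard name below is only illustrative:

# A saved dashboard is an ordinary ES document; fetch it to back it up,
# version it, or load it into another Kibana instance
curl -XGET 'http://localhost:9200/kibana-int/dashboard/Syslog?pretty=true' > syslog-dashboard.json

The 1-line cluster setting itself is the elasticsearch URL in Kibana's config.js.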

Page 46: Outline

Based on the ElasticSearch plugin
– To profit from Jetty authentication
– Deployed together with the search node

Public (read only) and private (read write) endpoints

Our Deployment

Page 47: Outline

Production Dashboards
– Syslog
– Lemon
– PDUs

Demo

Page 48: Outline

Easy to deploy and use

Cool user interface

Fits many use cases
– Text (syslog), metrics (lemon)

Still a limited feature set
– Under active development

Very active and growing community

Feedback

Page 49: Outline

Part III OpenStack Monitoring

Page 50: Outline

Operating Systems & Information Services (OIS)

Experience with OpenStack

Page 51: Outline

Cloud Infrastructure

• CERN Cloud Infrastructure:
– Based on OpenStack
– Consists of several services (Keystone, Nova, Glance, Cinder, …)

• Production deployment:
– 664 nodes
– 20224 cores
– ~1800 VMs
– 100 new servers per week
– 15000 nodes by end-2015

Page 52: Outline

Cloud Infrastructure

Main requirements:
– Have a centralized copy of our logs to ease problem investigation
– Display OpenStack usage statistics
– Display the results of OpenStack functional tests
– Maintain a long term history of the status of our infrastructure
– Monitor the usage and status of our databases

Page 53: Outline


Architecture

Page 54: Outline

Architecture

Elasticsearch configuration:
– 14-node cluster
  • 11 data nodes
  • 3 http nodes
– Fully puppetized
– Fully running on virtual machines
– Different indices based on the source
– New index every day
  • 5 shards per index
  • 2 replicas per shard (1 master + 2 replicas)
  • Custom index templates automatically applied at creation time
– Elasticsearch nodes accept external connections only from Flume gateways and Kibana nodes

Page 55: Outline

Architecture

Kibana configuration:
– 3 Kibana nodes
– Shibboleth authentication
– Reverse proxy for Kibana search queries

Page 56: Outline

Example

OpenStack Nova API Dashboard:
– Based on logs
– Various information about the service status

Page 57: Outline


Example

Page 58: Outline

Conclusions

Overall good experience:
– We have a dashboard for every service (~25)
– We have a long term history of the status of our infrastructure
– In case of disaster we can recover all the data from HDFS

Some drawbacks:
– It took some effort to configure and fine-tune Flume and Elasticsearch on virtual machines
– Kibana needs improvements

Page 59: Outline

Part III Batch LSF Monitoring

Page 60: Outline

Platform & Engineering Services (PES)

Batch 2nd line support tool

IT-PES-PS

[email protected]

Technical student: Spyros Lalos; Supervisor: Jérôme Belleman

Page 61: Outline

CERN Batch Service

• 4000 servers with over 50000 CPU cores
• 400000 jobs/day
• Running LSF
• Fairshare scheduling
• High query load on the system

We use various monitoring tools:
• SLS (Service Level Status)
• Lemon pages
• OpenTSDB

Page 62: Outline

OpenTSDB

Metric: Batchjobs.running, tags: host=batchmon10

Tags in the Java API are manipulated as hashmaps (see the sketch below)
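For illustration, an OpenTSDB data point combines exactly these pieces (metric name, timestamp, value and a tag map); the value and timestamp below are made up, and the line uses the standard telnet-style ingestion format rather than our actual collector:

# put <metric> <unix-timestamp> <value> <tag>=<value> ...
put Batchjobs.running 1381485600 1234 host=batchmon10

In the Java API the same host=batchmon10 tag would be passed as an entry in a HashMap of strings alongside the metric name, timestamp and value.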

Page 63: Outline

Batch monitoring

GOAL:

• Tool enabling 2nd line support for end users
• More specific/relevant information displayed with Kibana
• Kibana dashboards opened to Helpdesk and users
• No need for engineer-level personnel replying to requests

Page 64: Outline

Project Dataflow

Page 65: Outline

ElasticSearch/Kibana

• 1 index per day: test-YYYY-MM-DD
• 5 shards per index
• 1 replica per shard

Extended capabilities:

1. AuthN and AuthZ
2. URL configuration
  • Query set via URL parameters
  • Web frontpage for users – access to custom dashboards

Page 66: Outline

Kibana

Page 67: Outline

Kibana

Page 68: Outline

Questions??

[email protected]

http://cern.ch/itmon