JAN MUSSLER

jan.mussler@zalando.de

Twitter: @JanMussler

zmon.io

#NETWAYS #OSMC 30-11-2016

ZMON: Open Source Monitoring in the Cloud

● 15 countries
● 19+ million active customers
● 160+ million visits per month
● 200k+ articles
● 3.0+ billion € revenue
● ~1,600 employees in tech

Europe's Leading Online Fashion Platform

Visit us: tech.zalando.com

Zalando’s Technology History

RADICAL AGILITY

AUTONOMY

➊ One AWS account per Team

➋ Deployment with Docker

➌ Managed SSH Access

➍ REST/OAuth 2.0 mandatory

➎ Traceability of changes

IN A NUTSHELL

STUPS

ISOLATED AWS ACCOUNTS

[Diagram: the Internet reaches *.abc.example.org (Team ABC) and *.xyz.example.org (Team XYZ); each team has its own AWS account containing ELBs and EC2 instances.]

RESPONSIBILITY

OWNERSHIP

Monitoring the old way?

[Diagram: Services 1 to 4 spread across several hosts and belonging to different teams, with a single central "Monitoring Team?" responsible for all of them.]

Build with teams and services in mind ...

[Diagram: the same hosts and Services 1 to 4, now grouped under Team 1, Team 2, and Team 3.]

ZMON.io

Flexible and extendable: Checks & Alerts in Python

Integrate: REST APIs, OAuth 2.0, Auto Discovery

Configurable via UI / API: no restarts required!

Great for teams: autonomy and responsibility

Fast/Scaling metrics: Redis, KairosDB + Grafana 3

ZMON - Highlights ;-)

Good old green and red boxes?

Full authentication for all endpoints

OAuth 2.0 login flow (e.g. GitHub login)

“TV Tokens” for “read-only” dashboard login

Grafana 3 bundled and API implemented

Proxy for KairosDB (timeseries db)

ZMON Controller - User Interface and REST API

Display historic data using Grafana 3

Various options for notifications ...

E-Mail

Twilio (phone call)

PUSH

ENTITIES

● hosts, databases, applications, instances ...
● generic key value object
● 20000+ entities in our deployment

Entities

{
  "id": "node01:8080",
  "type": "instance",
  "host": "node01",
  "ports": {"8080": 8080, "8181": 8181},
  "application_id": "zmon",
  "application_version": "0.1.0",
  "dc": "dc1"
}

Entity "node01:8080"

Entity Service (part of controller)

id: localhost:5432
type: postgres
host: localhost
port: 5432
shards:
  local_zmon_db: "localhost:5432/local_zmon_db"

local-postgres.yaml

Integrated easy-to-use entity store with REST API

Build your own discovery agent (K8S, …)

> zmon entities push local-postgres.yaml
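A custom discovery agent mostly has to emit entities in the generic key-value shape shown for "node01:8080". The following is a minimal sketch of that step only. The helper name `make_instance_entity` is hypothetical and not part of ZMON, and the choice of the lowest port for the id is an assumption taken from the single example above.

```python
import json


def make_instance_entity(host, ports, application_id, application_version, dc):
    """Build an entity dict shaped like the "node01:8080" example.

    Hypothetical helper for illustration; it is not part of ZMON.
    """
    # Assumption: the id is "<host>:<lowest port>", as in the example entity.
    first_port = min(ports)
    return {
        "id": "{}:{}".format(host, first_port),
        "type": "instance",
        "host": host,
        # The example entity maps each port (as a string key) to its number.
        "ports": {str(p): p for p in sorted(ports)},
        "application_id": application_id,
        "application_version": application_version,
        "dc": dc,
    }


entity = make_instance_entity("node01", [8080, 8181], "zmon", "0.1.0", "dc1")
print(json.dumps(entity, indent=2))
```

An agent built this way would presumably write such documents to a file and hand them to the entity store in the same way as local-postgres.yaml.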

CHECKS

● select subset of entities
● executes Python expression
  ○ powerful using eval with custom context
  ○ Builtins: HTTP, PostgreSQL, MySQL, CloudWatch, Redis, SNMP/NRPE, tcp, Scalyr, ElasticSearch, …
  ○ Data filtering/formatting/pivoting
● returns "value" object -> dicts everywhere

Checks
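The two ideas on this slide, selecting a subset of entities and evaluating a Python expression with a custom context, can be sketched in a few lines. This is an illustration under stated assumptions, not ZMON's worker code: `matches`, `run_check`, and `fake_http` are hypothetical names, and `fake_http` is a canned stand-in rather than the real HTTP built-in.

```python
def matches(entity, entity_filter):
    """True if every key/value pair of the filter is present in the entity."""
    return all(entity.get(k) == v for k, v in entity_filter.items())


def run_check(command, entity, builtins):
    """Evaluate a check command for one entity using a custom eval context."""
    # Only the names placed in this scope are visible to the expression.
    scope = {"__builtins__": {}}
    scope.update(builtins)
    scope["entity"] = entity
    return eval(command, scope)


def fake_http(entity):
    """Stand-in for a built-in; returns canned data instead of calling out."""
    return {"status": 200, "host": entity["host"]}


entities = [
    {"id": "node01:8080", "type": "instance", "host": "node01"},
    {"id": "localhost:5432", "type": "postgres", "host": "localhost"},
]

# Hypothetical check: runs only on entities of type "instance".
check = {"entities": [{"type": "instance"}], "command": "http(entity)"}

results = {
    e["id"]: run_check(check["command"], e, {"http": fake_http})
    for e in entities
    if any(matches(e, f) for f in check["entities"])
}
print(results)
```

Each result is a plain dict keyed by entity id, in line with the "dicts everywhere" remark.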

SNMP and Nagios NRPE support

REST API to update or use web front end

zmon check-definitions update select-1-check.yaml

Managing checks

name: "Select 1"
owning_team: "Team ZMON"
command: |
  sql().execute("select 1 as a").results()
entities:
  - type: postgresql
interval: 15
description: "Test connection to PostgreSQL"

select-1-check.yaml

Trial Run - Quick feedback and easier development

ALERTS

● Attached to a single check, inspect check result

● Defines team and responsible team

● Allows inheritance from other alert

● Evaluates Python expression yielding True/False

● No "WARNING" state, no "UNKNOWN" state

● Priorities (color) and tags

Alerts
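An alert boils down to a Python expression over a check result that yields True or False, optionally inheriting fields from another alert. The sketch below assumes the check result is exposed under the name `value` and uses hypothetical field names (`condition`, `priority`); `evaluate_alert` and `inherit` are illustrative helpers, not ZMON functions.

```python
def evaluate_alert(condition, value):
    """Return True (alert active) or False; there is no third state."""
    scope = {"__builtins__": {}, "value": value}
    return bool(eval(condition, scope))


def inherit(parent, overrides):
    """Copy the parent alert and replace only the overridden fields."""
    child = dict(parent)
    child.update(overrides)
    return child


# Check result in the "dicts everywhere" style (numbers are illustrative).
value = {"count": 10, "median": 1161}

parent = {
    "name": "Slow responses",
    "condition": "value['median'] > 1000",
    "priority": 2,
}
# A team customizes the threshold and keeps everything else.
child = inherit(parent, {"condition": "value['median'] > 2000"})

print(evaluate_alert(parent["condition"], value))
print(evaluate_alert(child["condition"], value))
```

Because the result is forced to a boolean, the sketch has no room for a "WARNING" or "UNKNOWN" state, which mirrors the bullet above.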

Downtimes

● Set or schedule downtimes using the UI

● Use API to automate downtimes, e.g. in deployment tool

Reuse existing checks for core infrastructure

Anyone can add alerts to checks

Monitor application boundaries/dependencies

Make use of inheritance to customize

Sharing and reuse of alerts and checks

EXAMPLE

Plan B Deployment - Multi Region Setup (JWT issue/verification)

[Diagram: two regions, each with an ELB in front of Provider (Java), an ELB in front of Tokeninfo (Go), and Cassandra (C*) nodes.]

Will create “entities” to describe deployment

ELBs, ASGs, Application, instances,...

Crawls AWS API every 60 sec to update

ZMON AWS Agent - Auto Discovery

➜ ~ zmon entities get "planb-tokeninfo-cd44-oFM6x[aws:999:eu-west-1]"
id: planb-tokeninfo-cd44-oFM6x[aws:999:eu-west-1]
type: instance
application_id: planb-tokeninfo
host: 172.31.169.6
infrastructure_account: aws:999
instance_type: c4.xlarge
ip: 172.31.169.6
ports: { '9020': 9020, '9021': 9021 }
region: eu-west-1
source: registry.opensource.zalan.do/stups/planb-tokeninfo:cd44
stack_name: planb-tokeninfo-eu-west-1
stack_version: cd44

Example Instance Entity

➜ ~ zmon entities get "elb-data-service-cd79c9[aws:...:eu-central-1]"
id: elb-data-service-cd79c9[aws:...:eu-central-1]
type: elb
name: data-service-cd79c9
active_members: 5
cloudwatch_name: app/data-service-cd79c9/18b164bfa427486d
dns_name: data-service-cd79c9-961635181.eu-central-1.elb.amazonaws.com
dns_traffic: 'true'
dns_weight: 200
elb_type: application
members: 5
region: eu-central-1
scheme: internet-facing

Example ELB Entity

Instance Metrics
● Memory usage
● Disk space usage
● CPU usage
● Application logs
● Application metrics

Monitoring Plan-B EC2 instances on AWS

Scalyr Agent: log shipping

Prometheus Node Agent: :9100/metrics

Taupage AMI (Ubuntu base)

Application Container (Go / Spring Boot / Cassandra) on Docker runtime: :8080 -> app, :7979 -> metrics

Jolokia Request Example

Check Results

Check result - Grafana 3 link

AWS UI deep link

Monitor your deployments … data tagged with version

Annotated Metric Data in Grafana

HTTP requests reading JSON application metrics

Read JMX data via Jolokia/HTTP for Cassandra

Read Prometheus Node data for EC2 metrics

CloudWatch() queries for ELB metrics

Scalyr API queries for application logs

Check commands used so far

DEPLOYMENT

ZMON Core + UI + KairosDB

[Architecture diagram. Components shown:]
● Scheduler (JVM)
● Redis: queue/state
● Workers (Python)
● Controller (Java)
● PostgreSQL: check/alert definitions, entity data
● KairosDB (Java)
● Cassandra
● Frontend (AngularJS)
● CLI (Python)
● Metric Cache

● Scheduler supports queue filters by entity
  ○ e.g. {"dc":"dc1"} vs {"dc":"dc2"} queue filters
● Scheduler can apply base filter
  ○ only handles entities with {"dc":"dc1"}
● Worker can report home using:
  ○ Redis (we use this across DCs)
  ○ HTTPS (AWS->DC)

Multi DC / Zone deployment possible
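The difference between a base filter and queue filters can be illustrated with a small routing function. This is a sketch of the idea only, not the scheduler's implementation: the function `route` and the queue names are hypothetical, and the rule that the first matching queue filter wins is an assumption.

```python
def matches(entity, entity_filter):
    """True if every key/value pair of the filter is present in the entity."""
    return all(entity.get(k) == v for k, v in entity_filter.items())


def route(entities, base_filter, queue_filters, default_queue="default"):
    """Map queue name -> entity ids for entities that pass the base filter."""
    routed = {}
    for entity in entities:
        if not matches(entity, base_filter):
            continue  # this scheduler does not handle the entity at all
        queue = default_queue
        for name, entity_filter in queue_filters.items():
            if matches(entity, entity_filter):
                queue = name  # assumption: first matching filter wins
                break
        routed.setdefault(queue, []).append(entity["id"])
    return routed


entities = [
    {"id": "node01:8080", "dc": "dc1"},
    {"id": "node02:8080", "dc": "dc2"},
]

# One scheduler feeding a separate queue per data center.
by_queue = route(entities, {}, {"queue-dc1": {"dc": "dc1"}, "queue-dc2": {"dc": "dc2"}})

# A scheduler restricted by a base filter to entities in dc1.
dc1_only = route(entities, {"dc": "dc1"}, {})

print(by_queue)
print(dc1_only)
```

In the first call, workers in each data center would consume only their own queue; in the second, entities outside dc1 are never scheduled by this instance.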

ZMON in AWS / Multi DC Setup

[Diagram: Team "Foo" (*.foo.example.org) and Team "Bar" (*.bar.example.org) each run EC2 instances together with a ZMON Appliance; the diagram also shows the ZMON Data Service and ELBs.]

MICROSERVICES

Application metrics

Continued ...

Spring Boot (extending metrics): https://github.com/zalando/zmon-actuator

Python (Swagger first on Flask): https://github.com/zalando/connexion

Clojure (Swagger first): https://github.com/zalando-stups/friboo/

Scala Play: https://github.com/zalando-incubator/markscheider

Example libraries and framework support ...

Demo: https://demo.zmon.io

ZMON and Slack: https://zmon.io && https://slack.zmon.io

Documentation: https://docs.zmon.io

Zalando Tech: https://tech.zalando.com

Expose your data / Convention on key names/structure

{
  "zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
  "zmon.response.200.GET.checks.all-active-check-definitions.fifteenMinuteRate": 0.18071,
  "zmon.response.200.GET.checks.all-active-check-definitions.fiveMinuteRate": 0.15181,
  "zmon.response.200.GET.checks.all-active-check-definitions.oneMinuteRate": 0.10512,
  "zmon.response.200.GET.checks.all-active-check-definitions.75thPercentile": 1173,
  "zmon.response.200.GET.checks.all-active-check-definitions.95thPercentile": 1233,
  "zmon.response.200.GET.checks.all-active-check-definitions.999thPercentile": 1282,
  "zmon.response.200.GET.checks.all-active-check-definitions.99thPercentile": 1282,
  "zmon.response.200.GET.checks.all-active-check-definitions.max": 1282,
  "zmon.response.200.GET.checks.all-active-check-definitions.median": 1161,
  "zmon.response.200.GET.checks.all-active-check-definitions.min": 1114
}
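A convention on key names makes the flat metrics easy to regroup. The sketch below assumes, from the example keys alone, that each key reads `<prefix>.response.<status>.<method>.<path>.<metric>`; that reading and the helper `group_metrics` are assumptions for illustration, not a documented ZMON format.

```python
def group_metrics(flat):
    """Group flat metric keys into path -> method -> status -> metric."""
    grouped = {}
    for key, number in flat.items():
        parts = key.split(".")
        # Skip keys that do not follow the assumed layout.
        if len(parts) < 6 or parts[1] != "response":
            continue
        status, method, metric = parts[2], parts[3], parts[-1]
        path = ".".join(parts[4:-1])
        by_status = grouped.setdefault(path, {}).setdefault(method, {})
        by_status.setdefault(status, {})[metric] = number
    return grouped


# Three keys taken from the example document.
flat = {
    "zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
    "zmon.response.200.GET.checks.all-active-check-definitions.median": 1161,
    "zmon.response.200.GET.checks.all-active-check-definitions.max": 1282,
}

grouped = group_metrics(flat)
stats = grouped["checks.all-active-check-definitions"]["GET"]["200"]
print(stats)
```

With the data grouped this way, a check or alert can address one endpoint and status code directly instead of matching on string prefixes.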
