JAN MUSSLER
Twitter: @JanMussler
zmon.io
#NETWAYS #OSMC 30-11-2016
ZMON: Open Source Monitoring in the Cloud
● 15 countries
● 19+ million active customers
● 160+ million visits per month
● 200k+ articles
● 3.0+ billion € revenue
● ~1,600 employees in tech
Europe's Leading Online Fashion Platform
Visit us: tech.zalando.com
Zalando’s Technology History
RADICAL AGILITY
AUTONOMY
➊ One AWS account per Team
➋ Deployment with Docker
➌ Managed SSH Access
➍ REST/OAuth 2.0 mandatory
➎ Traceability of changes
IN A NUTSHELL
STUPS
ISOLATED AWS ACCOUNTS
[Diagram: traffic from the Internet reaches Team ABC (*.abc.example.org) and Team XYZ (*.xyz.example.org); each team has its own ELBs and EC2 instances in a separate AWS account.]
RESPONSIBILITY
OWNERSHIP
Monitoring the old way?
[Diagram: hosts running Service 1 to Service 4, four teams, and a separate "Monitoring Team?"]
Build with teams and services in mind ...
[Diagram: the same hosts and services (Service 1 to Service 4), now grouped by owning team: Team 1, Team 2, Team 3.]
ZMON.io
Flexible and extendable: Checks & Alerts in Python
Integrate: REST APIs, OAuth 2.0, Auto Discovery
Configurable via UI / API: no restarts required!
Great for teams: autonomy and responsibility
Fast/Scaling metrics: Redis, KairosDB + Grafana 3
ZMON - Highlights ;-)
Good old green and red boxes?
Full authentication for all endpoints
OAuth 2.0 login flow (e.g. GitHub login)
“TV Tokens” for “read-only” dashboard login
Grafana 3 bundled and API implemented
Proxy for KairosDB (time series database)
ZMON Controller - User Interface and REST API
Display historic data using Grafana 3
Various options for notifications ...
Twilio (phone call)
PUSH
ENTITIES
● hosts, databases, applications, instances ...
● generic key-value object
● 20,000+ entities in our deployment
Entities
{
  "id": "node01:8080",
  "type": "instance",
  "host": "node01",
  "ports": {"8080": 8080, "8181": 8181},
  "application_id": "zmon",
  "application_version": "0.1.0",
  "dc": "dc1"
}
Entity "node01:8080"
Entity Service (part of controller)
id: localhost:5432
type: postgres
host: localhost
port: 5432
shards:
  local_zmon_db: "localhost:5432/local_zmon_db"
local-postgres.yaml
Integrated easy-to-use entity store with REST API
Build your own discovery agent (K8S, …)
> zmon entities push local-postgres.yaml
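A home-grown discovery agent mostly has to produce generic key-value objects like the "node01:8080" entity shown earlier. The following is a minimal sketch of that step only. The helper name `build_instance_entity` and the choice of deriving the id from the first port are assumptions made for illustration; pushing the result to the entity store is left out.

```python
import json


def build_instance_entity(host, ports, application_id, application_version, dc):
    """Build one instance entity dict (hypothetical helper, not part of ZMON).

    Assumption: the entity id is "<host>:<first port>", mirroring the
    "node01:8080" example entity.
    """
    return {
        "id": f"{host}:{ports[0]}",
        "type": "instance",
        "host": host,
        # Same shape as the example entity: port name -> port number.
        "ports": {str(port): port for port in ports},
        "application_id": application_id,
        "application_version": application_version,
        "dc": dc,
    }


entity = build_instance_entity("node01", [8080, 8181], "zmon", "0.1.0", "dc1")
print(json.dumps(entity, indent=2))
```

Because entities are plain key-value objects, extra attributes such as a region or team can be added without changing anything else.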
CHECKS
● select subset of entities
● executes Python expression
○ powerful using eval with custom context
○ Builtins: HTTP, PostgreSQL, MySQL, CloudWatch, Redis, SNMP/NRPE, TCP, Scalyr, Elasticsearch, …
○ Data filtering/formatting/pivoting
● returns "value" object -> dicts everywhere
Checks
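The check model above (select entities, evaluate a Python expression in a custom context, return a dict) can be illustrated roughly as follows. This is a simplified sketch, not ZMON's worker code: `run_check` and `fake_http_status` are hypothetical names, and clearing `__builtins__` only narrows the evaluation context; it is not a security sandbox.

```python
def run_check(check, entities, helpers):
    """Evaluate the check command once per entity whose type is selected."""
    wanted_types = {f["type"] for f in check["entities"]}
    results = {}
    for entity in entities:
        if entity.get("type") not in wanted_types:
            continue
        # Custom context: the current entity plus the offered helper functions.
        context = {"__builtins__": {}, "entity": entity, **helpers}
        results[entity["id"]] = eval(check["command"], context)
    return results


def fake_http_status(entity):
    """Hypothetical stand-in for a real builtin; returns a dict value."""
    return {"code": 200, "host": entity["host"]}


check = {
    "name": "Fake HTTP status",
    "entities": [{"type": "instance"}],
    "command": "fake_http_status(entity)",
}
entities = [
    {"id": "node01:8080", "type": "instance", "host": "node01"},
    {"id": "localhost:5432", "type": "postgres", "host": "localhost"},
]

print(run_check(check, entities, {"fake_http_status": fake_http_status}))
```

Only the instance entity is selected, so the postgres entity never reaches the command.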
SNMP and Nagios NRPE support
REST API to update or use web front end
zmon check-definitions update select-1-check.yaml
Managing checks
name: "Select 1"
owning_team: "Team ZMON"
command: |
  sql().execute("select 1 as a").results()
entities:
  - type: postgresql
interval: 15
description: "Test connection to PostgreSQL"
select-1-check.yaml
Trial Run - Quick feedback and easier development
ALERTS
● Attached to a single check, inspect check result
● Defines team and responsible team
● Allows inheritance from another alert
● Evaluates Python expression yielding True/False
● No "WARNING" state, no "UNKNOWN" state
● Priorities (color) and tags
Alerts
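Two of the alert properties above, inheritance and the True/False condition, can be sketched in a few lines. This is an illustration under assumptions: the field names (`parent_id`, `condition`, `priority`, `team`, `responsible_team`) and both helper functions are chosen for the example and are not taken from ZMON's code.

```python
def resolve_alert(alert, alerts_by_id):
    """Merge an alert with its parent chain; child fields override inherited ones."""
    parent_id = alert.get("parent_id")
    if parent_id is None:
        return dict(alert)
    merged = resolve_alert(alerts_by_id[parent_id], alerts_by_id)
    merged.update({key: val for key, val in alert.items() if val is not None})
    return merged


def is_alerting(alert, value):
    """Evaluate the condition against a check value: True or False, nothing else."""
    return bool(eval(alert["condition"], {"__builtins__": {}}, {"value": value}))


alerts = {
    1: {"id": 1, "parent_id": None, "name": "Slow responses",
        "condition": "value['median'] > 1000", "priority": 2,
        "team": "Team ZMON", "responsible_team": "Team ZMON"},
    # The child keeps name and responsible team, but tightens the rest.
    2: {"id": 2, "parent_id": 1, "condition": "value['median'] > 1200",
        "priority": 1, "team": "Team Foo"},
}

child = resolve_alert(alerts[2], alerts)
check_value = {"median": 1161}
print(is_alerting(alerts[1], check_value), is_alerting(child, check_value))
```

With a median of 1161 the parent alert fires while the inherited alert, with its higher threshold, stays quiet.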
Downtimes
● Set or schedule downtimes using the UI
● Use API to automate downtimes, e.g. in deployment tool
Reuse existing checks for core infrastructure
Anyone can add alerts to checks
Monitor application boundaries/dependencies
Make use of inheritance to customize
Sharing and reuse of alerts and checks
EXAMPLE
Plan B Deployment - Multi Region Setup (JWT issue/verification)
[Diagram: in each region, one ELB fronts the Tokeninfo (Go) instances and another ELB fronts the Provider (Java) instances, with Cassandra (C*) nodes behind them.]
Creates “entities” to describe the deployment
ELBs, ASGs, applications, instances, ...
Crawls the AWS API every 60 seconds to update them
ZMON AWS Agent - Auto Discovery
➜ ~ zmon entities get "planb-tokeninfo-cd44-oFM6x[aws:999:eu-west-1]"
id: planb-tokeninfo-cd44-oFM6x[aws:999:eu-west-1]
type: instance
application_id: planb-tokeninfo
host: 172.31.169.6
infrastructure_account: aws:999
instance_type: c4.xlarge
ip: 172.31.169.6
ports: { '9020': 9020, '9021': 9021 }
region: eu-west-1
source: registry.opensource.zalan.do/stups/planb-tokeninfo:cd44
stack_name: planb-tokeninfo-eu-west-1
stack_version: cd44
Example Instance Entity
➜ ~ zmon entities get "elb-data-service-cd79c9[aws:...:eu-central-1]"
id: elb-data-service-cd79c9[aws:...:eu-central-1]
type: elb
name: data-service-cd79c9
active_members: 5
cloudwatch_name: app/data-service-cd79c9/18b164bfa427486d
dns_name: data-service-cd79c9-961635181.eu-central-1.elb.amazonaws.com
dns_traffic: 'true'
dns_weight: 200
elb_type: application
members: 5
region: eu-central-1
scheme: internet-facing
Example ELB Entity
Instance Metrics
● Memory usage
● Disk space usage
● CPU usage
● Application logs
● Application metrics
Monitoring Plan-B EC2 instances on AWS
Scalyr Agent: log shipping
Prometheus Node Agent: :9100/metrics
Taupage AMI (Ubuntu base)
Application Container: Go / Spring Boot / Cassandra
Docker runtime: :8080 -> app, :7979 -> metrics
Jolokia Request Example
Check Results
Check result - Grafana 3 link
AWS UI deep link
Monitor your deployments … data tagged with version
Annotated Metric Data in Grafana
HTTP requests reading JSON application metrics
Read JMX data via Jolokia/HTTP for Cassandra
Read Prometheus Node data for EC2 metrics
CloudWatch() queries for ELB metrics
Scalyr API queries for application logs
Check commands used so far
DEPLOYMENT
ZMON Core + UI + KairosDB
[Architecture diagram. Components: Scheduler (JVM), Redis (queue/state), Workers (Python), Controller (Java), PostgreSQL (check/alert definitions, entity data), KairosDB (Java), Cassandra, CLI (Python), Frontend (AngularJS), metric cache.]
● Scheduler supports queue filters by entity
○ e.g. {"dc":"dc1"} vs {"dc":"dc2"} queue filters
● Scheduler can apply base filter
○ only handles entities with {"dc":"dc1"}
● Worker can report home using:
○ Redis (we use this across DCs)
○ HTTPS (AWS->DC)
Multi DC / Zone deployment possible
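How a base filter and per-queue filters could interact is sketched below. This only illustrates the idea from the bullets above; it is not the scheduler's implementation. The queue names, the `route` function, and the fallback to a default queue are assumptions.

```python
def matches(entity, entity_filter):
    """True if every key/value pair of the filter is present on the entity."""
    return all(entity.get(key) == val for key, val in entity_filter.items())


def route(entity, base_filter, queue_filters, default_queue="default"):
    """Return the queue for an entity, or None if this scheduler ignores it."""
    if not matches(entity, base_filter):
        return None  # outside the base filter: another scheduler handles it
    for queue, filters in queue_filters.items():
        if any(matches(entity, f) for f in filters):
            return queue
    return default_queue


queue_filters = {
    "queue-dc1": [{"dc": "dc1"}],
    "queue-dc2": [{"dc": "dc2"}],
}
node_dc1 = {"id": "node01:8080", "type": "instance", "dc": "dc1"}
node_dc2 = {"id": "node02:8080", "type": "instance", "dc": "dc2"}

# One scheduler with no base filter, feeding one worker queue per data center.
print(route(node_dc1, {}, queue_filters), route(node_dc2, {}, queue_filters))

# A scheduler restricted to dc1 by its base filter skips dc2 entities.
print(route(node_dc2, {"dc": "dc1"}, queue_filters))
```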
ZMON in AWS / Multi DC Setup
[Diagram: the AWS accounts of Team "Foo" (*.foo.example.org) and Team "Bar" (*.bar.example.org) each contain EC2 instances, an ELB, and a ZMON Appliance, plus the ZMON Data Service.]
MICROSERVICES
Application metrics
Continued ...
Spring Boot (extending metrics): https://github.com/zalando/zmon-actuator
Python (Swagger first on Flask): https://github.com/zalando/connexion
Clojure (Swagger first): https://github.com/zalando-stups/friboo/
Scala Play: https://github.com/zalando-incubator/markscheider
Example libraries and framework support ...
Demo: https://demo.zmon.io
ZMON and Slack: https://zmon.io && https://slack.zmon.io
Documentation: https://docs.zmon.io
Zalando Tech: https://tech.zalando.com
Expose your data / Convention on key names/structure
{
  "zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
  "zmon.response.200.GET.checks.all-active-check-definitions.fifteenMinuteRate": 0.18071,
  "zmon.response.200.GET.checks.all-active-check-definitions.fiveMinuteRate": 0.15181,
  "zmon.response.200.GET.checks.all-active-check-definitions.oneMinuteRate": 0.10512,
  "zmon.response.200.GET.checks.all-active-check-definitions.75thPercentile": 1173,
  "zmon.response.200.GET.checks.all-active-check-definitions.95thPercentile": 1233,
  "zmon.response.200.GET.checks.all-active-check-definitions.999thPercentile": 1282,
  "zmon.response.200.GET.checks.all-active-check-definitions.99thPercentile": 1282,
  "zmon.response.200.GET.checks.all-active-check-definitions.max": 1282,
  "zmon.response.200.GET.checks.all-active-check-definitions.median": 1161,
  "zmon.response.200.GET.checks.all-active-check-definitions.min": 1114
}
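A key convention like the one above makes the data easy to pivot inside a check. The sketch below assumes the layout `zmon.response.<status>.<method>.<endpoint>.<metric>`, where the endpoint part may itself contain dots and the metric name does not. The function name `pivot_metrics` is hypothetical.

```python
from collections import defaultdict


def pivot_metrics(flat, prefix="zmon.response."):
    """Group flat metric keys by (status, method, endpoint) -> {metric: value}."""
    grouped = defaultdict(dict)
    for key, val in flat.items():
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if rest.count(".") < 3:
            continue  # not enough segments to follow the assumed convention
        status, method, remainder = rest.split(".", 2)
        endpoint, metric = remainder.rsplit(".", 1)
        grouped[(status, method, endpoint)][metric] = val
    return dict(grouped)


sample = {
    "zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
    "zmon.response.200.GET.checks.all-active-check-definitions.median": 1161,
    "zmon.response.200.GET.checks.all-active-check-definitions.99thPercentile": 1282,
}

for (status, method, endpoint), metrics in pivot_metrics(sample).items():
    print(status, method, endpoint, metrics)
```

Grouping by status code and endpoint this way lets a single alert condition compare, for example, the median latency of one endpoint against a threshold.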