Prometheus – a next-gen Monitoring System

Prometheus: A next-generation monitoring system

Fabian Reinartz – Production Engineer, SoundCloud Ltd.

Monitoring at SC 2012 – from monolith ...

... to micro services

Monitoring at SC 2012

[Diagram: Service A, Service B, Service C, StatsD, Graphite]

History – monitoring at SoundCloud 2012

Source: http://eugenedvorkin.com/seven-micro-services-architecture-problems-and-solutions/


Source: http://blog.sflow.com/2011/12/using-ganglia-to-monitor-java-virtual.html


Source: http://www.bellarmine.edu/faculty/amahmood/tier3/monitoring.html

PROMETHEUS

Prometheus

- started by Matt Proud and Julius Volz as an Open Source project

- first commit 24-11-2012

- public announcement in January 2015

- inspired by Borgmon

- not Borgmon

Features – multi-dimensional data model

http_requests_total{instance="web-1", path="/index", status="401", method="GET"}

#metrics x #labels x #values ▶ millions of time series
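To make the multiplication above concrete, here is a minimal Go sketch that computes the worst-case number of series for a single metric as the product of the distinct-value counts of its labels. The label names come from the example above; the cardinalities are made-up numbers for illustration only.

```go
package main

import "fmt"

// seriesUpperBound returns the maximum number of time series one metric can
// produce: the product of the distinct-value counts of all its labels.
func seriesUpperBound(valuesPerLabel map[string]int) int {
	total := 1
	for _, n := range valuesPerLabel {
		total *= n
	}
	return total
}

func main() {
	// Hypothetical cardinalities for the labels of http_requests_total.
	labels := map[string]int{
		"instance": 100,
		"path":     20,
		"status":   5,
		"method":   4,
	}
	fmt.Println(seriesUpperBound(labels)) // 40000
}
```

In practice not every combination occurs, so this is an upper bound, but it shows how a few hundred metrics quickly reach millions of series.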

Features – powerful query language

topk(3, sum by(path, method) (
  rate(http_requests_total{status=~"5.."}[5m])
))

histogram_quantile(0.99, sum by(le, path) (
  rate(http_requests_duration_seconds_bucket[5m])
))

Features – powerful query language

topk(3, sum by(path, method) (
  rate(http_requests_total{status=~"5.."}[5m])
))

{path="/api/comments", method="POST"} 105.4
{path="/api/user/:id", method="GET"} 34.122
{path="/api/comment/:id/edit", method="POST"} 29.31
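The aggregation in the query above can be pictured with a small in-memory Go sketch. It is not how Prometheus evaluates PromQL; it only mimics the semantics of `sum by(...)` followed by `topk` on samples that are assumed to already hold per-second rates. The `sample` and `group` types and the input values are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// sample is one series with its current per-second rate.
type sample struct {
	labels map[string]string
	value  float64
}

// group is one aggregated result: the grouping key and the summed value.
type group struct {
	key   string
	value float64
}

// sumBy adds up all samples that share the same values for the given labels,
// dropping every other label (like PromQL's "sum by(...)").
func sumBy(samples []sample, by ...string) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range samples {
		key := ""
		for _, name := range by {
			key += name + "=" + s.labels[name] + ","
		}
		out[key] += s.value
	}
	return out
}

// topK returns the k groups with the largest values, largest first.
func topK(k int, sums map[string]float64) []group {
	groups := make([]group, 0, len(sums))
	for key, v := range sums {
		groups = append(groups, group{key, v})
	}
	sort.Slice(groups, func(i, j int) bool {
		if groups[i].value != groups[j].value {
			return groups[i].value > groups[j].value
		}
		return groups[i].key < groups[j].key // stable order for ties
	})
	if k < len(groups) {
		groups = groups[:k]
	}
	return groups
}

func main() {
	samples := []sample{
		{map[string]string{"instance": "web-1", "path": "/api/comments", "method": "POST"}, 60.5},
		{map[string]string{"instance": "web-2", "path": "/api/comments", "method": "POST"}, 44.5},
		{map[string]string{"instance": "web-1", "path": "/api/user/:id", "method": "GET"}, 34},
		{map[string]string{"instance": "web-1", "path": "/index", "method": "GET"}, 1},
	}
	for _, g := range topK(3, sumBy(samples, "path", "method")) {
		fmt.Println(g.key, g.value)
	}
}
```

Summing by `path` and `method` removes the `instance` dimension, which is why the result lines above carry only those two labels.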

Features – easy to use, yet scalable

- single static binary, no dependencies

$ go get github.com/prometheus/prometheus/cmd/...

$ prometheus

- local storage

- high-throughput [millions of time series, 380,000 samples/sec]

- efficient compression

Integrations

Instrument – natively

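import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// httpDuration tracks request latency, partitioned by path, method and status.
// namespace is assumed to be defined elsewhere in the package.
var httpDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: namespace,
		Name:      "http_request_duration_seconds",
		Help:      "A histogram of HTTP request durations.",
		Buckets:   prometheus.ExponentialBuckets(0.0001, 1.5, 25),
	},
	[]string{"path", "method", "status"},
)

func handleAPI(w http.ResponseWriter, r *http.Request) {
	start := time.Now()

	// do work; status is assumed to be set to the response status code here

	httpDuration.WithLabelValues(r.URL.Path, r.Method, status).Observe(time.Since(start).Seconds())
}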
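To see which latency range the bucket layout above covers, the following standard-library sketch reimplements the idea behind `prometheus.ExponentialBuckets(start, factor, count)`: `count` upper bounds, each `factor` times the previous one. It is an illustration written for this purpose, not the client library's code.

```go
package main

import "fmt"

// exponentialBuckets returns count bucket upper bounds. The first is start
// and each following bound is the previous one multiplied by factor.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// Same parameters as the histogram above: 0.1 ms start, factor 1.5, 25 buckets.
	b := exponentialBuckets(0.0001, 1.5, 25)
	fmt.Printf("buckets: %d, smallest: %gs, largest: %.3fs\n", len(b), b[0], b[len(b)-1])
}
```

With these parameters the largest bound is 0.0001 × 1.5^24, roughly 1.7 seconds, so slower requests fall into the implicit +Inf bucket.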

Features – built-in expression browser

Features – native Grafana support

Features – PromDash

DOES IT SCALE?

Features – federation & sharding

[Diagram: Cluster A, Cluster B, Cluster C; service metrics, container metrics]

SERVICE DISCOVERY

DNS SRV

$ dig +short SRV all.foo-api.srv.int.example.com
0 0 4738 ip-10-22-11-32.int.example.com.
0 0 3433 ip-10-22-11-32.int.example.com.
0 0 5934 ip-10-22-11-34.int.example.com.
0 0 5093 ip-10-22-11-42.int.example.com.
0 0 4589 ip-10-22-11-43.int.example.com.
0 0 9848 ip-10-22-12-11.int.example.com.
[...]

DNS SRV

scrape_configs:
- job_name: "foo-api"
  metrics_path: "/metrics"
  dns_sd_configs:
  - names: ["all.foo-api.srv.int.example.com"]
    refresh_interval: 10s

Fancy SD

- Consul
- Kubernetes
- Zookeeper
- EC2
- Mesos-Marathon

- … any via file-based plugins

Relabel based on SD data.

Relabeling

relabel_configs:
- action: replace
  source_labels: [__address__, __telemetry_port]
  target_label: __address__
  regex: (.+):(.+);(.+)
  replacement: $1:$3

IN

"__address__": "10.44.12.135:25431"
"__telemetry_port": "82432"
"cluster": "AB"
"environment": "production"

OUT

"__address__": "10.44.12.135:82432"
"__telemetry_port": "82432"
"cluster": "AB"
"environment": "production"
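The `replace` step can be traced by hand with a short Go sketch. It assumes the behaviour described on the slides: the source label values are joined with `;`, the regex must match the whole joined string, and the replacement with its capture groups is written to the target label. The function `relabelReplace` is a hypothetical helper, not Prometheus's implementation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// relabelReplace joins the source label values with ";", matches the fully
// anchored regex against the result and, on a match, stores the expanded
// replacement in the target label. Without a match the labels are unchanged.
func relabelReplace(labels map[string]string, sourceLabels []string, regex, replacement, target string) {
	values := make([]string, len(sourceLabels))
	for i, name := range sourceLabels {
		values[i] = labels[name]
	}
	joined := strings.Join(values, ";")

	re := regexp.MustCompile("^(?:" + regex + ")$")
	match := re.FindStringSubmatchIndex(joined)
	if match == nil {
		return
	}
	labels[target] = string(re.ExpandString(nil, replacement, joined, match))
}

func main() {
	// The IN label set from the slide.
	labels := map[string]string{
		"__address__":      "10.44.12.135:25431",
		"__telemetry_port": "82432",
		"cluster":          "AB",
		"environment":      "production",
	}
	relabelReplace(labels,
		[]string{"__address__", "__telemetry_port"},
		`(.+):(.+);(.+)`, "$1:$3", "__address__")
	fmt.Println(labels["__address__"]) // 10.44.12.135:82432
}
```

The joined input is `10.44.12.135:25431;82432`, so `$1` is the host, `$2` the original port and `$3` the telemetry port that replaces it.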

AWS EC2

scrape_configs:
- job_name: "foo-api"
  metrics_path: "/metrics"
  ec2_sd_configs:
  - region: us-east-1
    refresh_interval: 60s
    port: 80

The following meta labels are available during relabeling:
- __meta_ec2_instance_id: the EC2 instance ID
- __meta_ec2_public_ip: the public IP address of the instance
- __meta_ec2_private_ip: the private IP address of the instance, if present
- __meta_ec2_tag_<tagkey>: each tag value of the instance

AWS EC2 – relabeling

relabel_configs:
- source_labels: [__meta_ec2_tag_Type]
  action: keep
  regex: foo-api
- source_labels: [__meta_ec2_tag_Deployment]
  action: replace
  target_label: deployment
  regex: (.+)
  replacement: $1
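The `keep` action in the first rule acts as a filter on discovered targets. A minimal Go sketch of that idea follows, assuming a single source label and a fully anchored regex; the `keep` function and the instance data are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// keep returns only those targets whose source label value fully matches
// the regex; every other discovered target is dropped.
func keep(targets []map[string]string, sourceLabel, regex string) []map[string]string {
	re := regexp.MustCompile("^(?:" + regex + ")$")
	var kept []map[string]string
	for _, t := range targets {
		if re.MatchString(t[sourceLabel]) {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	// Hypothetical EC2 instances as discovery would present them.
	discovered := []map[string]string{
		{"__meta_ec2_instance_id": "i-1", "__meta_ec2_tag_Type": "foo-api"},
		{"__meta_ec2_instance_id": "i-2", "__meta_ec2_tag_Type": "bar-worker"},
		{"__meta_ec2_instance_id": "i-3"}, // no Type tag at all
	}
	for _, t := range keep(discovered, "__meta_ec2_tag_Type", "foo-api") {
		fmt.Println(t["__meta_ec2_instance_id"])
	}
}
```

Instances without the tag have an empty label value, which does not match `foo-api`, so they are dropped as well.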

ALERTMANAGER

Alerting

- no opinions

- directly defined on time series data

- verbose on firing ▶ compact but detailed on notification

Alerting

ALERT HighErrorRate
  IF sum by(job, path)(rate(http_requests_total{status=~"5.."}[5m])) /
     sum by(job, path)(rate(http_requests_total[5m])) * 100 > 1
  FOR 10m
  SUMMARY "high number of 5xx errors"
  DESCRIPTION "{{$labels.job}} has {{$value}}% 5xx errors on {{$labels.path}}"

Alerting

{path="/api/comments", method="POST"} 5.43
{path="/api/user/:id", method="GET"} 1.22
{path="/api/comment/:id/edit", method="POST"} 1.01

Alerting

ALERT HighErrorRate
  IF ... * 100 > 1
  FOR 10m
  WITH { severity = "warning" } …

ALERT HighErrorRate
  IF ... * 100 > 3
  FOR 10m
  WITH { severity = "critical" } …

ALERTMANAGER

alerts

silence

inhibit

group, dedup, route

PagerDuty

Mail

Slack

...

Alerting

ALERT DiskWillFillIn4Hours
  IF predict_linear(node_filesystem_free{job='node'}[1h], 4*3600) < 0
  FOR 5m
  SUMMARY "device filling up"
  DESCRIPTION "{{$labels.device}} mounted on {{$labels.mountpoint}} on {{$labels.instance}} will fill up within 4 hours."
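The idea behind `predict_linear` is a simple linear regression over the samples in the range, extrapolated a given number of seconds into the future. The Go sketch below shows that calculation with ordinary least squares. It illustrates the technique only and is not Prometheus's implementation; the `point` type and the sample values are made up.

```go
package main

import "fmt"

// point is one sample: a Unix timestamp in seconds and a value (free bytes).
type point struct {
	t, v float64
}

// predictLinear fits a least-squares line through the points and returns the
// value it predicts "ahead" seconds after "now". It reports false when the
// fit is impossible (fewer than two points or all at the same timestamp).
func predictLinear(points []point, now, ahead float64) (float64, bool) {
	n := float64(len(points))
	if n < 2 {
		return 0, false
	}
	var sumX, sumY, sumXY, sumXX float64
	for _, p := range points {
		x := p.t - now // time relative to now keeps the numbers small
		sumX += x
		sumY += p.v
		sumXY += x * p.v
		sumXX += x * x
	}
	denom := n*sumXX - sumX*sumX
	if denom == 0 {
		return 0, false
	}
	slope := (n*sumXY - sumX*sumY) / denom
	intercept := (sumY - slope*sumX) / n // fitted value at x = 0, i.e. now
	return intercept + slope*ahead, true
}

func main() {
	// Hypothetical free space over the last hour: 100 GB falling to 50 GB.
	points := []point{{0, 100e9}, {1800, 75e9}, {3600, 50e9}}
	predicted, ok := predictLinear(points, 3600, 4*3600)
	fmt.Println(predicted < 0, ok) // a negative prediction means the disk fills up
}
```

A disk losing 50 GB per hour with 50 GB left reaches zero after one hour, so the four-hour prediction is negative and the alert condition holds.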

http://www.robustperception.io/reduce-noise-from-disk-space-alerts/



DEMO

Turing complete

http://www.robustperception.io/conways-life-in-prometheus/



Recording rules

job:http_requests:rate5m = sum by(job) (rate(http_requests_total[5m]))

Technology
