Prometheus: A next-generation monitoring system. Fabian Reinartz – Production Engineer, SoundCloud Ltd.

Prometheus – a next-gen Monitoring System


Page 1: Prometheus – a next-gen Monitoring System

Prometheus: A next-generation monitoring system

Fabian Reinartz – Production Engineer, SoundCloud Ltd.

Page 2: Prometheus – a next-gen Monitoring System

Monitoring at SC 2012 – from monolith ...

Page 3: Prometheus – a next-gen Monitoring System

... to micro services

Page 4: Prometheus – a next-gen Monitoring System

Monitoring at SC 2012

Service A

Service B

Service C

StatsD Graphite

Page 5: Prometheus – a next-gen Monitoring System

History – monitoring at SoundCloud 2012

Source: http://eugenedvorkin.com/seven-micro-services-architecture-problems-and-solutions/

Page 6: Prometheus – a next-gen Monitoring System

History – monitoring at SoundCloud 2012

Source: http://blog.sflow.com/2011/12/using-ganglia-to-monitor-java-virtual.html

Page 7: Prometheus – a next-gen Monitoring System

History – monitoring at SoundCloud 2012

Source: http://www.bellarmine.edu/faculty/amahmood/tier3/monitoring.html

Page 8: Prometheus – a next-gen Monitoring System

PROMETHEUS

Page 9: Prometheus – a next-gen Monitoring System

Prometheus

- started by Matt Proud and Julius Volz as an Open Source project

- first commit 24-11-2012

- public announcement in January 2015

- inspired by Borgmon

- not Borgmon

Page 10: Prometheus – a next-gen Monitoring System

Features – multi-dimensional data model

http_requests_total{instance="web-1", path="/index", status="401", method="GET"}

#metrics x #labels x #values ▶ millions of time series

Page 11: Prometheus – a next-gen Monitoring System

Features – powerful query language

topk(3, sum by(path, method) (
  rate(http_requests_total{status=~"5.."}[5m])
))

histogram_quantile(0.99, sum by(le, path) (
  rate(http_requests_duration_seconds_bucket[5m])
))

Page 12: Prometheus – a next-gen Monitoring System

Features – powerful query language

topk(3, sum by(path, method) (
  rate(http_requests_total{status=~"5.."}[5m])
))

{path="/api/comments", method="POST"} 105.4
{path="/api/user/:id", method="GET"} 34.122
{path="/api/comment/:id/edit", method="POST"} 29.31
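To make the aggregation concrete, here is a minimal Go sketch of what `sum by(path, method)` followed by `topk(3, ...)` computes once the per-series rates are known. It is an illustration only, not how Prometheus implements its query engine; the `sample` and `group` types, the `instance` values, and the input rates are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// sample is a hypothetical stand-in for one per-series rate value.
type sample struct {
	path, method, instance string
	rate                   float64
}

// group is one aggregated output series, keyed by (path, method).
type group struct {
	path, method string
	rate         float64
}

// sumByPathMethod adds up the rates of all series that share the same
// path and method, dropping every other label (here: instance).
func sumByPathMethod(in []sample) []group {
	sums := map[[2]string]float64{}
	for _, s := range in {
		sums[[2]string{s.path, s.method}] += s.rate
	}
	out := make([]group, 0, len(sums))
	for k, v := range sums {
		out = append(out, group{k[0], k[1], v})
	}
	return out
}

// topK sorts gs in place by descending rate and returns the first k groups.
func topK(k int, gs []group) []group {
	sort.Slice(gs, func(i, j int) bool { return gs[i].rate > gs[j].rate })
	if k > len(gs) {
		k = len(gs)
	}
	return gs[:k]
}

func main() {
	// Hypothetical per-instance 5xx rates (requests per second).
	in := []sample{
		{"/api/comments", "POST", "web-1", 60},
		{"/api/comments", "POST", "web-2", 40},
		{"/api/user/:id", "GET", "web-1", 30},
		{"/index", "GET", "web-1", 5},
		{"/api/comment/:id/edit", "POST", "web-2", 20},
	}
	for _, g := range topK(3, sumByPathMethod(in)) {
		fmt.Printf("{path=%q, method=%q} %g\n", g.path, g.method, g.rate)
	}
}
```

The point of the sketch is the order of operations: labels not named in `by(...)` disappear during the sum, and `topk` then ranks the aggregated series rather than the raw ones.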

Page 13: Prometheus – a next-gen Monitoring System

Features – easy to use, yet scalable

- single static binary, no dependencies

$ go get github.com/prometheus/prometheus/cmd/...

$ prometheus

- local storage

- high-throughput [millions of time series, 380,000 samples/sec]

- efficient compression

Page 14: Prometheus – a next-gen Monitoring System
Page 15: Prometheus – a next-gen Monitoring System

Integrations

Page 16: Prometheus – a next-gen Monitoring System

Instrument – natively

var httpDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: namespace,
		Name:      "http_request_duration_seconds",
		Help:      "A histogram of HTTP request durations.",
		Buckets:   prometheus.ExponentialBuckets(0.0001, 1.5, 25),
	},
	[]string{"path", "method", "status"},
)

func handleAPI(w http.ResponseWriter, r *http.Request) {
	start := time.Now()

	// do work

	httpDuration.WithLabelValues(r.URL.Path, r.Method, status).
		Observe(time.Since(start).Seconds())
}
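The bucket layout requested by `prometheus.ExponentialBuckets(0.0001, 1.5, 25)` can be pictured with a small standard-library sketch. The helpers `exponentialBuckets` and `observe` below are hypothetical re-implementations written for illustration; they are not the client library's code. They assume the documented behaviour: 25 upper bounds, starting at 0.0001 s, each 1.5 times the previous one (so the largest bound is roughly 1.7 s), with cumulative "less than or equal" counting.

```go
package main

import "fmt"

// exponentialBuckets returns count upper bounds: the first is start and
// each following bound is the previous one multiplied by factor.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

// observe increments every bucket whose upper bound is >= v, which is how
// cumulative "le" buckets count an observation.
func observe(bounds []float64, counts []uint64, v float64) {
	for i, b := range bounds {
		if v <= b {
			counts[i]++
		}
	}
}

func main() {
	bounds := exponentialBuckets(0.0001, 1.5, 25)
	counts := make([]uint64, len(bounds))

	// Hypothetical request durations in seconds.
	for _, seconds := range []float64{0.00005, 0.003, 0.2} {
		observe(bounds, counts, seconds)
	}
	fmt.Printf("smallest bound=%g largest bound=%g\n", bounds[0], bounds[len(bounds)-1])
	fmt.Println(counts)
}
```

Because the buckets are cumulative, `histogram_quantile` can later estimate percentiles from the per-bucket rates alone.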

Page 17: Prometheus – a next-gen Monitoring System

Features – built-in expression browser

Page 18: Prometheus – a next-gen Monitoring System

Features – native Grafana support

Page 19: Prometheus – a next-gen Monitoring System

Features – PromDash

Page 20: Prometheus – a next-gen Monitoring System
Page 21: Prometheus – a next-gen Monitoring System

DOES IT SCALE?

Page 22: Prometheus – a next-gen Monitoring System

Features – federation & sharding

Cluster A Cluster B

Cluster C

service metrics container metrics

Page 23: Prometheus – a next-gen Monitoring System
Page 24: Prometheus – a next-gen Monitoring System

SERVICE DISCOVERY

Page 25: Prometheus – a next-gen Monitoring System

DNS SRV

$ dig +short SRV all.foo-api.srv.int.example.com
0 0 4738 ip-10-22-11-32.int.example.com.
0 0 3433 ip-10-22-11-32.int.example.com.
0 0 5934 ip-10-22-11-34.int.example.com.
0 0 5093 ip-10-22-11-42.int.example.com.
0 0 4589 ip-10-22-11-43.int.example.com.
0 0 9848 ip-10-22-12-11.int.example.com.
[...]

Page 26: Prometheus – a next-gen Monitoring System

DNS SRV

scrape_configs:
- job_name: "foo-api"
  metrics_path: "/metrics"
  dns_sd_configs:
  - names: ["all.foo-api.srv.int.example.com"]
    refresh_interval: 10s
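Each SRV answer has the form "priority weight port target", and a scrape target is just the target host paired with that port. The sketch below shows that mapping by parsing `dig +short SRV` style text. It is only an illustration: Prometheus resolves the records itself, and `parseSRV` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSRV turns lines of `dig +short SRV` output
// ("priority weight port target.") into "host:port" scrape targets.
func parseSRV(output string) []string {
	var targets []string
	for _, line := range strings.Split(output, "\n") {
		f := strings.Fields(line)
		if len(f) != 4 {
			continue // skip blank or malformed lines
		}
		host := strings.TrimSuffix(f[3], ".") // drop the trailing root dot
		targets = append(targets, host+":"+f[2])
	}
	return targets
}

func main() {
	out := "0 0 4738 ip-10-22-11-32.int.example.com.\n" +
		"0 0 3433 ip-10-22-11-32.int.example.com.\n"
	for _, t := range parseSRV(out) {
		fmt.Println(t)
	}
}
```

Note that the same host can appear more than once with different ports, which is why the port has to come from the record rather than from static configuration.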

Page 27: Prometheus – a next-gen Monitoring System

Fancy SD

- Consul

- Kubernetes

- Zookeeper

- EC2

- Mesos-Marathon

- … any via file-based plugins

Relabel based on SD data.

Page 28: Prometheus – a next-gen Monitoring System

Relabeling

relabel_configs:
- action: replace
  source_labels: [__address__, __telemetry_port]
  target_label: __address__
  regex: (.+):(.+);(.+)
  replacement: $1:$3

IN

"__address__": "10.44.12.135:25431"
"__telemetry_port": "82432"
"cluster": "AB"
"environment": "production"

OUT

"__address__": "10.44.12.135:82432"
"__telemetry_port": "82432"
"cluster": "AB"
"environment": "production"
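The IN/OUT example can be reproduced with a simplified Go sketch of the `replace` action. It assumes the usual relabeling semantics: the source label values are joined with `;`, the regex must match the whole joined string, and the replacement may refer to capture groups. `relabelReplace` is a hypothetical helper, not Prometheus's own code.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// relabelReplace joins the source label values with ";", matches the
// fully anchored pattern against the result and, on a match, writes the
// expanded replacement into the target label. Labels are modified in place.
func relabelReplace(labels map[string]string, sources []string, target, pattern, replacement string) {
	vals := make([]string, len(sources))
	for i, name := range sources {
		vals[i] = labels[name]
	}
	joined := strings.Join(vals, ";")

	re := regexp.MustCompile("^(?:" + pattern + ")$")
	m := re.FindStringSubmatchIndex(joined)
	if m == nil {
		return // no match: the label set is left untouched
	}
	labels[target] = string(re.ExpandString(nil, replacement, joined, m))
}

func main() {
	labels := map[string]string{
		"__address__":      "10.44.12.135:25431",
		"__telemetry_port": "82432",
		"cluster":          "AB",
		"environment":      "production",
	}
	relabelReplace(labels,
		[]string{"__address__", "__telemetry_port"},
		"__address__", "(.+):(.+);(.+)", "$1:$3")
	fmt.Println(labels["__address__"])
}
```

Here the joined input is `10.44.12.135:25431;82432`, so `$1` is the host, `$2` the original port, and `$3` the telemetry port that replaces it.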

Page 29: Prometheus – a next-gen Monitoring System

AWS EC2

scrape_configs:
- job_name: "foo-api"
  metrics_path: "/metrics"
  ec2_sd_configs:
  - region: us-east-1
    refresh_interval: 60s
    port: 80

The following meta labels are available during relabeling:
- __meta_ec2_instance_id: the EC2 instance ID
- __meta_ec2_public_ip: the public IP address of the instance
- __meta_ec2_private_ip: the private IP address of the instance, if present
- __meta_ec2_tag_<tagkey>: each tag value of the instance

Page 30: Prometheus – a next-gen Monitoring System

AWS EC2 – relabeling

relabel_configs:
- source_labels: [__meta_ec2_tag_Type]
  action: keep
  regex: foo-api
- source_labels: [__meta_ec2_tag_Deployment]
  action: replace
  target_label: deployment
  regex: (.+)
  replacement: $1
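The `keep` rule acts as a filter over discovered instances. A minimal sketch, assuming the regex has to match the whole label value and that instances without the tag carry an empty value; `keep` and the sample targets are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// keep returns only the targets whose value for label fully matches
// pattern; every other target is dropped from the scrape list.
func keep(targets []map[string]string, label, pattern string) []map[string]string {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	var kept []map[string]string
	for _, t := range targets {
		if re.MatchString(t[label]) {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	// Hypothetical discovered EC2 instances with their tag meta labels.
	targets := []map[string]string{
		{"__meta_ec2_instance_id": "i-1", "__meta_ec2_tag_Type": "foo-api"},
		{"__meta_ec2_instance_id": "i-2", "__meta_ec2_tag_Type": "bar-db"},
		{"__meta_ec2_instance_id": "i-3"}, // no Type tag at all
	}
	for _, t := range keep(targets, "__meta_ec2_tag_Type", "foo-api") {
		fmt.Println(t["__meta_ec2_instance_id"])
	}
}
```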

Page 31: Prometheus – a next-gen Monitoring System

ALERTMANAGER

Page 32: Prometheus – a next-gen Monitoring System

Alerting

- no opinions

- directly defined on time series data

- verbose on firing ▶ compact but detailed on notification

Page 33: Prometheus – a next-gen Monitoring System

Alerting

ALERT HighErrorRate
  IF sum by(job, path)(rate(http_requests_total{status=~"5.."}[5m])) /
     sum by(job, path)(rate(http_requests_total[5m])) * 100 > 1
  FOR 10m
  SUMMARY "high number of 5xx errors"
  DESCRIPTION "{{$labels.job}} has {{$value}}% 5xx errors on {{$labels.path}}"

Page 34: Prometheus – a next-gen Monitoring System

Alerting

{path="/api/comments", method="POST"} 5.43
{path="/api/user/:id", method="GET"} 1.22
{path="/api/comment/:id/edit", method="POST"} 1.01
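The alert expression is a per-series division followed by a threshold filter: only series whose error percentage exceeds the threshold remain, and each remaining series becomes one alert. A minimal Go sketch of that step, with a hypothetical `errorPercent` helper and made-up input rates:

```go
package main

import "fmt"

// errorPercent divides the 5xx rate by the total rate for every key that
// exists on both sides, scales to percent, and keeps only results above
// the threshold. Keys stand in for the label sets being matched.
func errorPercent(errRate, totalRate map[string]float64, threshold float64) map[string]float64 {
	out := map[string]float64{}
	for key, e := range errRate {
		total, ok := totalRate[key]
		if !ok || total == 0 {
			continue // no matching series, or 0/0: nothing passes the filter
		}
		if pct := e / total * 100; pct > threshold {
			out[key] = pct
		}
	}
	return out
}

func main() {
	// Hypothetical per-path request rates (requests per second).
	errRate := map[string]float64{"/api/comments": 1, "/index": 0.125}
	totalRate := map[string]float64{"/api/comments": 4, "/index": 64}

	for path, pct := range errorPercent(errRate, totalRate, 1) {
		fmt.Printf("{path=%q} %g\n", path, pct)
	}
}
```

An empty result means nothing is wrong; the `FOR 10m` clause then requires a series to stay in the result for ten minutes before the alert fires.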

Page 35: Prometheus – a next-gen Monitoring System

Alerting

ALERT HighErrorRate
  IF ... * 100 > 1
  FOR 10m
  WITH { severity = "warning" } …

ALERT HighErrorRate
  IF ... * 100 > 3
  FOR 10m
  WITH { severity = "critical" } …

Page 36: Prometheus – a next-gen Monitoring System

ALERTMANAGER

[Diagram labels: alerts; silence; inhibit; group, dedup, route; outputs: PagerDuty, Mail, Slack, ...]

Page 37: Prometheus – a next-gen Monitoring System

Alerting

ALERT DiskWillFillIn4Hours
  IF predict_linear(node_filesystem_free{job='node'}[1h], 4*3600) < 0
  FOR 5m
  SUMMARY "device filling up"
  DESCRIPTION "{{$labels.device}} mounted on {{$labels.mountpoint}} on {{$labels.instance}} will fill up within 4 hours."

http://www.robustperception.io/reduce-noise-from-disk-space-alerts/
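The idea behind `predict_linear` is a straight-line fit through the samples in the range, extended into the future. The sketch below uses ordinary least squares and, as a simplification, extrapolates from the timestamp of the last sample rather than from the rule evaluation time. The `point` type, `predictLinear`, and the sample values are hypothetical.

```go
package main

import "fmt"

// point is one sample: t in seconds, v in bytes of free space.
type point struct {
	t, v float64
}

// predictLinear fits a least-squares line through pts and returns the
// value that line reaches `ahead` seconds after the last sample.
// It needs at least two points with distinct timestamps.
func predictLinear(pts []point, ahead float64) float64 {
	n := float64(len(pts))
	var sumT, sumV, sumTV, sumTT float64
	for _, p := range pts {
		sumT += p.t
		sumV += p.v
		sumTV += p.t * p.v
		sumTT += p.t * p.t
	}
	slope := (n*sumTV - sumT*sumV) / (n*sumTT - sumT*sumT)
	intercept := (sumV - slope*sumT) / n
	return intercept + slope*(pts[len(pts)-1].t+ahead)
}

func main() {
	// Hypothetical free-space samples over one hour, losing 1 GB per 20 minutes.
	pts := []point{
		{0, 10e9}, {1200, 9e9}, {2400, 8e9}, {3600, 7e9},
	}
	predicted := predictLinear(pts, 4*3600)
	fmt.Printf("predicted free bytes in 4h: %g, alert: %v\n", predicted, predicted < 0)
}
```

Alerting on the predicted value rather than on a fixed free-space threshold is what lets the rule ignore disks that are full but stable.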

Page 38: Prometheus – a next-gen Monitoring System

DEMO

Page 39: Prometheus – a next-gen Monitoring System

Turing complete

http://www.robustperception.io/conways-life-in-prometheus/

Page 40: Prometheus – a next-gen Monitoring System

Recording rules

job:http_requests:rate5m = sum by(job) (rate(http_requests_total[5m]))