Prometheus – A next-generation monitoring system
Fabian Reinartz – Production Engineer, SoundCloud Ltd.
Monitoring at SC 2012 – from monolith ...
... to micro services
Monitoring at SC 2012
Service A
Service B
Service C
StatsD Graphite
History – monitoring at SoundCloud 2012
Source: http://eugenedvorkin.com/seven-micro-services-architecture-problems-and-solutions/
History – monitoring at SoundCloud 2012
Source: http://blog.sflow.com/2011/12/using-ganglia-to-monitor-java-virtual.html
History – monitoring at SoundCloud 2012
Source: http://www.bellarmine.edu/faculty/amahmood/tier3/monitoring.html
PROMETHEUS
Prometheus
- started by Matt Proud and Julius Volz as an Open Source project
- first commit 24-11-2012
- public announcement in January 2015
- inspired by Borgmon
- not Borgmon
Features – multi-dimensional data model
http_requests_total{instance="web-1", path="/index", status="401", method="GET"}
#metrics x #labels x #values ▶ millions of time series
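The multiplication on this slide can be sketched in a few lines of Go. The label names come from the example series above; the number of distinct values per label is made up purely for illustration.

```go
package main

import "fmt"

// seriesCount returns an upper bound on the number of time series of one
// metric: the product of the number of distinct values of each label.
func seriesCount(labelValues map[string]int) int {
	n := 1
	for _, v := range labelValues {
		n *= v
	}
	return n
}

func main() {
	// Hypothetical cardinalities for the labels of http_requests_total.
	labels := map[string]int{
		"instance": 100,
		"path":     50,
		"status":   10,
		"method":   4,
	}
	fmt.Println("possible series for one metric:", seriesCount(labels))
}
```

With a few dozen metrics of that shape, the total quickly reaches millions of series.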
Features – powerful query language
topk(3, sum by(path, method) (
  rate(http_requests_total{status=~"5.."}[5m])
))
histogram_quantile(0.99, sum by(le, path) (
  rate(http_request_duration_seconds_bucket[5m])
))
Features – powerful query language
topk(3, sum by(path, method) (
  rate(http_requests_total{status=~"5.."}[5m])
))
{path="/api/comments", method="POST"} 105.4
{path="/api/user/:id", method="GET"} 34.122
{path="/api/comment/:id/edit", method="POST"} 29.31
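What the query does can be spelled out in plain Go: keep only 5xx series, sum the rates per (path, method), and return the three largest sums. This is an illustrative sketch, not how Prometheus evaluates queries; the `sample` type and the input rates are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// sample is one series of rate(http_requests_total[5m]) with its labels.
type sample struct {
	path, method, status string
	rate                 float64
}

// group is one output element of sum by(path, method).
type group struct {
	path, method string
	rate         float64
}

// serverError mimics the fully anchored matcher status=~"5..".
var serverError = regexp.MustCompile(`^(?:5..)$`)

// topK filters 5xx samples, sums them by (path, method) and keeps the
// k groups with the highest summed rate.
func topK(k int, samples []sample) []group {
	sums := map[[2]string]float64{}
	for _, s := range samples {
		if !serverError.MatchString(s.status) {
			continue
		}
		sums[[2]string{s.path, s.method}] += s.rate
	}
	groups := make([]group, 0, len(sums))
	for key, r := range sums {
		groups = append(groups, group{key[0], key[1], r})
	}
	sort.Slice(groups, func(i, j int) bool { return groups[i].rate > groups[j].rate })
	if len(groups) > k {
		groups = groups[:k]
	}
	return groups
}

func main() {
	samples := []sample{
		{"/api/comments", "POST", "500", 100},
		{"/api/comments", "POST", "503", 5.4},
		{"/api/user/:id", "GET", "500", 34.122},
		{"/api/comment/:id/edit", "POST", "502", 29.31},
		{"/index", "GET", "500", 1.2},
		{"/index", "GET", "200", 900},
	}
	for _, g := range topK(3, samples) {
		fmt.Printf("{path=%q, method=%q} %g\n", g.path, g.method, g.rate)
	}
}
```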
Features – easy to use, yet scalable
- single static binary, no dependencies
$ go get github.com/prometheus/prometheus/cmd/...
$ prometheus
- local storage
- high-throughput [millions of time series, 380,000 samples/sec]
- efficient compression
Integrations
Instrument – natively
var httpDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Namespace: namespace,
        Name:      "http_request_duration_seconds",
        Help:      "A histogram of HTTP request durations.",
        Buckets:   prometheus.ExponentialBuckets(0.0001, 1.5, 25),
    },
    []string{"path", "method", "status"},
)

func handleAPI(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    // do work; status holds the resulting HTTP status code as a string
    httpDuration.WithLabelValues(r.URL.Path, r.Method, status).
        Observe(time.Since(start).Seconds())
}
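The bucket layout chosen above is easier to judge when the numbers are printed. The helper below re-implements the documented behaviour of `prometheus.ExponentialBuckets` (the first upper bound is `start`, each following one is `factor` times the previous) using only the standard library; it is a stand-in for illustration, not the client library's code.

```go
package main

import "fmt"

// exponentialBuckets returns count upper bounds, starting at start and
// multiplying by factor for each following bucket.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	b := exponentialBuckets(0.0001, 1.5, 25)
	fmt.Printf("%d buckets from %gs to about %.2fs\n", len(b), b[0], b[len(b)-1])
}
```

So the 25 buckets span roughly 0.1 ms to a little under 2 s; slower requests all land in the implicit +Inf bucket.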
Features – built-in expression browser
Features – native Grafana support
Features – PromDash
DOES IT SCALE?
Features – federation & sharding
Cluster A Cluster B
Cluster C
service metrics container metrics
SERVICE DISCOVERY
DNS SRV
$ dig +short SRV all.foo-api.srv.int.example.com
0 0 4738 ip-10-22-11-32.int.example.com.
0 0 3433 ip-10-22-11-32.int.example.com.
0 0 5934 ip-10-22-11-34.int.example.com.
0 0 5093 ip-10-22-11-42.int.example.com.
0 0 4589 ip-10-22-11-43.int.example.com.
0 0 9848 ip-10-22-12-11.int.example.com.
[...]
DNS SRV
scrape_configs:
- job_name: "foo-api"
  metrics_path: "/metrics"
  dns_sd_configs:
  - names: ["all.foo-api.srv.int.example.com"]
    refresh_interval: 10s
Fancy SD
- Consul
- Kubernetes
- Zookeeper
- EC2
- Mesos-Marathon
- … any via file-based plugins
Relabel based on SD data.
Relabeling
relabel_configs:
- action: replace
  source_labels: [__address__, __telemetry_port]
  target_label: __address__
  regex: (.+):(.+);(.+)
  replacement: $1:$3
IN
"__address__": "10.44.12.135:25431"
"__telemetry_port": "82432"
"cluster": "AB"
"environment": "production"
OUT
"__address__": "10.44.12.135:82432"
"__telemetry_port": "82432"
"cluster": "AB"
"environment": "production"
AWS EC2
scrape_configs:
- job_name: "foo-api"
  metrics_path: "/metrics"
  ec2_sd_configs:
  - region: us-east-1
    refresh_interval: 60s
    port: 80
The following meta labels are available during relabeling:
- __meta_ec2_instance_id: the EC2 instance ID
- __meta_ec2_public_ip: the public IP address of the instance
- __meta_ec2_private_ip: the private IP address of the instance, if present
- __meta_ec2_tag_<tagkey>: each tag value of the instance
AWS EC2 – relabeling
relabel_configs:
- source_labels: [__meta_ec2_tag_Type]
  action: keep
  regex: foo-api
- source_labels: [__meta_ec2_tag_Deployment]
  action: replace
  target_label: deployment
  regex: (.+)
  replacement: $1
ALERTMANAGER
Alerting
- no opinions
- directly defined on time series data
- verbose on firing ▶ compact but detailed on notification
Alerting
ALERT HighErrorRate
  IF sum by(job, path)(rate(http_requests_total{status=~"5.."}[5m])) /
     sum by(job, path)(rate(http_requests_total[5m])) * 100 > 1
  FOR 10m
  SUMMARY "high number of 5xx errors"
  DESCRIPTION "{{$labels.job}} has {{$value}}% 5xx errors on {{$labels.path}}"
Alerting
{path="/api/comments", method="POST"} 5.43
{path="/api/user/:id", method="GET"} 1.22
{path="/api/comment/:id/edit", method="POST"} 1.01
Alerting
ALERT HighErrorRate
  IF ... * 100 > 1
  FOR 10m
  WITH { severity = "warning" } …
ALERT HighErrorRate
  IF ... * 100 > 3
  FOR 10m
  WITH { severity = "critical" } …
ALERTMANAGER
alerts
silence
inhibit
group ▶ dedup ▶ route
PagerDuty
Slack
...
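The group and dedup stages of the diagram can be approximated as follows: alerts with identical label sets are treated as the same alert, and the remaining ones are bucketed by a chosen subset of labels so that each bucket produces one notification. This is only a sketch of the idea with hypothetical names, not Alertmanager's implementation.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// alert is the label set of one firing alert.
type alert map[string]string

// key builds a stable string from the given label names and their values.
func key(a alert, names []string) string {
	parts := make([]string, len(names))
	for i, n := range names {
		parts[i] = n + "=" + a[n]
	}
	return strings.Join(parts, ",")
}

// groupAlerts drops duplicates (identical label sets) and groups the
// rest by the labels listed in groupBy.
func groupAlerts(alerts []alert, groupBy []string) map[string][]alert {
	seen := map[string]bool{}
	groups := map[string][]alert{}
	for _, a := range alerts {
		names := make([]string, 0, len(a))
		for n := range a {
			names = append(names, n)
		}
		sort.Strings(names)
		id := key(a, names) // identity over all labels
		if seen[id] {
			continue
		}
		seen[id] = true
		g := key(a, groupBy)
		groups[g] = append(groups[g], a)
	}
	return groups
}

func main() {
	alerts := []alert{
		{"alertname": "HighErrorRate", "job": "api", "path": "/api/comments"},
		{"alertname": "HighErrorRate", "job": "api", "path": "/api/comments"}, // duplicate
		{"alertname": "HighErrorRate", "job": "api", "path": "/api/user/:id"},
		{"alertname": "DiskWillFillIn4Hours", "job": "node"},
	}
	groups := groupAlerts(alerts, []string{"alertname", "job"})
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s: %d alert(s) in one notification\n", k, len(groups[k]))
	}
}
```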
Alerting
ALERT DiskWillFillIn4Hours
  IF predict_linear(node_filesystem_free{job='node'}[1h], 4*3600) < 0
  FOR 5m
  SUMMARY "device filling up"
  DESCRIPTION "{{$labels.device}} mounted on {{$labels.mountpoint}} on {{$labels.instance}} will fill up within 4 hours."
http://www.robustperception.io/reduce-noise-from-disk-space-alerts/
DEMO
Turing complete
http://www.robustperception.io/conways-life-in-prometheus/
Recording rules
job:http_requests:rate5m = sum by(job) (rate(http_requests_total[5m]))