Monitoring microservices with Prometheus


Tobias Schmidt - MicroCPH May 17, 2017

github.com/grobie @dagrobie



https://prometheus.io

Monitoring

● Ability to observe and understand systems and their behavior.
  ○ Know when things go wrong
  ○ Understand and debug service misbehavior
  ○ Detect trends and act in advance
● Blackbox vs. Whitebox monitoring
  ○ Blackbox: Observes systems externally with periodic checks
  ○ Whitebox: Provides internally observed metrics
● Whitebox: Different levels of granularity
  ○ Logging
  ○ Tracing
  ○ Metrics

Monitoring

● Metrics monitoring system and time series database
  ○ Instrumentation (client libraries and exporters)
  ○ Metrics collection, processing and storage
  ○ Querying, alerting and dashboards
  ○ Analysis, trending, capacity planning
  ○ Focused on infrastructure, not business metrics
● Key features
  ○ Powerful query language for metrics with label dimensions
  ○ Stable and simple operation
  ○ Built for modern dynamic deploy environments
  ○ Easy setup
● What it’s not
  ○ Logging system
  ○ Designed for perfect answers

Prometheus

Instrumentation case study
Gusta: a simple like service

● Service to handle everything around liking a resource
  ○ List all likes on a resource
  ○ Create a like on a resource
  ○ Delete a like on a resource
● Implementation
  ○ Written in Go
  ○ Uses the gokit.io toolkit

Gusta overview


// Like represents all information of a single like.
type Like struct {
	ResourceID string    `json:"resourceID"`
	UserID     string    `json:"userID"`
	CreatedAt  time.Time `json:"createdAt"`
}

// Service describes all methods provided by the gusta service.
type Service interface {
	ListResourceLikes(resourceID string) ([]Like, error)
	LikeResource(resourceID, userID string) error
	UnlikeResource(resourceID, userID string) error
}

Gusta core
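To make the interface above concrete, here is a minimal sketch of a map-backed implementation. The talk does not show gusta's actual store or service code, so `memoryService`, `errNotFound` and `errAlreadyExists` are hypothetical names chosen for illustration only.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// Like and Service mirror the types shown on the slide.
type Like struct {
	ResourceID string    `json:"resourceID"`
	UserID     string    `json:"userID"`
	CreatedAt  time.Time `json:"createdAt"`
}

type Service interface {
	ListResourceLikes(resourceID string) ([]Like, error)
	LikeResource(resourceID, userID string) error
	UnlikeResource(resourceID, userID string) error
}

// Hypothetical sentinel errors; the real gusta error values are not shown.
var (
	errNotFound      = errors.New("not found")
	errAlreadyExists = errors.New("already exists")
)

// memoryService is an assumed, illustrative implementation backed by a map.
type memoryService struct {
	mu    sync.Mutex
	likes map[string]map[string]Like // resourceID -> userID -> Like
}

func newMemoryService() *memoryService {
	return &memoryService{likes: make(map[string]map[string]Like)}
}

func (s *memoryService) ListResourceLikes(resourceID string) ([]Like, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make([]Like, 0, len(s.likes[resourceID]))
	for _, l := range s.likes[resourceID] {
		out = append(out, l)
	}
	return out, nil
}

func (s *memoryService) LikeResource(resourceID, userID string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.likes[resourceID][userID]; ok {
		return errAlreadyExists
	}
	if s.likes[resourceID] == nil {
		s.likes[resourceID] = make(map[string]Like)
	}
	s.likes[resourceID][userID] = Like{ResourceID: resourceID, UserID: userID, CreatedAt: time.Now()}
	return nil
}

func (s *memoryService) UnlikeResource(resourceID, userID string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.likes[resourceID][userID]; !ok {
		return errNotFound
	}
	delete(s.likes[resourceID], userID)
	return nil
}

func main() {
	var s Service = newMemoryService()
	_ = s.LikeResource("r1", "u1")
	likes, _ := s.ListResourceLikes("r1")
	// Unliking a like that does not exist returns the not-found error.
	fmt.Println(len(likes), s.UnlikeResource("r1", "u2"))
}
```

Keeping the business logic behind a small interface is what allows logging and instrumentation to be layered on later without touching it.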

// main.go
var store gusta.Store
store = gusta.NewMemoryStore()

var s gusta.Service
s = gusta.NewService(store)
s = gusta.LoggingMiddleware(logger)(s)

var h http.Handler
h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP"))

http.Handle("/", h)
if err := http.ListenAndServe(*httpAddr, nil); err != nil {
	logger.Log("exit error", err)
}

Gusta server
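The `gusta.LoggingMiddleware(logger)(s)` call above follows the decorator pattern: a function takes a `Service` and returns a wrapped `Service`. The sketch below shows the idea with the standard library only. The interface is trimmed to one method, and `logFunc`, `loggingService` and `nopService` are hypothetical stand-ins, not the actual gusta or go-kit types.

```go
package main

import (
	"fmt"
	"time"
)

// Service is trimmed to a single method to keep the sketch short.
type Service interface {
	LikeResource(resourceID, userID string) error
}

// Middleware decorates a Service with extra behavior.
type Middleware func(Service) Service

// logFunc is a stand-in for a structured logger that takes alternating
// keys and values.
type logFunc func(keyvals ...interface{})

type loggingService struct {
	log  logFunc
	next Service
}

func loggingMiddleware(log logFunc) Middleware {
	return func(next Service) Service {
		return &loggingService{log: log, next: next}
	}
}

// LikeResource logs the call, its duration and its result, then delegates.
func (s *loggingService) LikeResource(resourceID, userID string) (err error) {
	defer func(begin time.Time) {
		s.log("method", "LikeResource", "ResourceID", resourceID, "UserID", userID,
			"took", time.Since(begin), "err", err)
	}(time.Now())
	return s.next.LikeResource(resourceID, userID)
}

// nopService is a hypothetical Service that always succeeds.
type nopService struct{}

func (nopService) LikeResource(string, string) error { return nil }

func main() {
	logger := func(keyvals ...interface{}) { fmt.Println(keyvals...) }
	var s Service = nopService{}
	s = loggingMiddleware(logger)(s)
	_ = s.LikeResource("r1ee85512", "ue86d7a01")
}
```

The same wrapping approach works for metrics: an instrumenting middleware can observe every call without the core service knowing about it.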

./gusta

ts=2017-05-16T19:39:34.938108068Z transport=HTTP addr=:8080

ts=2017-05-16T19:38:24.203071341Z method=LikeResource ResourceID=r1ee85512 UserID=ue86d7a01 took=10.466µs err=null

ts=2017-05-16T19:38:24.323002316Z method=ListResourceLikes ResourceID=r8669fd29 took=17.812µs err=null

ts=2017-05-16T19:38:24.343061775Z method=ListResourceLikes ResourceID=rd4ac47c6 took=30.986µs err=null

ts=2017-05-16T19:38:24.363022818Z method=LikeResource ResourceID=r1ee85512 UserID=u19597d1e took=10.757µs err=null

ts=2017-05-16T19:38:24.38303722Z method=ListResourceLikes ResourceID=rfc9a393a took=41.554µs err=null



ts=2017-05-16T19:38:20.843121594Z method=UnlikeResource ResourceID=r1ee85512 UserID=ub5e42f43 took=8.57µs err="not found"

ts=2017-05-16T19:38:20.863037026Z method=ListResourceLikes ResourceID=rfc9a393a took=27.839µs err=null


Gusta server

Basic Instrumentation
Providing operational insight

● “Four golden signals” cover the essentials
  ○ Latency
  ○ Traffic
  ○ Errors
  ○ Saturation
● Similar concepts: RED and USE methods
  ○ RED: Rate, Errors, Duration (of requests)
  ○ USE: Utilization, Saturation, Errors
● Information about the service itself
● Interaction with dependencies (other services, databases, etc.)

What information should be provided?
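As a worked example of the RED method, the three values can be derived from the difference between two snapshots of cumulative counters. This is only an arithmetic sketch: the `snapshot` and `red` types and their field names are assumptions made for illustration, and saturation is not covered.

```go
package main

import "fmt"

// snapshot holds cumulative counters as a service might expose them.
type snapshot struct {
	requests        float64 // total requests handled so far
	errors          float64 // total failed requests so far
	durationSeconds float64 // total time spent handling requests so far
}

// red holds the Rate, Errors and Duration values for one interval.
type red struct {
	rate         float64 // requests per second
	errorRatio   float64 // failed requests / all requests
	meanDuration float64 // seconds per request
}

// computeRED derives RED values from two snapshots taken intervalSeconds apart.
func computeRED(prev, cur snapshot, intervalSeconds float64) red {
	dReq := cur.requests - prev.requests
	r := red{rate: dReq / intervalSeconds}
	if dReq > 0 {
		r.errorRatio = (cur.errors - prev.errors) / dReq
		r.meanDuration = (cur.durationSeconds - prev.durationSeconds) / dReq
	}
	return r
}

func main() {
	prev := snapshot{requests: 100, errors: 5, durationSeconds: 10}
	cur := snapshot{requests: 200, errors: 30, durationSeconds: 60}
	fmt.Printf("%+v\n", computeRED(prev, cur, 50))
}
```

Exposing raw cumulative counters and computing rates at query time is the approach the rest of the talk follows.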

● Direct instrumentation
  ○ Traffic, Latency, Errors, Saturation
  ○ Service specific metrics (and interaction with dependencies)
  ○ Prometheus client libraries provide packages to instrument HTTP requests out of the box
● Exporters
  ○ Utilization, Saturation
  ○ node_exporter: CPU, memory, IO utilization per host
  ○ wmi_exporter does the same for Windows
  ○ cAdvisor (Container advisor) provides similar metrics for each container

Where to get the information from?

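// main.go
import "github.com/prometheus/client_golang/prometheus"

var registry = prometheus.NewRegistry()

// Register the Go runtime and process collectors.
// Note: this is the client_golang API as of 2017; later releases changed
// NewProcessCollector to take a prometheus.ProcessCollectorOpts value.
registry.MustRegister(
	prometheus.NewGoCollector(),
	prometheus.NewProcessCollector(os.Getpid(), ""),
)

// Pass down registry when creating HTTP handlers.
h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP"), registry)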

Initializing Prometheus client library

var h http.Handler = listResourceLikesHandler
var method, path string = "GET", "/api/v1/likes/{id}"

requests := prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name:        "gusta_http_server_requests_total",
		Help:        "Total number of requests handled by the HTTP server.",
		ConstLabels: prometheus.Labels{"method": method, "path": path},
	},
	[]string{"code"},
)
registry.MustRegister(requests)

h = promhttp.InstrumentHandlerCounter(requests, h)

Counting HTTP requests
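To illustrate what an instrumenting wrapper such as the one above has to do, here is a standard-library-only sketch that counts requests per status code. It is not how `promhttp` is implemented; `statusRecorder`, `codeCounter`, `listLikes`, `notFound` and `serve` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	code int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.code = code
	r.ResponseWriter.WriteHeader(code)
}

// codeCounter counts handled requests per HTTP status code.
type codeCounter struct {
	mu     sync.Mutex
	counts map[int]int
}

func newCodeCounter() *codeCounter { return &codeCounter{counts: make(map[int]int)} }

func (c *codeCounter) get(code int) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[code]
}

// instrument wraps next and increments the counter after each request.
func (c *codeCounter) instrument(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// 200 is the implicit status when a handler never calls WriteHeader.
		rec := &statusRecorder{ResponseWriter: w, code: http.StatusOK}
		next.ServeHTTP(rec, r)
		c.mu.Lock()
		c.counts[rec.code]++
		c.mu.Unlock()
	})
}

// listLikes and notFound are hypothetical handlers standing in for real endpoints.
var listLikes = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintln(w, "[]")
})
var notFound = http.NotFoundHandler()

// serve sends one in-process request to h and returns the response status code.
func serve(h http.Handler, method, path string) int {
	rr := httptest.NewRecorder()
	h.ServeHTTP(rr, httptest.NewRequest(method, path, nil))
	return rr.Code
}

func main() {
	c := newCodeCounter()
	serve(c.instrument(listLikes), "GET", "/api/v1/likes/r1")
	serve(c.instrument(notFound), "DELETE", "/api/v1/likes")
	fmt.Println(c.get(http.StatusOK), c.get(http.StatusNotFound))
}
```

In the real client library the status code becomes the value of the `code` label, while `method` and `path` are fixed per handler.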

var h http.Handler = listResourceLikesHandler
var method, path string = "GET", "/api/v1/likes/{id}"

requestDuration := prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:        "gusta_http_server_request_duration_seconds",
		Help:        "A histogram of latencies for requests.",
		Buckets:     []float64{.0025, .005, 0.01, 0.025, 0.05, 0.1},
		ConstLabels: prometheus.Labels{"method": method, "path": path},
	},
	[]string{},
)
registry.MustRegister(requestDuration)

h = promhttp.InstrumentHandlerDuration(requestDuration, h)

Observing HTTP request latency
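The `Buckets` option above defines upper bounds; each observation is counted in the first bucket whose bound is greater than or equal to the value, and buckets are exposed cumulatively with an implicit `+Inf` bucket at the end. The following is a simplified, non-concurrent sketch of that bookkeeping. The `histogram` type here is hypothetical and is not the client library's implementation.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// histogram is a simplified sketch of a bucketed latency histogram.
type histogram struct {
	upperBounds []float64 // sorted, without +Inf
	counts      []uint64  // per-bucket counts; the last entry is the +Inf bucket
	sum         float64
	count       uint64
}

func newHistogram(bounds []float64) *histogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &histogram{upperBounds: b, counts: make([]uint64, len(b)+1)}
}

// observe records one value.
func (h *histogram) observe(v float64) {
	// Index of the first bucket whose upper bound is >= v ("le" semantics).
	// Values above every bound land in the trailing +Inf bucket.
	i := sort.SearchFloat64s(h.upperBounds, v)
	h.counts[i]++
	h.sum += v
	h.count++
}

// cumulative returns the counts as they appear in the exposition format:
// each bucket includes all observations less than or equal to its bound.
func (h *histogram) cumulative() []uint64 {
	out := make([]uint64, len(h.counts))
	var running uint64
	for i, c := range h.counts {
		running += c
		out[i] = running
	}
	return out
}

func main() {
	h := newHistogram([]float64{.0025, .005, .01, .025, .05, .1})
	for _, v := range []float64{0.001, 0.004, 0.03, 0.2} {
		h.observe(v)
	}
	bounds := append(h.upperBounds, math.Inf(1))
	for i, c := range h.cumulative() {
		fmt.Printf("le=%g %d\n", bounds[i], c)
	}
}
```

Because only counters per bucket are stored, the cost of a histogram is independent of the number of observations.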

Exposing metrics
Observing the current state

● Prometheus is a pull-based monitoring system
  ○ Instances expose their metrics on an HTTP endpoint
  ○ Prometheus uses service discovery or static target lists to collect the state periodically
● Centralized management
  ○ Prometheus decides how often to scrape instances
● Prometheus stores the data on local disk
  ○ In a big outage, you could run Prometheus on your laptop!

How to collect the metrics?

// main.go
// ...
http.Handle("/metrics", promhttp.HandlerFor(
	registry,
	promhttp.HandlerOpts{},
))

Exposing the metrics via HTTP
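What the `/metrics` handler returns is plain text: a `# HELP` line, a `# TYPE` line and one sample per line. The sketch below writes a single counter in that format by hand, using only the standard library. The metric name `demo_requests_total` and the helpers `incRequests` and `renderMetrics` are made up for illustration; in practice the client library generates this output.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// requestsTotal is a hypothetical process-wide counter.
var requestsTotal uint64

func incRequests(n uint64) { atomic.AddUint64(&requestsTotal, n) }

// metricsHandler writes one counter in the Prometheus text exposition format.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprintln(w, "# HELP demo_requests_total Total number of requests handled.")
	fmt.Fprintln(w, "# TYPE demo_requests_total counter")
	fmt.Fprintf(w, "demo_requests_total %d\n", atomic.LoadUint64(&requestsTotal))
}

// renderMetrics performs one in-process "scrape" and returns the response body.
func renderMetrics() string {
	rr := httptest.NewRecorder()
	metricsHandler(rr, httptest.NewRequest("GET", "/metrics", nil))
	return rr.Body.String()
}

func main() {
	incRequests(3)
	fmt.Print(renderMetrics())
}
```

Since the page always reports the current cumulative state, a missed scrape loses resolution but not counts.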

curl -s http://localhost:8080/metrics | grep requests

# HELP gusta_http_server_requests_total Total number of requests handled by the gusta HTTP server.

# TYPE gusta_http_server_requests_total counter

gusta_http_server_requests_total{code="200",method="DELETE",path="/api/v1/likes"} 3

gusta_http_server_requests_total{code="200",method="GET",path="/api/v1/likes/{id}"} 429

gusta_http_server_requests_total{code="200",method="POST",path="/api/v1/likes"} 51

gusta_http_server_requests_total{code="404",method="DELETE",path="/api/v1/likes"} 14

gusta_http_server_requests_total{code="409",method="POST",path="/api/v1/likes"} 3

Request metrics

curl -s http://localhost:8080/metrics | grep request_duration

# HELP gusta_http_server_request_duration_seconds A histogram of latencies for requests.

# TYPE gusta_http_server_request_duration_seconds histogram

...

gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.00025"} 414






gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="+Inf"} 429

gusta_http_server_request_duration_seconds_sum{method="GET",path="/api/v1/likes/{id}"} 0.047897984

gusta_http_server_request_duration_seconds_count{method="GET",path="/api/v1/likes/{id}"} 429

...

Latency metrics
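The `_sum` and `_count` series give the mean latency directly (here 0.047897984 s / 429 requests, roughly 0.11 ms), while quantiles have to be estimated from the cumulative buckets. The sketch below shows one way to do that estimate with linear interpolation inside the matching bucket. It is a simplified illustration of the idea behind `histogram_quantile`, not Prometheus' exact implementation, and the bucket values in `main` are invented.

```go
package main

import (
	"fmt"
	"math"
)

// inf is the upper bound of the final, catch-all bucket.
var inf = math.Inf(1)

// bucket is one cumulative histogram bucket as exposed in the text format.
type bucket struct {
	upperBound float64 // the "le" label; inf for the last bucket
	count      float64 // cumulative count of observations <= upperBound
}

// quantile estimates the q-quantile (0 <= q <= 1) from cumulative buckets
// sorted by upper bound, where the last bucket is the +Inf bucket.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	for i, b := range buckets {
		if b.count < rank {
			continue
		}
		if math.IsInf(b.upperBound, 1) {
			// The quantile falls into the +Inf bucket: the best available
			// answer is the highest finite bound.
			return buckets[i-1].upperBound
		}
		lower, below := 0.0, 0.0
		if i > 0 {
			lower, below = buckets[i-1].upperBound, buckets[i-1].count
		}
		if b.count == below {
			return lower
		}
		// Assume observations are spread evenly within the bucket.
		return lower + (b.upperBound-lower)*(rank-below)/(b.count-below)
	}
	return math.NaN()
}

func main() {
	// Mean latency from the _sum and _count samples shown above.
	fmt.Println("mean seconds:", 0.047897984/429)

	// Invented bucket data for illustration.
	buckets := []bucket{{0.25, 50}, {0.5, 100}, {inf, 100}}
	fmt.Println("p75 estimate:", quantile(0.75, buckets))
}
```

One consequence of this estimate is that it can never exceed the largest finite bucket bound, so bucket boundaries should extend beyond any latency threshold that dashboards or alerts care about.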

curl -s http://localhost:8080/metrics | grep process

# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.

# TYPE process_cpu_seconds_total counter

process_cpu_seconds_total 892.78

# HELP process_max_fds Maximum number of open file descriptors.

# TYPE process_max_fds gauge

process_max_fds 1024

# HELP process_open_fds Number of open file descriptors.

# TYPE process_open_fds gauge

process_open_fds 23

# HELP process_resident_memory_bytes Resident memory size in bytes.

# TYPE process_resident_memory_bytes gauge

process_resident_memory_bytes 9.3446144e+07

...

Out-of-the-box process metrics

Collecting metrics
Scraping all service instances

# Scrape all targets every 5 seconds by default.
global:
  scrape_interval: 5s
  evaluation_interval: 5s

scrape_configs:
  # Scrape the Prometheus server itself.
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  # Scrape the Gusta service.
  - job_name: gusta
    static_configs:
      - targets: ['localhost:8080']

Static configuration

scrape_configs:
  # Scrape the Gusta service using Consul.
  - job_name: consul
    consul_sd_configs:
      - server: localhost:8500
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*,prod,.*
        action: keep
      - source_labels: [__meta_consul_service]
        target_label: job

Consul service discovery
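The `keep` rule above drops every discovered target whose tag list does not contain `prod`. The following sketch shows the matching logic in isolation, assuming that source label values are joined with `;`, that the regular expression is fully anchored, and that the tag list is rendered with a comma before and after every tag (which is what the regex `.*,prod,.*` relies on). The helper names `keep` and `consulTags` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// keep reports whether a target survives a "keep" relabel rule: the joined
// source label values must match the fully anchored regular expression.
// It panics on an invalid regex, which is acceptable for a sketch.
func keep(labels map[string]string, sourceLabels []string, regex string) bool {
	values := make([]string, len(sourceLabels))
	for i, name := range sourceLabels {
		values[i] = labels[name]
	}
	re := regexp.MustCompile("^(?:" + regex + ")$")
	return re.MatchString(strings.Join(values, ";"))
}

// consulTags renders tags separated and surrounded by commas (assumed format).
func consulTags(tags ...string) string {
	return "," + strings.Join(tags, ",") + ","
}

func main() {
	source := []string{"__meta_consul_tags"}
	for _, tags := range [][]string{{"web", "prod"}, {"staging"}, {"production"}} {
		labels := map[string]string{"__meta_consul_tags": consulTags(tags...)}
		fmt.Println(tags, keep(labels, source, ".*,prod,.*"))
	}
}
```

The surrounding commas are what prevent a tag such as `production` from matching a rule meant for `prod`.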

Target overview

Simple Graph UI

Simple Graph UI

Dashboards
Human-readable metrics

Grafana example

Alerts
Actionable metrics

ALERT InstanceDown
  IF up == 0
  FOR 5m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Instance down for more than 5 minutes.",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for >= 5 minutes.",
  }

ALERT RunningOutOfFileDescriptors
  IF process_open_fds / process_max_fds * 100 > 95
  FOR 2m
  ANNOTATIONS {
    summary = "Instance has many open file descriptors.",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has {{ $value }}% open descriptors.",
  }

Alert examples
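The `FOR` clause in the rules above means the condition has to hold continuously for that long before the alert fires; until then it is only pending. A small state machine captures the idea. This is a sketch of the concept, not Prometheus' rule evaluator, and `alert`, `newAlert` and `at` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"time"
)

type state string

const (
	inactive state = "inactive"
	pending  state = "pending"
	firing   state = "firing"
)

// alert tracks one alert instance with a FOR duration.
type alert struct {
	holdFor     time.Duration
	activeSince time.Time
	active      bool
}

func newAlert(holdSeconds int) *alert {
	return &alert{holdFor: time.Duration(holdSeconds) * time.Second}
}

// at returns a fixed reference time plus the given number of seconds.
func at(seconds int) time.Time { return time.Unix(int64(seconds), 0) }

// eval is called on every rule evaluation with the current condition result.
func (a *alert) eval(now time.Time, conditionTrue bool) state {
	if !conditionTrue {
		// Any evaluation where the condition is false resets the timer.
		a.active = false
		return inactive
	}
	if !a.active {
		a.active, a.activeSince = true, now
	}
	if now.Sub(a.activeSince) >= a.holdFor {
		return firing
	}
	return pending
}

func main() {
	a := newAlert(120) // comparable to "FOR 2m"
	for _, step := range []struct {
		second int
		cond   bool
	}{{0, true}, {60, true}, {120, true}, {125, false}, {130, true}} {
		fmt.Println(step.second, a.eval(at(step.second), step.cond))
	}
}
```

The waiting period filters out short blips, at the price of delaying notification by at least the `FOR` duration.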

ALERT GustaHighErrorRate
  IF sum without(code, instance) (rate(gusta_http_server_requests_total{code=~"5.."}[1m]))
     / sum without(code, instance) (rate(gusta_http_server_requests_total[1m]))
     * 100 > 0.1
  FOR 2m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "Gusta service endpoints have a high error rate.",
    description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }} returns {{ $value }}% errors.",
  }

ALERT GustaHighLatency
  IF histogram_quantile(0.95, rate(gusta_http_server_request_duration_seconds_bucket[1m])) > 0.1
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "Gusta service endpoints have a high latency.",
    description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }} has a 95th percentile latency of {{ $value }} seconds.",
  }

Alert examples

ALERT FilesystemRunningFull
  IF predict_linear(node_filesystem_avail{mountpoint!="/var/lib/docker/aufs"}[6h], 24 * 60 * 60) < 0
  FOR 1h
  ANNOTATIONS {
    summary = "Filesystem space is filling up.",
    description = "Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of space within the next 24 hours.",
  }

Alert examples
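The `predict_linear` call above fits a straight line through the last 6 hours of samples and extrapolates it 24 hours ahead; the alert triggers when the predicted free space drops below zero. The sketch below shows the underlying least-squares fit. As a simplification it extrapolates from the last sample rather than from the evaluation time, and `sample` and `predictLinear` are hypothetical names.

```go
package main

import "fmt"

// sample is one observation: t in seconds, v the measured value.
type sample struct{ t, v float64 }

// predictLinear fits a least-squares line through the samples and returns the
// value predicted horizon seconds after the last sample. The second return
// value is false when there are too few samples to fit a line.
func predictLinear(samples []sample, horizon float64) (float64, bool) {
	n := float64(len(samples))
	if n < 2 {
		return 0, false
	}
	var sumT, sumV, sumTV, sumTT float64
	for _, s := range samples {
		sumT += s.t
		sumV += s.v
		sumTV += s.t * s.v
		sumTT += s.t * s.t
	}
	denom := n*sumTT - sumT*sumT
	if denom == 0 {
		return 0, false // all samples share the same timestamp
	}
	slope := (n*sumTV - sumT*sumV) / denom
	intercept := (sumV - slope*sumT) / n
	last := samples[len(samples)-1].t
	return intercept + slope*(last+horizon), true
}

func main() {
	// Invented data: free space shrinking by 10 units per second.
	samples := []sample{{0, 100}, {1, 90}, {2, 80}}
	predicted, ok := predictLinear(samples, 9)
	fmt.Println(predicted, ok, predicted < 0)
}
```

Alerting on the prediction rather than on a fixed threshold gives a warning whose lead time does not depend on the disk size.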

Summary

● Monitoring is essential to run, understand and operate services.
● Prometheus
  ○ Client instrumentation
  ○ Scrape configuration
  ○ Querying
  ○ Dashboards
  ○ Alert rules
● Important metrics
  ○ Four golden signals: Latency, Traffic, Errors, Saturation
● Best practices

Recap

● https://prometheus.io
● Talks, Articles, Videos https://www.reddit.com/r/PrometheusMonitoring/
● Our “StackOverflow” https://www.robustperception.io/blog/
● Ask the community https://prometheus.io/community/
● Google’s SRE book https://landing.google.com/sre/book/index.html
● USE method http://www.brendangregg.com/usemethod.html
● My philosophy on alerting https://goo.gl/UnvYhQ

Sources




Thank you

Tobias Schmidt - MicroCPH May 17, 2017

github.com/grobie - @dagrobie




● High availability
  ○ Run two identical servers
● Scaling
  ○ Shard by datacenter / team / service ( / instance )
● Aggregation across Prometheus servers
  ○ Federation
● Retention time
  ○ Generic remote storage support available.
● Pull vs. Push
  ○ Doesn’t matter in practice. Advantages depend on use case.
● Security
  ○ The project focuses on being a monitoring system; securing access is left to the user.

FAQ
