Metrics and Monitoring Infrastructure: Lessons Learned Building Metrics at LinkedIn

  • Metrics and Monitoring Infrastructure

    Lessons Learned Building Metrics at LinkedIn

  • If you can't measure it, you can't manage it.

    - W. Edwards Deming

  • It is wrong to suppose that if you can't measure it, you can't manage it, a costly myth.

    - W. Edwards Deming

  • Who Am I?

  • Grier Johnson, Platform Engineer

  • Grier Johnson, Production Engineer

  • Grier Johnson, Site Reliability Engineer

  • Grier Johnson, Service Engineer

  • Grier Johnson, Production Engineer

  • Grier Johnson. All these titles mean about the same thing.

  • Built out some metrics collection at LinkedIn

  • Architected the alerting system

  • So what makes a Metrics Infrastructure?

  • The Metrics Store

  • Metric stores have high writes, low reads

  • The read requirements for historical data are even lower
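
  • One way to exploit that read pattern is tiered retention: keep raw points briefly, then roll older data up into coarser buckets. A minimal sketch in Python (the tier ages and bucket sizes are made-up examples, not a recommendation):

    from collections import defaultdict

    # Hypothetical tiered-retention policy: raw points for a day, one-minute
    # averages for a week, five-minute averages for everything older.
    RETENTION_TIERS = [
        {"max_age_s": 86_400,      "bucket_s": 1},    # raw, ~1s resolution
        {"max_age_s": 7 * 86_400,  "bucket_s": 60},   # 1-minute rollups
        {"max_age_s": 90 * 86_400, "bucket_s": 300},  # 5-minute rollups
    ]

    def rollup(points, bucket_s):
        """Average (timestamp, value) points into fixed-size time buckets."""
        buckets = defaultdict(list)
        for ts, value in points:
            buckets[ts - ts % bucket_s].append(value)
        return sorted((ts, sum(vs) / len(vs)) for ts, vs in buckets.items())

    raw = [(t, float(t % 7)) for t in range(600)]   # ten minutes of 1-second data
    print(rollup(raw, 60)[:3])                      # first three 1-minute averages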

  • Metrics Transport

  • You probably just want to use Kafka

  • Now your metrics are pub/sub

  • Add something like Hadoop for data analysis, maybe?
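
  • A minimal sketch of emitting metrics through Kafka with the kafka-python client (the broker address, topic name, and message fields below are assumptions; any pub/sub-friendly encoding works):

    import json
    import time
    from kafka import KafkaProducer   # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",                        # assumed broker
        value_serializer=lambda m: json.dumps(m).encode("utf-8"),
    )

    def emit(name, value, tags=None):
        """Publish one metric sample to a shared 'metrics' topic."""
        producer.send("metrics", {
            "name": name,
            "value": value,
            "tags": tags or {},
            "timestamp": int(time.time()),
        })

    emit("prod.frontend.requests.count", 1, {"host": "app01"})
    producer.flush()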

  • Metrics Emission

  • Standardize your metrics

  • Standardize your intervals

  • Standardize your protocols

  • Standardize your namespace
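
  • A minimal sketch of enforcing those standards at the emission layer (Python; the naming regex and the 60-second interval are illustrative choices, not a prescribed standard):

    import re
    import time

    # Hypothetical convention: lowercase, dot-separated names like
    # "env.service.metric_name", emitted on one fixed interval everywhere.
    METRIC_NAME = re.compile(r"^[a-z0-9_]+(\.[a-z0-9_]+)+$")
    EMIT_INTERVAL_S = 60

    def validate(name):
        """Reject names that would pollute the shared namespace."""
        if not METRIC_NAME.match(name):
            raise ValueError(f"metric name {name!r} violates the naming standard")
        return name

    def emit_loop(collect, send, cycles=1):
        """Collect and send metrics on the single, standardized interval."""
        for _ in range(cycles):
            for name, value in collect():
                send(validate(name), value)
            time.sleep(EMIT_INTERVAL_S)

    validate("prod.search.request_latency_ms")   # passes
    # validate("Prod Search Latency")            # would raise: non-standard name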

  • Metrics Types

  • Allow for custom metrics

  • Prefer counters

  • Histograms

  • Protect your namespace, seriously
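
  • A minimal sketch of the two workhorse types (Python; the bucket boundaries and names are illustrative). Counters only ever go up, so the reader can derive rates over any window; histograms capture distributions without shipping every sample:

    import bisect

    class Counter:
        """Monotonic counter: emit the running total, derive rates at read time."""
        def __init__(self):
            self.total = 0
        def inc(self, n=1):
            self.total += n

    class Histogram:
        """Fixed-bucket histogram; these millisecond bounds are just an example."""
        def __init__(self, bounds=(5, 10, 25, 50, 100, 250, 500, 1000)):
            self.bounds = list(bounds)
            self.counts = [0] * (len(bounds) + 1)
        def observe(self, value):
            self.counts[bisect.bisect_left(self.bounds, value)] += 1

    def rate(prev_total, cur_total, interval_s):
        """Per-second rate between two scrapes of a counter."""
        return max(cur_total - prev_total, 0) / interval_s

    requests = Counter()
    latency_ms = Histogram()
    requests.inc()
    latency_ms.observe(42)
    print(rate(prev_total=100, cur_total=160, interval_s=60))   # 1.0 req/s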

  • Metrics Presentation

  • Allow for customization: Graph size

  • Allow for customization: Start and end times

  • Allow for customization: Graph type

  • Allow for customization: Legends

  • Allow for customization: Color palette

  • Static Graphs and Links
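
  • One sketch of how those customization knobs and static links can fit together: every option is a query parameter on a render endpoint, so any graph is a plain URL you can bookmark or e-mail (the endpoint and parameter names here are hypothetical):

    from urllib.parse import urlencode

    def graph_url(metric, start, end, width=800, height=300,
                  kind="line", legend=True, palette="default"):
        """Build a static, shareable graph link with caller-chosen options."""
        params = {
            "target": metric,
            "from": start, "until": end,
            "width": width, "height": height,
            "type": kind,
            "legend": int(legend),
            "palette": palette,
        }
        return "https://graphs.example.com/render?" + urlencode(params)

    print(graph_url("prod.search.request_latency_ms.p99", "-1h", "now"))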

  • Speed Matters

  • Update frequency matters

  • Give a lot of thought to the colors. No really. People really care about this.

  • OK, now how about monitoring and alerting?

  • Data Sources

  • Metrics

  • Centralized Checks

  • Decentralized / Distributed Checks

  • Stream Processing

  • Defining Your Alerts

  • What data are we looking at?

  • What does it mean for the data to be good or bad?

  • What do we do if it's bad?
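
  • Those three questions map naturally onto a declarative definition. A minimal sketch (Python; the metric names, thresholds, and action labels are invented for illustration):

    # What data are we looking at, what makes it bad, and what do we do about it.
    ALERT_DEFINITIONS = [
        {
            "name": "search-error-rate",
            "source": "prod.search.errors.rate_per_s",     # the data
            "bad_when": lambda value: value > 5.0,          # good vs. bad
            "action": "page",                               # what we do if it's bad
        },
        {
            "name": "frontend-latency",
            "source": "prod.frontend.request_latency_ms.p99",
            "bad_when": lambda value: value > 800,
            "action": "email",
        },
    ]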

  • Processing

  • Bringing together data and definition

  • Read your data sources

  • Read your definitions

  • Apply the definitions to the data

  • Do something! Or don't.
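
  • A minimal sketch of that processing pass (Python; the stand-in reader and action are placeholders for a real metrics store and notification system):

    def evaluate(definitions, read_latest, act):
        """Read the data, read the definitions, apply one to the other, act (or don't)."""
        for d in definitions:
            value = read_latest(d["source"])          # read the data source
            if value is not None and d["bad_when"](value):
                act(d["name"], d["action"], value)    # do something
            # otherwise do nothing, which is also a valid outcome

    definitions = [{"name": "demo", "source": "demo.metric",
                    "bad_when": lambda v: v > 10, "action": "email"}]
    read_latest = lambda source: 12.0                 # stand-in for the metrics store
    act = lambda name, action, value: print(f"{action}: {name} is bad ({value})")

    evaluate(definitions, read_latest, act)           # run this on every cycle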

  • Alerting Actions

  • Nothing, nothing at all.

  • E-mail

  • Alert!

  • Run a script

  • Wait, what?

  • Use your monitoring system to respond.

  • Automation is better than alerting
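
  • A minimal sketch of "run a script" as a first-class action: try automated remediation, and only page a human when no automation exists or it fails (the script paths and alert names are hypothetical):

    import subprocess

    REMEDIATIONS = {
        "frontend-disk-full": ["/usr/local/bin/clean_old_logs.sh"],
        "search-stuck-worker": ["/usr/local/bin/restart_worker.sh", "search"],
    }

    def respond(alert_name, page):
        """Prefer automation; fall back to alerting a human."""
        script = REMEDIATIONS.get(alert_name)
        if script is None:
            page(alert_name)                          # nothing automated: wake someone up
            return
        if subprocess.run(script).returncode != 0:
            page(f"{alert_name} (auto-remediation failed)")

    respond("mystery-alert", page=lambda msg: print("PAGE:", msg))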

  • UI/UX

  • At-a-glance visual for site/service health

  • Point-and-click alert suppression

  • CLI and API for automation tasks

  • Bulk actions, please

  • Stopping Alerts

  • Suppression, or acknowledge

  • Suppression, or sleep

  • Suppression, or quiet

  • Suppression, or silence

  • Allow suppression scheduling

  • Cancelling suppressions
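
  • Whatever you call it, a suppression is just a record with an owner, a start, an expiry, and a cancel switch. A minimal sketch (Python; the fields and the four-hour default are assumptions):

    from datetime import datetime, timedelta

    class Suppression:
        def __init__(self, alert_name, owner, duration=timedelta(hours=4),
                     starts_at=None):
            self.alert_name = alert_name
            self.owner = owner
            self.starts_at = starts_at or datetime.utcnow()   # can be scheduled ahead
            self.ends_at = self.starts_at + duration          # never unlimited
            self.cancelled = False

        def active(self, now=None):
            now = now or datetime.utcnow()
            return not self.cancelled and self.starts_at <= now < self.ends_at

        def cancel(self):
            self.cancelled = True

    s = Suppression("frontend-latency", owner="oncall")
    print(s.active())   # True inside the window
    s.cancel()
    print(s.active())   # False once cancelled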

  • So what did I learn?

  • Lesson 1: Writing your Metrics Infrastructure from scratch

  • Don't Invent It Here

  • Use Open Source

  • Metrics Stores

    Netflix's Atlas (https://github.com/Netflix/atlas)

    Prometheus (https://prometheus.io)

    Rackspace's Blueflood (http://blueflood.io)

    OpenTSDB (http://opentsdb.net)

  • Alerting

    Sensu (https://sensuapp.org)

    Riemann (http://riemann.io)

    Zabbix (http://www.zabbix.com)

    Nagios (http://actually-dont.com)

  • Hybridize

  • Contribute

  • Lesson 2: When you build your metrics system from scratch anyhow

  • Redundancy doesn't matter, until your first outage with data loss

  • Care about thin bandwidth pipes

  • Distribute your stores close to the metrics creators

  • Aggregate distributed metrics at the presentation layer

  • Cache what you can

  • Use Kafka

  • Lesson 3: Graphing UIs

  • DPI Matters

  • When your graph is 500px wide, how many of the 750 data points can you show?

  • Show the highest data points?

  • The lowest?

  • The average?

  • They're all trade-offs, and someone will hate you for it.
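
  • A minimal sketch of the three obvious choices for squeezing 750 points into 500 pixels (Python); whichever one is the default, it is worth exposing the others as options:

    def downsample(points, width, how="avg"):
        """Reduce a list of values to at most `width` values."""
        if len(points) <= width:
            return points
        per_px = len(points) / width
        out = []
        for i in range(width):
            chunk = points[int(i * per_px):int((i + 1) * per_px)] or [points[-1]]
            if how == "max":
                out.append(max(chunk))                 # keeps spikes, exaggerates noise
            elif how == "min":
                out.append(min(chunk))                 # keeps troughs, hides spikes
            else:
                out.append(sum(chunk) / len(chunk))    # smooth, hides both
        return out

    data = [float(i % 50) for i in range(750)]
    print(len(downsample(data, 500)))            # 500
    print(downsample(data, 500, how="max")[:5])  # first five on-screen values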

  • JavaScript is slow

  • It looks good though

  • Hard to e-mail dynamic JavaScript graphs

  • Remember to plan for caching

  • Outages test your frontend's performance

  • Lesson 4: Metrics Discovery

  • Why have 100M metrics if you can't find them?

  • Even when you find them, you can't make sense of 1,000 metrics for one service

  • Standard names help. Namespacing helps. Not having 100M metrics helps.
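
  • With standard, dot-separated names, discovery can start as plain prefix search over the namespace. A minimal sketch (Python; the metric names are invented):

    METRICS = [
        "prod.search.request_latency_ms.p99",
        "prod.search.errors.rate_per_s",
        "prod.frontend.request_latency_ms.p99",
    ]

    def discover(prefix):
        """All metric names under a namespace prefix."""
        return sorted(m for m in METRICS if m == prefix or m.startswith(prefix + "."))

    def children(prefix):
        """The next namespace level under a prefix, for drill-down UIs."""
        depth = len(prefix.split("."))
        return sorted({m.split(".")[depth] for m in discover(prefix)})

    print(children("prod"))          # ['frontend', 'search']
    print(discover("prod.search"))   # just the search metrics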

  • Lesson 5: Alerting on Metrics

  • Alerting on absolute values is bad

  • Use standard deviation

  • Use rate of change

  • But remember: DST and holidays ruin everything
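
  • A minimal sketch of both approaches from the bullets above (Python; the window, thresholds, and values are illustrative, and neither check is DST- or holiday-aware on its own):

    import statistics

    def stddev_alert(history, latest, threshold=3.0):
        """Alert when the latest value sits more than `threshold` stddevs from the mean."""
        if len(history) < 2:
            return False
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        return stdev > 0 and abs(latest - mean) / stdev > threshold

    def rate_of_change_alert(previous, latest, interval_s, max_change_per_s):
        """Alert when the value moves faster than expected, in either direction."""
        return abs(latest - previous) / interval_s > max_change_per_s

    history = [100, 102, 98, 101, 99, 103, 97]
    print(stddev_alert(history, latest=160))                                       # True
    print(rate_of_change_alert(100, 100.5, interval_s=60, max_change_per_s=0.1))   # False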

  • Lesson 6: Alerting Overload

  • Alerts are for humans

  • Low friction for alert suppression

  • Low friction for alert changes and customization

  • Lesson 7: Alerting Levels

  • There is no Warning level for alerts

  • If it is worth alerting you, it's critical

  • Rate of change monitoring will help with most useful cases here

  • Lesson 8: Suppression Times

  • Unlimited suppression time = Regret

  • Less than unlimited is OK

  • Low friction alert modifications

  • Lesson 9: Processing your metrics from streams for alerts

  • Don't re-read your Kafka stream to build up metrics history.

  • It's right there in your metrics store

  • Lesson 10: Alerting on Logs

  • Don't

  • No really, logs are for humans, rethink your monitoring strategy

  • Lesson 11: Alerting on Exceptions

  • Maybe

  • When exceptions are rare, alerting is fine

  • There's a special place in hell for people who alert on rates of exceptions

  • I'm sure there's more. But if I'm not over time by this point, I talked REALLY FAST

  • Questions?

    Grier Johnson

    [email protected]

    @grierj on Twitter for DMs

    https://www.linkedin.com/in/grierjohnson
