www.dataloop.io | @dataloopio | [email protected] Monitoring for Online Services

Analytics driven operations - Steve Acreman - Dataloop


Page 1: Analytics driven operations - Steve Acreman - Dataloop

www.dataloop.io | @dataloopio | [email protected]

Monitoring for Online Services

Page 2: Analytics driven operations - Steve Acreman - Dataloop

What is Dataloop?

Performance

Up / Down Alerts

Dev Env

Enterprise Stuff

Page 3: Analytics driven operations - Steve Acreman - Dataloop

Architecture

Page 4: Analytics driven operations - Steve Acreman - Dataloop

First Year

Page 5: Analytics driven operations - Steve Acreman - Dataloop

First Year

Page 6: Analytics driven operations - Steve Acreman - Dataloop

Measure

Page 7: Analytics driven operations - Steve Acreman - Dataloop

Putting out the fire

rollup worker

metric worker

Page 8: Analytics driven operations - Steve Acreman - Dataloop

Problems

• NodeJS metrics workers not scaling

• Memory management was an issue

• Needed big caches to reduce database load

• GC cycles too long

• 8 x single processes on an 8 core server

Page 9: Analytics driven operations - Steve Acreman - Dataloop

Metric worker re-write

• Approximately 6 weeks from no Erlang experience to working version

• No more crashes

• Reduced servers needed from 16 to 8

• Pushes metrics straight from Rabbit into DalmatinerDB (new database)

Page 10: Analytics driven operations - Steve Acreman - Dataloop

Today

Page 11: Analytics driven operations - Steve Acreman - Dataloop

Happy Ending

Page 12: Analytics driven operations - Steve Acreman - Dataloop

Just the beginning!

Page 13: Analytics driven operations - Steve Acreman - Dataloop

Initial Instrumentation

› StatsD libraries in Node and Erlang code

› Push UDP packets to a StatsD server for aggregation
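As a rough illustration of what a StatsD client does on the wire, the sketch below builds a datagram in the plain-text StatsD line format (`<name>:<value>|<type>`) and sends it over UDP. It is written in Python for brevity rather than the Node or Erlang used in the talk. The metric names, the address and the helper names are hypothetical; port 8125 is only the conventional StatsD default.

```python
import socket


def format_statsd(name, value, metric_type):
    """Build one StatsD datagram: <name>:<value>|<type> ("c" counter, "ms" timer, "g" gauge)."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(sock, address, name, value, metric_type):
    # UDP is fire-and-forget: the sender gets no acknowledgement,
    # so a dropped packet is simply a lost data point.
    sock.sendto(format_statsd(name, value, metric_type), address)


def demo():
    # Hypothetical address; 8125 is the conventional StatsD port.
    address = ("127.0.0.1", 8125)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        send_metric(sock, address, "api.requests", 1, "c")
        send_metric(sock, address, "api.response_time", 320, "ms")
    finally:
        sock.close()
```

Calling `demo()` sends two packets whether or not anything is listening, which is the root of the lossiness noted on the next slide.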

Page 14: Analytics driven operations - Steve Acreman - Dataloop

Pitfalls

› Metrics increase as service usage increases

› UDP isn’t great

› Aggregates across a service (hard to spot an outlier)

› Quite lossy
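A small worked example of the aggregation pitfall, using made-up numbers: when response times are averaged across a service, one slow host is hidden in the mean, whereas per-host values make it obvious. The host names, values and the `outliers` helper are hypothetical and only illustrate the point.

```python
from statistics import mean, median

# Hypothetical per-host response times in milliseconds.
latency_ms = {"web-1": 100.0, "web-2": 110.0, "web-3": 90.0, "web-4": 900.0}

# What a service-wide aggregate reports: a single number for all hosts.
service_mean = mean(latency_ms.values())


def outliers(per_host, factor=2.0):
    """Return hosts whose value exceeds `factor` times the median across hosts."""
    threshold = factor * median(per_host.values())
    return sorted(host for host, value in per_host.items() if value > threshold)
```

Here the aggregate is 300 ms, which describes none of the hosts, while `outliers(latency_ms)` singles out `web-4`.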

Page 15: Analytics driven operations - Steve Acreman - Dataloop

Better Instrumentation

› Prometheus http metrics endpoints

› 10 second scrape interval into Dataloop

› Raw data (no loss)

› Dimensions allow drill down into host
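A minimal sketch of an HTTP metrics endpoint in the Prometheus text exposition format, using only the Python standard library. A real service would more likely use an official Prometheus client library; the metric name, label names and sample values here are hypothetical, and label values are assumed not to need escaping.

```python
from http.server import BaseHTTPRequestHandler


def render(metric_name, metric_type, samples):
    """Render samples as Prometheus text format.

    `samples` is a list of (labels_dict, value) pairs; each label is a
    dimension (for example host) that can later be used to drill down.
    """
    lines = [f"# TYPE {metric_name} {metric_type}"]
    for labels, value in samples:
        label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{metric_name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter values, one sample per host.
SAMPLES = [({"host": "web-1"}, 42), ({"host": "web-2"}, 17)]


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render("http_requests_total", "counter", SAMPLES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it, serve the handler with `http.server.HTTPServer(("127.0.0.1", 8000), MetricsHandler).serve_forever()` (port 8000 is an arbitrary choice) and fetch `/metrics`. Because the scraper pulls raw values over HTTP on a fixed interval, nothing is pre-aggregated or silently dropped in transit.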

Page 16: Analytics driven operations - Steve Acreman - Dataloop

Prometheus Output

curl http://localhost/metrics
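The output shown on this slide is not preserved in the transcript. As a hedged illustration of the general shape of Prometheus text output, the sketch below parses an invented sample into (name, labels, value) tuples. It is deliberately simplified: it ignores timestamps and does not handle escaped quotes or braces inside label values.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Invented sample text, not the output from the talk.
SAMPLE_TEXT = """\
# HELP http_requests_total Total requests.
# TYPE http_requests_total counter
http_requests_total{host="web-1",code="200"} 1027
process_open_fds 12
"""


def parse_metrics(text):
    """Parse Prometheus text format into a list of (name, labels, value)."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot read
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples
```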

Page 17: Analytics driven operations - Steve Acreman - Dataloop

What to instrument?

› Everything!

› Feature usage

› Throughput

› Error rates

› If it moves, instrument it

Page 18: Analytics driven operations - Steve Acreman - Dataloop

Analytics

› Simple things like API response times

Page 19: Analytics driven operations - Steve Acreman - Dataloop

Analytics

› Pretty useful to plot when a problem started

Page 20: Analytics driven operations - Steve Acreman - Dataloop

Yesterday vs. Today
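One way to think about a yesterday-versus-today comparison is as the same series overlaid on itself with a 24-hour shift. The sketch below is an assumption about the general idea, not Dataloop's implementation; the series layout (a dict of Unix timestamp to value) and the function name are hypothetical.

```python
DAY = 24 * 60 * 60  # seconds


def yesterday_vs_today(series, timestamps):
    """For each timestamp, pair today's value with the value 24 hours earlier.

    Returns (timestamp, today, yesterday, delta) rows and skips any
    timestamp where either point is missing.
    """
    rows = []
    for ts in timestamps:
        today = series.get(ts)
        yesterday = series.get(ts - DAY)
        if today is None or yesterday is None:
            continue
        rows.append((ts, today, yesterday, today - yesterday))
    return rows
```

A large delta at a given time of day suggests a change in behaviour rather than the usual daily pattern.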

Page 21: Analytics driven operations - Steve Acreman - Dataloop

SQL-like Query Language

Page 22: Analytics driven operations - Steve Acreman - Dataloop

Time Series Functions

› Create a query to answer questions

Page 23: Analytics driven operations - Steve Acreman - Dataloop

Future

› Prediction algorithms

› Search ‘similar’ metrics

› Outlier algorithms

› More functions!

Page 24: Analytics driven operations - Steve Acreman - Dataloop

Summary

› Code-level metrics with Prometheus are extremely lightweight

› Have a framework in place to quickly add more when issues arise

› Don’t wait until your first fire to start

› Start small and try to get both operations and developers on board

Page 25: Analytics driven operations - Steve Acreman - Dataloop

Q&A

Page 26: Analytics driven operations - Steve Acreman - Dataloop

www.dataloop.io

@dataloopio

[email protected]