Better service monitoring through histograms sv perl 09012016

Better service monitoring through histograms

Fred Moyer - @phredmoyer

Silicon Valley Perl, 09-01-2016

Who likes to wake up for false positives?

Synthetics

Easy to set up, but not a real user

Stephen Falken: Uh, uh, General, what you see on these screens up here is a fantasy; a computer-enhanced hallucination. Those blips are not real missiles. They're phantoms. (War Games, 1983)

Real Users

500 ms is really 2,000 ms

Spike Erosion

Threshold Based Alerting

“Alert if a request takes longer than 200 ms”

10,10,10,10,10,10,10,10,10,5000

Alerts on one outlier in 10
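The per-request rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming the 200 ms threshold and the ten samples from the slide; the function name `breaches` is hypothetical.

```python
# Per-request threshold rule: any single sample over the limit fires an alert.
THRESHOLD_MS = 200  # threshold quoted on the slide


def breaches(samples, threshold=THRESHOLD_MS):
    """Return the samples that exceed the threshold."""
    return [s for s in samples if s > threshold]


samples = [10, 10, 10, 10, 10, 10, 10, 10, 10, 5000]
offenders = breaches(samples)
alert = len(offenders) > 0  # one slow request is enough to page someone
print(alert, offenders)
```

Nine of the ten requests were fast, yet the rule still fires.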

Threshold Alerting

“Alert if request average over one minute is longer than 200 ms”

avg(10,10,210,210,210,210) = 143 (860/6)

Does not alert on multiple high samples
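The averaging rule can be checked the same way. A minimal sketch, assuming the 200 ms threshold and the six samples from the slide; `mean_alert` is a hypothetical helper name.

```python
from statistics import mean

THRESHOLD_MS = 200  # threshold quoted on the slide


def mean_alert(samples, threshold=THRESHOLD_MS):
    """Alert only if the arithmetic mean of the window exceeds the threshold."""
    return mean(samples) > threshold


window = [10, 10, 210, 210, 210, 210]
average = mean(window)  # 860 / 6
slow = sum(1 for s in window if s > THRESHOLD_MS)  # samples over the threshold
print(round(average), mean_alert(window), slow)
```

Four of the six requests were over the threshold, but the two fast ones pull the mean below it and no alert fires.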

Threshold Alerting

‘average’ eq ‘arithmetic mean’

A = S / N

A = average

N = the number of samples

S = the sum of the samples in the set

Math Refresher

median = midpoint of data set

The 50th percentile is 555 - q(0.5)

Value     111  222  333  444  555  666  777  888  999

Sample #  1    2    3    4    5    6    7    8    9

Math Refresher

90th percentile - 90% of samples below it

The 90th percentile is 1,000 - q(0.9)

Value     111  222  333  444  555  666  777  888  999  1,000  1,111

Sample #  1    2    3    4    5    6    7    8    9    10     11

Math Refresher

100th Percentile - the maximum value

The 100th percentile is 1,111 - q(1)

Value     111  222  333  444  555  666  777  888  999  1,000  1,111

Sample #  1    2    3    4    5    6    7    8    9    10     11
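The three percentile examples can be reproduced with a short Python sketch. It assumes the nearest-rank definition of a quantile (other definitions interpolate between neighbouring samples and can give different values on other data sets); the function name `q` mirrors the slide notation but the implementation is illustrative only.

```python
import math


def q(p, samples):
    """Nearest-rank quantile (an assumed definition): the smallest sample
    such that at least a fraction p of all samples are at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered))  # 1-based rank into the sorted list
    return ordered[rank - 1]


nine = [111, 222, 333, 444, 555, 666, 777, 888, 999]
eleven = nine + [1000, 1111]

median = q(0.5, nine)    # 50th percentile of the 9-sample set
p90 = q(0.9, eleven)     # 90th percentile of the 11-sample set
p100 = q(1, eleven)      # 100th percentile, i.e. the maximum
print(median, p90, p100)
```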

Math Refresher

Histogram

[Figure: histogram. x-axis: sample value; y-axis: number of samples]

Normal Distribution

[Figure: normal distribution. x-axis: sample value; y-axis: number of samples]

68% within one sigma (σ)

Non-Normal Distribution

[Figure: non-normal distribution. x-axis: sample value; y-axis: number of samples]

Operations data groups at different points

Non-Normal Distribution

Users to the right of the red line are gone
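A histogram like the ones on these slides can be built by counting samples per bucket. The sketch below is a minimal illustration: the latencies are made up, the 100 ms bucket width and the 500 ms cutoff standing in for the red line are assumptions, and `histogram` and `fraction_above` are hypothetical helper names.

```python
from collections import Counter


def histogram(samples, bucket_width):
    """Count samples per fixed-width bucket, keyed by the bucket's lower bound."""
    counts = Counter((s // bucket_width) * bucket_width for s in samples)
    return dict(sorted(counts.items()))


def fraction_above(samples, cutoff):
    """Share of samples strictly above the cutoff."""
    return sum(1 for s in samples if s > cutoff) / len(samples)


# Made-up request latencies in ms that group at different points
latencies_ms = [12, 15, 18, 22, 25, 31, 480, 510, 520, 2100]

hist = histogram(latencies_ms, bucket_width=100)
lost = fraction_above(latencies_ms, cutoff=500)  # assumed position of the red line
print(hist)
print(lost)
```

The bucket counts show the separate groupings that a single average would hide, and the second number is the share of requests that fall to the right of the cutoff.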

Request latency

“We keep hearing from people that the website is slow. But it is fine when we test it, and the request latency graph is constant”

You are only looking at part of the picture.

Heat Map

Histograms over time windows
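One way to picture a heat map is as one histogram per time window. The following sketch assumes 60-second windows and 100 ms latency buckets; the events are made up and `heat_map` is a hypothetical helper, not any particular product's API.

```python
from collections import Counter, defaultdict


def heat_map(events, window_s, bucket_ms):
    """Map each window start to a histogram {bucket lower bound: count}."""
    grid = defaultdict(Counter)
    for timestamp_s, latency_ms in events:
        window = (timestamp_s // window_s) * window_s
        bucket = (latency_ms // bucket_ms) * bucket_ms
        grid[window][bucket] += 1
    return {w: dict(sorted(c.items())) for w, c in sorted(grid.items())}


# Made-up (timestamp in seconds, latency in ms) pairs
events = [(0, 12), (5, 18), (30, 250), (61, 15), (75, 900), (119, 20)]

grid = heat_map(events, window_s=60, bucket_ms=100)
for window, counts in grid.items():
    print(window, counts)
```

Each row is a column of the heat map: the count in a cell would set how dark that cell is drawn.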

Percentiles

Practical Percentiles

Bandwidth usage is often billed at 95th percentile usage

Record 5 minute data usage intervals

Sort samples by value of sample

Throw out the highest 5% of samples

Charge usage based on the remaining top sample, i.e. 300 MB transferred over 5 minutes = 1 MB/s rate billing
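The billing steps above can be sketched as follows. The usage figures are made up, and `billable_rate_mb_per_s` is a hypothetical helper; real providers may round or interpolate differently.

```python
INTERVAL_S = 5 * 60  # each sample covers a 5 minute interval


def billable_rate_mb_per_s(usage_mb_per_interval, discard_percent=5):
    """Sort the interval samples, drop the highest 5%, bill on the top remaining one."""
    if not usage_mb_per_interval:
        raise ValueError("no usage samples")
    ordered = sorted(usage_mb_per_interval)
    discard = len(ordered) * discard_percent // 100  # number of samples thrown out
    kept = ordered[:len(ordered) - discard]
    return kept[-1] / INTERVAL_S  # MB per interval -> MB/s


# 100 made-up intervals: 90 quiet, 5 at 300 MB, and 5 bursts at 3000 MB
usage = [60] * 90 + [300] * 5 + [3000] * 5

rate = billable_rate_mb_per_s(usage)
print(rate)
```

The five bursts are discarded, so the bill is based on the 300 MB interval, which works out to 1 MB/s as in the slide's example.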

Practical Percentiles

If I measure 95th percentile per 5 minutes all month long,

I CANNOT calculate 95th percentile over the month.
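A small made-up example shows why per-window percentiles cannot be combined afterwards, while the underlying histograms can. The sketch assumes the nearest-rank quantile definition; `q` and `q_from_counts` are hypothetical helpers and the two windows are invented data.

```python
import math
from collections import Counter


def q(p, samples):
    """Nearest-rank quantile of raw samples (assumed definition)."""
    ordered = sorted(samples)
    return ordered[math.ceil(p * len(ordered)) - 1]


def q_from_counts(p, counts):
    """Nearest-rank quantile from a {value: count} histogram."""
    rank = math.ceil(p * sum(counts.values()))
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if seen >= rank:
            return value


busy = [10] * 95 + [100] * 5   # a window with 100 requests
quiet = [500] * 10             # a window with 10 slow requests

per_window = [q(0.95, busy), q(0.95, quiet)]
averaged = sum(per_window) / len(per_window)    # naive roll-up of stored percentiles
true_p95 = q(0.95, busy + quiet)                # needs every raw sample
merged = Counter(busy) + Counter(quiet)         # histograms simply add
from_histograms = q_from_counts(0.95, merged)
print(per_window, averaged, true_p95, from_histograms)
```

Averaging the two stored percentiles gives a number that matches neither window nor the combined data, whereas the merged histogram reproduces the percentile computed from all raw samples.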

Angry users

How many users are you pissing off?

Angry users

“Alert me if request latency 90th percentile over one minute is exceeded”

Percentile based alerting

q(0.9)[10,10,10,10,10,10,10,10,10,5000] == 10

Alert IS NOT triggered

Do you want to be woken up for this? NO!

“Alert me if request latency 90th percentile over one minute is exceeded”

Percentile based alerting

q(0.9)[10,10,10,10,10,10,250,300] = ~270

Alert IS triggered

Do you want to be woken up for this? YES!
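Both percentile-based cases can be sketched in Python. This is an illustration under stated assumptions: a 200 ms limit carried over from the threshold examples, the nearest-rank quantile definition, and the ten-sample single-outlier window from the threshold example earlier in the deck. With nearest rank the second window's 90th percentile comes out at 300; an interpolating definition gives a value between 250 and 300, in line with the slide's ~270, and either way it is above the limit. `percentile_alert` is a hypothetical helper name.

```python
import math

THRESHOLD_MS = 200  # assumed limit, carried over from the threshold examples


def q(p, samples):
    """Nearest-rank quantile (assumed definition)."""
    ordered = sorted(samples)
    return ordered[math.ceil(p * len(ordered)) - 1]


def percentile_alert(samples, p=0.9, threshold=THRESHOLD_MS):
    """Alert when the p-quantile of the window exceeds the threshold."""
    return q(p, samples) > threshold


one_outlier = [10] * 9 + [5000]                  # a single slow request
sustained = [10, 10, 10, 10, 10, 10, 250, 300]   # repeated slow requests

print(q(0.9, one_outlier), percentile_alert(one_outlier))
print(q(0.9, sustained), percentile_alert(sustained))
```

The lone outlier no longer pages anyone, while the window with several slow requests does.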

Percentile based alerting

Who’s using this approach?

Google.com - in-house monitoring systems

Circonus.com - hosted histogram monitoring

You? (I’ve written my own histograms but use Circonus for production systems)

Questions?

Thanks to Circonus for tools and help with math

http://www.circonus.com/free-account/

Look for future monitoring talks here soon

http://meetup.com/monitorSF