Car Alarms & Smoke Alarms [Monitorama]

Car Alarms & Smoke Alarms & Monitoring

Who’s this punk?

• Dan Slimmon

• @danslimmon on the Twitters

• Senior Platform Engineer at Exosite

• Previously Operations Team Manager at Blue State Digital

https://twitter.com/danslimmon

Learn to do some stats and visualization.

You’ll be right much more often, & people will THINK you’re right even more often than that!

Signal-To-Noise Ratio

A word problem

You’ve invented an automated test for plagiarism.

• Plagiarism: 90% chance of positive

• No Plagiarism: 20% chance of positive

• Jerkwad kids plagiarize 30% of the time

Question 1

Given a random paper, what’s the probability that you’ll get a negative result?

Question 2

If there’s plagiarism, what’s the probability you’ll detect it?

Question 3

If you get a positive result, what’s the probability that the paper is plagiarized?


[Diagram: all papers split into No Plagiarism and Plagiarism, each group split into Negative and Positive results]

Question 1

Given a random paper, what’s the probability that you’ll get a negative result?

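Question 1 can be worked out by adding up the negative results from both groups of papers. Here is a minimal sketch using the numbers from the word problem; the function and variable names are made up for illustration.

```python
# Numbers from the word problem.
P_PLAGIARISM = 0.30             # kids plagiarize 30% of the time
P_POS_GIVEN_PLAGIARISM = 0.90   # plagiarism: 90% chance of positive
P_POS_GIVEN_CLEAN = 0.20        # no plagiarism: 20% chance of positive


def p_negative(prevalence, p_pos_if_present, p_pos_if_absent):
    """Law of total probability: add the negatives from each group."""
    missed = prevalence * (1 - p_pos_if_present)             # plagiarized, negative
    cleared = (1 - prevalence) * (1 - p_pos_if_absent)       # clean, negative
    return missed + cleared


answer_q1 = p_negative(P_PLAGIARISM, P_POS_GIVEN_PLAGIARISM, P_POS_GIVEN_CLEAN)
print(f"P(negative) = {answer_q1:.0%}")
```

With these inputs, 3% of all papers are plagiarized and test negative, 56% are clean and test negative, so about 59% of random papers come back negative.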

Question 2

If the paper is plagiarized, what’s the probability that you’ll get a positive result?


Question 3

If you get a positive result, what’s the probability that the paper was plagiarized?


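Out of 100 papers, Dark Green is the 27 plagiarized papers that test positive, and Dark Blue is the 14 unplagiarized papers that test positive.

P(plagiarized | positive) = (Dark Green) / [(Dark Blue) + (Dark Green)]

P(plagiarized | positive) = 27 / (14 + 27)

P(plagiarized | positive) ≈ 65.9%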
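The 27 and the 14 come from imagining a batch of 100 papers. A small sketch of that counting argument, assuming a batch size of 100 (the helper name is hypothetical):

```python
def positives_breakdown(n_papers, prevalence, sensitivity, false_positive_rate):
    """Split the positive results of a batch into true and false positives."""
    true_pos = n_papers * prevalence * sensitivity
    false_pos = n_papers * (1 - prevalence) * false_positive_rate
    return true_pos, false_pos


# 30% plagiarism, 90% chance of positive if plagiarized, 20% if not.
true_pos, false_pos = positives_breakdown(100, 0.30, 0.90, 0.20)
p_plagiarized_given_positive = true_pos / (true_pos + false_pos)

print(f"{true_pos:.0f} true positives, {false_pos:.0f} false positives")
print(f"P(plagiarized | positive) = {p_plagiarized_given_positive:.3f}")
```

So only about two out of three positives point at a paper that was really plagiarized.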

Sensitivity & Specificity

Sensitivity:

% of actual positives that are identified as such

Specificity:

% of actual negatives that are identified as such


Sensitivity:

High sensitivity

Test is very sensitive to problems

Specificity:

High specificity

Test works for a specific type of problem

Specificity:

Probability that, if a paper isn’t plagiarized, you’ll get a negative.


Sensitivity:

Probability that, if a paper is plagiarized, you’ll get a positive.

Sensitivity: 90%

Specificity: 80%
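If you have counts of test outcomes instead of probabilities, both numbers fall out of the same four tallies. A sketch, assuming a batch of 100 papers that matches the word problem (the class name and the counts are illustrative, not measured data):

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    true_pos: int    # plagiarized, test positive
    false_neg: int   # plagiarized, test negative
    false_pos: int   # not plagiarized, test positive
    true_neg: int    # not plagiarized, test negative

    def sensitivity(self):
        """Share of actual positives that are identified as such."""
        return self.true_pos / (self.true_pos + self.false_neg)

    def specificity(self):
        """Share of actual negatives that are identified as such."""
        return self.true_neg / (self.true_neg + self.false_pos)


# 100 hypothetical papers: 30 plagiarized, 70 not.
papers = ConfusionCounts(true_pos=27, false_neg=3, false_pos=14, true_neg=56)
print(f"sensitivity = {papers.sensitivity():.0%}")
print(f"specificity = {papers.specificity():.0%}")
```

Note that neither number depends on how many papers are plagiarized in the first place. That is what prevalence adds.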

Prevalence

http://i.imgur.com/LkxcxLt.png

Positive Predictive Value

The probability that, if you get a positive result, it’s a true positive.

When you get paged at 3 AM, Positive Predictive Value is the probability that something is actually wrong.

Imagine if you will...

• Service has 99.9% uptime

• Probe has 99% sensitivity

• Probe has 99% specificity

Pretty decent, right?

Let’s calculate the PPV.

                   Condition Present    Condition Absent
Positive Result    True Positive        False Positive
Negative Result    False Negative       True Negative

The true-positive probability

Let’s calculate the probability that any given probe run will produce a true positive.

P(TP) = (prob. of service failure) * (sensitivity)

P(TP) = 0.1% * 99%

P(TP) = 0.099%

So roughly 1 in every 1000 checks will be a true positive.

The false-positive probability

P(FP) = (prob. working) * (100% - specificity)

P(FP) = 99.9% * 1%

P(FP) = 0.999%

So roughly 1 in every 100 checks will be a false positive.

Positive predictive value

PPV = P(TP) / [P(TP) + P(FP)]

PPV = 0.099% / (0.099% + 0.999%)

PPV = 9.0%

If you get a positive, there’s only a 1 in 10 chance that something’s actually wrong.
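The whole calculation fits in one small function, which makes it easy to plug in your own numbers. A sketch, assuming prevalence is simply 1 minus uptime (the function name is made up for illustration):

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(condition present | positive result)."""
    p_true_pos = prevalence * sensitivity                # P(TP)
    p_false_pos = (1 - prevalence) * (1 - specificity)   # P(FP)
    return p_true_pos / (p_true_pos + p_false_pos)


uptime = 0.999
ppv = positive_predictive_value(
    prevalence=1 - uptime,   # the service is broken 0.1% of the time
    sensitivity=0.99,
    specificity=0.99,
)
print(f"PPV = {ppv:.1%}")
```

For the probe above this comes out at roughly 9%, the "1 in 10" chance that a page means something is actually wrong.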

Why is this terrible?

Car Alarms

http://inserbia.info/news/wp-content/uploads/2013/06/carthief.jpg

Smoke Alarms

http://www.props.eric-hart.com/wp-content/uploads/2011/03/nysf_firedrill_2011.jpg

You want smoke alarms, not car alarms.

Practical Advice

(Semi-)Practical Advice

Why do we have such noisy checks?

“Office Space”, 1999.

Monty Python’s Flying Circus, 1975.

Semi-Practical Advice

Undetected outages are embarrassing, so we tend to focus on sensitivity.

That’s good.

But be careful with thresholds.


[Plot: Positive Predictive Value vs. Response Time Threshold]


Get more degrees of freedom.


[Plot: Positive Predictive Value vs. Response Time Threshold]


Hysteresis is a great way to add degrees of freedom.

• State machines

• Time-series analysis
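One way to picture hysteresis as a state machine: the check only changes state after several consecutive samples disagree with its current state. This is a sketch under assumed parameters (the class name, the 500 ms threshold and the streak lengths are all hypothetical), not a description of any particular monitoring tool.

```python
class HysteresisAlert:
    """Two-state machine (OK <-> ALERTING) that ignores brief flapping."""

    def __init__(self, threshold, trip_after=3, clear_after=3):
        self.threshold = threshold
        self.trip_after = trip_after      # consecutive bad samples needed to alert
        self.clear_after = clear_after    # consecutive good samples needed to clear
        self.alerting = False
        self._streak = 0                  # samples in a row that disagree with state

    def observe(self, value):
        breaching = value > self.threshold
        if breaching != self.alerting:
            self._streak += 1
        else:
            self._streak = 0
        needed = self.clear_after if self.alerting else self.trip_after
        if self._streak >= needed:
            self.alerting = not self.alerting
            self._streak = 0
        return self.alerting


check = HysteresisAlert(threshold=500)
response_times_ms = [100, 900, 100, 900, 900, 900, 100, 100, 900, 100, 100, 100]
states = [check.observe(rt) for rt in response_times_ms]
print(states)
```

A single slow sample no longer pages anyone, and a single fast sample no longer clears a real problem. The streak lengths are the extra degrees of freedom you can tune.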


As your uptime increases, so must your specificity.

It affects your PPV much more than sensitivity.

[Diagram labels: Specificity (False Positive Rate), Sensitivity (False Negative Rate), Uptime (Prevalence)]
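To see how fast the specificity requirement grows, you can solve the PPV formula for specificity. A sketch, assuming a 99% sensitive probe and a target PPV of 90% (both numbers, and the function name, are picked for illustration only):

```python
def required_specificity(uptime, sensitivity, target_ppv):
    """Smallest specificity that reaches target_ppv at a given uptime.

    Rearranged from PPV = P(TP) / [P(TP) + P(FP)].
    Can go negative when failures are common enough that any probe reaches the target.
    """
    prevalence = 1 - uptime
    p_true_pos = prevalence * sensitivity
    # Largest P(FP) we can tolerate and still hit the target PPV.
    allowed_false_pos = p_true_pos * (1 - target_ppv) / target_ppv
    # P(FP) = uptime * (1 - specificity), so solve for specificity.
    return 1 - allowed_false_pos / uptime


for uptime in (0.99, 0.999, 0.9999):
    spec = required_specificity(uptime, sensitivity=0.99, target_ppv=0.9)
    print(f"uptime {uptime:.2%}: specificity must be at least {spec:.5%}")
```

Every extra nine of uptime demands roughly one extra nine of specificity to keep the same PPV.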


Separate the concerns of problem detection and problem identification


• Check Apache process count

• Check swap usage

• Check median HTTP response time

• Check requests/second

Your alerting should tell you whether work is getting done.

Baron Schwartz (paraphrased)




• Check median HTTP response time

• Check requests/second




• Check median HTTP response time & requests/second
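A single detection check along these lines might look like the sketch below. How the two signals get combined is an assumption here (healthy means fast enough AND busy enough), and the function name and both thresholds are hypothetical.

```python
from statistics import median


def work_is_getting_done(response_times_ms, window_seconds,
                         max_median_ms=500.0, min_requests_per_second=10.0):
    """Detection only: is the service serving enough requests, fast enough?

    response_times_ms holds one entry per request seen in the window.
    """
    if not response_times_ms:
        return False  # no requests served at all
    requests_per_second = len(response_times_ms) / window_seconds
    fast_enough = median(response_times_ms) <= max_median_ms
    busy_enough = requests_per_second >= min_requests_per_second
    return fast_enough and busy_enough


healthy = work_is_getting_done([120, 130, 110] * 200, window_seconds=60)
slow = work_is_getting_done([900] * 600, window_seconds=60)
print(healthy, slow)
```

Apache process counts and swap usage stay out of the alert. They are for figuring out what went wrong once this check has told you that something did.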

A Pony I Want

Something like Nagios, but which

• Helps you separate detection from diagnosis

• Is SNR-aware

Other useful stuff

• Medical paper with a nice visualization: http://tinyurl.com/specsens

• Blog post with some algebra: http://tinyurl.com/carsmoke

• Base rate fallacy: http://tinyurl.com/brfallacy

• Bischeck: http://tinyurl.com/bischeck

Come find me and chat.
