Nobody likes false negatives. When your Nagios probes fail to detect a problem, it can hurt your sales, your reputation, and even your ego (especially your ego). The solution: tune the thresholds. Right? You can handle a couple spurious late-night pages if it means you’ll reliably detect real failures. I will argue that – while easy – exchanging false negatives for false positives does more harm than good. Borrowing the medical concepts of specificity and sensitivity, I’ll show how deceptive this tradeoff can be. I’ll also make the case that putting in the extra effort to minimize both types of falsehoods is necessary and healthy. When the alarm goes off, you shouldn’t have to spend precious minutes sniffing for smoke.
Car Alarms & Smoke Alarms & Monitoring
Who’s this punk?
• Dan Slimmon
• @danslimmon on the Twitters
• Senior Platform Engineer at Exosite
• Previously Operations Team Manager at Blue State Digital
Learn to do some stats and visualization.
You’ll be right much more often, & people will THINK you’re right even more often than that!
Signal-To-Noise Ratio
A word problem
You’ve invented an automated test for plagiarism.
• Plagiarism: 90% chance of positive
• No Plagiarism: 20% chance of positive
• Jerkwad kids plagiarize 30% of the time
A word problem
Question 1
Given a random paper, what’s the probability that you’ll get a negative result?
• Plagiarism: 90% chance of positive
• No Plagiarism: 20% chance of positive
• 30% chance of plagiarism
Question 2
If there’s plagiarism, what’s the probability PLAJR will detect it?
• Plagiarism: 90% chance of positive
• No plagiarism: 20% chance of positive
• 30% chance of plagiarism
Question 3
If you get a positive result, what’s the probability that the paper is plagiarized?
• Plagiarism: 90% chance of positive
• No plagiarism: 20% chance of positive
• 30% chance of plagiarism
[Figure: all papers divided into No Plagiarism and Plagiarism, each group split into Negative and Positive results]
Question 1
Given a random paper, what’s the probability that you’ll get a negative result?
[Figure: No Plagiarism and Plagiarism groups, each split into Negative and Positive results]
Question 2
If the paper is plagiarized, what’s the probability that you’ll get a positive result?
[Figure: No Plagiarism and Plagiarism groups, each split into Negative and Positive results]
Question 3
If you get a positive result, what’s the probability that the paper was plagiarized?
[Figure: No Plagiarism and Plagiarism groups, each split into Negative and Positive results; positives from plagiarized papers are Dark Green, positives from non-plagiarized papers are Dark Blue]
P(plagiarized | positive) = (Dark Green) / [(Dark Blue) + (Dark Green)]
P(plagiarized | positive) = 27 / (14 + 27)
P(plagiarized | positive) = 65.9%
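The three questions can be checked with a few lines of arithmetic. This is a minimal sketch: the numbers come from the word problem, while the variable names are made up for illustration.

```python
# Numbers from the word problem; variable names are illustrative only.
p_plagiarism = 0.30        # 30% of papers are plagiarized
p_pos_given_plag = 0.90    # chance of a positive when there is plagiarism
p_pos_given_clean = 0.20   # chance of a positive when there is none

# Joint probabilities for the four regions of the diagram.
true_pos = p_plagiarism * p_pos_given_plag
false_neg = p_plagiarism * (1 - p_pos_given_plag)
false_pos = (1 - p_plagiarism) * p_pos_given_clean
true_neg = (1 - p_plagiarism) * (1 - p_pos_given_clean)

# Question 1: any negative result, whether or not the paper is plagiarized.
q1_negative = false_neg + true_neg
# Question 2: positives among plagiarized papers.
q2_detect = true_pos / (true_pos + false_neg)
# Question 3: plagiarized papers among positives.
q3_plag_given_pos = true_pos / (true_pos + false_pos)

print(f"Q1: P(negative) = {q1_negative:.1%}")
print(f"Q2: P(positive | plagiarism) = {q2_detect:.1%}")
print(f"Q3: P(plagiarism | positive) = {q3_plag_given_pos:.1%}")
```

Scaled to 100 papers, `true_pos` and `false_pos` are the 27 and 14 in the fraction above.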
Sensitivity & Specificity
Sensitivity:
% of actual positives that are identified as such
Specificity:
% of actual negatives that are identified as such
Sensitivity & Specificity
Sensitivity:
High sensitivity
Test is very sensitive to problems
Specificity:
High specificity
Test works for a specific type of problem
Sensitivity & Specificity
Sensitivity:
Probability that, if a paper is plagiarized, you’ll get a positive. (90%)
Specificity:
Probability that, if a paper isn’t plagiarized, you’ll get a negative. (80%)
Specificity
Sensitivity
Prevalence
http://i.imgur.com/LkxcxLt.png
Positive Predictive Value
The probability that, if you get a positive result, it’s a true positive.
When you get paged at 3 AM, Positive Predictive Value is the probability that something is actually wrong.
Imagine if you will...
• Service has 99.9% uptime
• Probe has 99% sensitivity
• Probe has 99% specificity
Pretty decent, right?
Let’s calculate the PPV.
                   Condition Present    Condition Absent
Positive Result    True Positive        False Positive
Negative Result    False Negative       True Negative
The true-positive probability
Let’s calculate the probability that any given probe run will produce a true positive.
P(TP) = (prob. of service failure) * (sensitivity)
P(TP) = 0.1% * 99%
P(TP) = 0.099%
So roughly 1 in every 1000 checks will be a true positive.
The false-positive probability
P(FP) = (prob. working) * (100% - specificity)
P(FP) = 99.9% * 1%
P(FP) = 0.999%
So roughly 1 in every 100 checks will be a false positive.
Positive predictive value
PPV = P(TP) / [P(TP) + P(FP)]
PPV = 0.099% / (0.099% + 0.999%)
PPV = 9.0%
If you get a positive, there’s less than a 1 in 10 chance that something’s actually wrong.
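The same calculation works for any probe. Below is a minimal sketch; the function name and its parameters are illustrative, not part of Nagios or any library.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that a positive result is a true positive.

    prevalence  -- fraction of checks run while the service is actually failing
    sensitivity -- P(positive | failing)
    specificity -- P(negative | working)
    """
    p_true_pos = prevalence * sensitivity
    p_false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return p_true_pos / (p_true_pos + p_false_pos)


# Scenario from the slides: 99.9% uptime means a failure prevalence of 0.1%.
ppv = positive_predictive_value(prevalence=0.001, sensitivity=0.99, specificity=0.99)
print(f"PPV = {ppv:.1%}")
```

For the 99.9% uptime scenario this comes out at roughly 9%. With the plagiarism numbers (prevalence 0.3, sensitivity 0.9, specificity 0.8) it gives 27 / 41.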
Why is this terrible?
Car Alarms
http://inserbia.info/news/wp-content/uploads/2013/06/carthief.jpg
Smoke Alarms
http://www.props.eric-hart.com/wp-content/uploads/2011/03/nysf_firedrill_2011.jpg
You want smoke alarms, not car alarms.
Practical Advice
(Semi-)Practical Advice
Why do we have such noisy checks?
“Office Space”, 1999.
Monty Python’s Flying Circus, 1975.
Semi-Practical Advice
Undetected outages are embarrassing, so we tend to focus on sensitivity.
That’s good.
But be careful with thresholds.
Semi-Practical Advice
[Figure: Positive Predictive Value vs. Response Time Threshold]
Semi-Practical Advice
Get more degrees of freedom.
Semi-Practical Advice
[Figure: Positive Predictive Value vs. Response Time Threshold]
Semi-Practical Advice
Hysteresis is a great way to add degrees of freedom.
• State machines
• Time-series analysis
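One way to picture the state-machine flavor of hysteresis is a check that flips to alerting only after several consecutive bad samples and flips back only after several consecutive good ones. This is a sketch under assumptions: the class name, the parameter names, and the thresholds of 3 and 2 are all invented for illustration and are not from the talk.

```python
class HysteresisCheck:
    """Two-state machine: OK <-> ALERTING, with separate streak thresholds."""

    def __init__(self, fail_after=3, recover_after=2):
        self.fail_after = fail_after        # bad samples in a row needed to alert
        self.recover_after = recover_after  # good samples in a row needed to clear
        self.alerting = False
        self._streak = 0  # consecutive samples that contradict the current state

    def observe(self, sample_ok):
        """Feed one probe result; return True while the check is alerting."""
        contradicts = sample_ok if self.alerting else not sample_ok
        self._streak = self._streak + 1 if contradicts else 0
        needed = self.recover_after if self.alerting else self.fail_after
        if self._streak >= needed:
            self.alerting = not self.alerting
            self._streak = 0
        return self.alerting


check = HysteresisCheck()
samples = [True, False, True, False, False, False, True, True]
states = [check.observe(ok) for ok in samples]
print(states)
```

In this run the isolated failure in the second sample never pages anyone; only the run of three failures does. Requiring a streak raises specificity, at the cost of detecting real failures a few samples later.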
Semi-Practical Advice
As your uptime increases, so must your specificity.
It affects your PPV much more than sensitivity.
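To see how lopsided the two effects are, the sketch below improves each one in turn for the 99.9% uptime probe, then holds the probe fixed while uptime rises. The helper name `ppv` and the specific values tried are assumptions chosen for illustration.

```python
def ppv(uptime, sensitivity, specificity):
    """Positive predictive value for a probe on a service with the given uptime."""
    true_pos = (1.0 - uptime) * sensitivity
    false_pos = uptime * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)


uptime = 0.999
baseline = ppv(uptime, sensitivity=0.99, specificity=0.99)
better_sensitivity = ppv(uptime, sensitivity=0.999, specificity=0.99)
better_specificity = ppv(uptime, sensitivity=0.99, specificity=0.999)

print(f"baseline:           {baseline:.1%}")
print(f"better sensitivity: {better_sensitivity:.1%}")
print(f"better specificity: {better_specificity:.1%}")

# Same 99%/99% probe, increasingly reliable service.
for u in (0.99, 0.999, 0.9999):
    print(f"uptime {u:.2%}: PPV = {ppv(u, 0.99, 0.99):.1%}")
```

Cutting the false-negative rate tenfold barely moves the PPV (it stays near 9%), while cutting the false-positive rate tenfold lifts it to roughly 50%. With the probe held fixed, the PPV drops from 50% at 99% uptime to about 1% at 99.99% uptime.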
[Figure labels: Specificity, Sensitivity, Uptime, Prevalence, False Positive Rate, False Negative Rate]
Semi-Practical Advice
Separate the concerns of problem detection and problem identification
Semi-Practical Advice
• Check Apache process count
• Check swap usage
• Check median HTTP response time
• Check requests/second
Your alerting should tell you whether work is getting done.
Baron Schwartz (paraphrased)
Semi-Practical Advice
• Check Apache process count
• Check swap usage
• Check median HTTP response time & requests/second
A Pony I Want
Something like Nagios, but which
• Helps you separate detection from diagnosis
• Is SNR-aware
Other useful stuff
• Medical paper with a nice visualization: http://tinyurl.com/specsens
• Blog post with some algebra: http://tinyurl.com/carsmoke
• Base rate fallacy: http://tinyurl.com/brfallacy
• Bischeck: http://tinyurl.com/bischeck
Come find me and chat.