Richard Bijjani
JUMPING TO CONCLUSIONS (Generating Improbable Insights)
Richard Robehr Bijjani, Ph.D.
OPEN DATA SCIENCE CONFERENCE
BOSTON 2015
@opendatasci
• IMAGE: Chart of controlled vs uncontrolled?
• How best to change behaviors
1/3 of all deaths globally are
from cardiovascular disease
SOURCE: WORLD HEALTH ORGANIZATION
SOURCE: Mayo Clinic
#1 risk factor is high blood pressure
SOURCE: WORLD HEALTH ORGANIZATION
1,000,000,000
CONTINUOUS PASSIVE
CLINICALLY MEANINGFUL
BEHAVIORAL INSIGHTS
CONTEXTUALIZED CARDIOVASCULAR HEALTH
CONTEXTUALIZED HUMAN HEALTH
Quanttus is always on!
We capture > 50 million data
points and > 400,000 vital sign
measurements / person / day.
Data Science @ Quanttus
Data Science ≡ Extraction of Actionable Knowledge from Data
Actionable Knowledge
Better Decisions
Meaningful Insights
Knowledge is actionable iff it has predictive power
(not just an ability to explain the past)
JUMPING TO CONCLUSIONS
Illusion of Knowledge
fatal
The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge.
-Stephen Hawking
Illusion of Knowledge
Courtesy of National Geographic
First a Joke!
A police officer approaches a man intently searching the ground under a lamppost
• Policeman: What are you doing?
• Man: Looking for my car keys
The officer helps for a few minutes without success
• Policeman: Are you certain you dropped your keys near here?
• Man: No! I remember dropping them across the street.
• Policeman (very irritated): Why are you looking for them here, then?
• Man: The light is much better here!
Why Scientific Studies are so often Wrong
• Researchers tend to look for answers where the looking is good, rather than where the answers are likely to be hiding. (David Freedman)
• 15/45 of the most prominent studies published in the top medical journals were ultimately refuted.
• 2/3 of all medical studies are wrong.
• 9/10 of leading-edge studies (like those linking a disease to a specific gene) are wrong.
John Ioannidis, University of Ioannina
10% to 20% of cases: delayed, missed, and incorrect diagnosis
Garber et al., JAMA, 2005
40,000+ patients in US ICUs may die with a misdiagnosis annually
Winters et al., BMJ Quality & Safety, 2012
50% of MDs are below average
Vinod Khosla
Are you Immune to the Streetlight Effect?
• Think of the data you are working with: is it the ideal data, or just the conveniently available data?
• When was the last time you worked with ideal data?
• Have you ever?
Expert Consensus
[Figure: seventeen experts’ estimates of the effect of screening on colon cancer deaths. Horizontal axis: proportion of colon cancer deaths prevented, 0% to 100%.]
Should you trust your Dr.?
• Depends.
• If your ailment is common, your Dr. will do a decent job.
• If you’re suffering from a relatively uncommon disease, not so well.
“If you don’t find it often, you often don’t find it”, Jeremy Wolfe
Weak Link, Humans!
1. Signals with low predictive value are not very useful
1 in 1000 does not hold one’s attention for long
2. Attention directed at one thing is attention drawn away from something else
Lost research/testing/treatment opportunities
Data Scientists’ Tools of Choice
• Some scientists use only techniques they feel comfortable with
• Others latch on to new ones without fully understanding them.
• Some just rely on available methods built into their software.
The Nonsense Asymmetry Principle
The amount of energy needed to refute ‘Nonsense’ is an order of magnitude bigger than to produce it.
- Alberto Brandolini
Data Scientific Method
• Validation can ONLY occur by measuring the predictive power of the insights, in addition to their ability to explain the past
• Data Science is Science and hence follows the Scientific Method
[Flowchart]
1. Ask Relevant Questions
2. Research. Gather Data. Validate Data.
3. Construct Hypothesis
4. Design Experiment. Test Hypothesis.
5. Analyze Results
• Hypothesis is True: Report Results
• Hypothesis not Valid: return to Construct Hypothesis
Ask the Important Question
• Deadly Virus
• Infects 1 in 1 million
• Diagnostic Test developed with 99.9% Sensitivity and Specificity
• Treatment developed
  • 99% Curative
  • 1% Deadly side effect
• Question: Would you recommend Diagnosis?
  • Would you recommend Treatment?
Efficacy of Treatment
• Do Nothing:
  • 300 People Infected
  • 300 People will Die
300M test subjects
            Predicted Negative    Predicted Positive
Normal      299.7M                300,000
Infected    <1                    >299
• Diagnose and Treat:
  • Infected Population:
    • 296 Cured
    • 3 + 1 Die
  • Non-Infected Population:
    • 297,000 Unaffected (except for the scare)
    • 3,000 Die
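The arithmetic behind this slide can be reproduced in a few lines. The sketch below uses only the numbers stated on the slides (300M people, 1-in-1-million prevalence, 99.9% sensitivity and specificity, 1% deadly side effect); the variable names are illustrative, and it assumes everyone who tests positive is treated.

```python
# Assumptions taken from the slides: 300M people, prevalence 1 in 1 million,
# test sensitivity = specificity = 99.9%, treatment kills 1% of those treated.
population = 300_000_000
prevalence = 1 / 1_000_000
sensitivity = 0.999
specificity = 0.999
treatment_death_rate = 0.01

infected = population * prevalence               # about 300 people
healthy = population - infected

true_positives = infected * sensitivity          # infected and detected
false_negatives = infected - true_positives      # infected and missed
false_positives = healthy * (1 - specificity)    # healthy but flagged

# Option 1, do nothing: every infected person dies.
deaths_do_nothing = infected

# Option 2, diagnose and treat everyone who tests positive.
deaths_missed = false_negatives
deaths_treated_infected = true_positives * treatment_death_rate
deaths_treated_healthy = false_positives * treatment_death_rate
deaths_diagnose_and_treat = (
    deaths_missed + deaths_treated_infected + deaths_treated_healthy
)

# Probability that a positive test result is a real infection.
positive_predictive_value = true_positives / (true_positives + false_positives)

print(f"do nothing: {deaths_do_nothing:.0f} deaths")
print(f"diagnose and treat: {deaths_diagnose_and_treat:.0f} deaths")
print(f"P(infected | positive) = {positive_predictive_value:.3%}")
```

Treating every positive result kills roughly ten times as many people as doing nothing, because about 999 of every 1,000 positives are false alarms. The slide rounds the fractional missed case up to one person, so its total is slightly higher than the computed value.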
Good Practices
• Understand where the data comes from.
• Pre-process / Clean your data, but keep validated outliers.
• Own the tools and adapt them to your own requirements.
• Follow the Scientific Method.
• Analyze data to answer the question posed.
• Save a list of other interesting questions for later.
• Share your hypotheses with the team.
• Simple is better; at the very least, make sure it’s deployable.
• Test, Validate, re-test.
• Communicate results correctly and set the right expectations.
"If you torture the data long enough, it will confess to anything." - Hal Varian
Pitfalls of data mining
• The hope: data miners pore over large, diffuse sets of raw data trying to discern patterns that would otherwise go undetected.
• The dark side of data mining is to pick and choose from a large set of data to try to explain a small one.
• “Given enough time, enough attempts and enough imagination, almost any conclusion can be teased out of any set of data.”
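A small simulation makes the pitfall concrete. The sketch below is an illustration, not something from the talk: it generates a pure-noise target and 1,000 pure-noise candidate features (all sizes and the seed are arbitrary assumptions), picks the feature that best "explains" the target on 20 samples, then checks that same feature on 20 held-out samples.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)          # fixed seed so the run is repeatable
n_discovery, n_holdout, n_features = 20, 20, 1000
n_total = n_discovery + n_holdout

# Everything is independent Gaussian noise: there is nothing to find.
target = [rng.gauss(0, 1) for _ in range(n_total)]
features = [[rng.gauss(0, 1) for _ in range(n_total)] for _ in range(n_features)]

# "Mine" the discovery half for the feature that correlates best with the target.
def discovery_r(feature):
    return abs(pearson(feature[:n_discovery], target[:n_discovery]))

best_feature = max(features, key=discovery_r)
best_r = discovery_r(best_feature)

# Predictive check: how does the chosen feature do on data it was not picked on?
holdout_r = pearson(best_feature[n_discovery:], target[n_discovery:])

print(f"best |r| on discovery data: {best_r:.2f}")
print(f"same feature on held-out data: {holdout_r:.2f}")
```

With this many attempts, a strong-looking correlation on the discovery half is almost guaranteed even though the data is noise, which is why the held-out check is the one that matters.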
Limitations of Common Data Mining Techniques
• Automated feature selection methods cannot be applied to rare (or unforeseen) events
• Normal events are similar; rare events are by definition unique
• Accuracy measurements are not appropriate
• Real-time detection of rare events is necessary, but machine learning techniques construct models based on the past
• If you haven’t seen it yet, you cannot detect it!
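Why accuracy is the wrong yardstick for rare events can be shown with a toy case. The sketch below assumes a hypothetical data set with one rare event in 1,000 samples and a "classifier" that always predicts the normal class.

```python
# Hypothetical labels: 1 rare event among 1,000 samples (1 = rare event).
labels = [0] * 999 + [1]

# A model that has never seen a rare event and always predicts "normal".
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall: the fraction of actual rare events that were detected.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
actual_positives = sum(labels)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.1%}, recall = {recall:.0%}")
```

The model scores 99.9% accuracy while detecting none of the events it exists to find, so a metric tied to the rare class (recall, precision, or an explicit cost) is the more honest measure.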
What are rare/high-value events?
• Rare or Outliers
  • Occur less than 1% of the time
  • For large datasets, many samples exist. Balance could be achieved and traditional Data Mining Techniques could be applied
    • Preferential sampling of rare class
    • Under-sampling of majority class
• Extremely Rare or Anomalies
  • Statistical chance of detection is zero
  • Most databases don’t ‘naturally’ contain any samples
  • Properties of target samples are not known
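For the merely rare case, under-sampling the majority class is easy to sketch. The function name, the `ratio` parameter, and the data below are illustrative assumptions; this only works when the data set already contains rare samples, which is exactly what fails for extremely rare events.

```python
import random

def undersample_majority(samples, labels, rare_label=1, ratio=1.0, seed=0):
    """Keep every rare sample and a random subset of the majority class.

    ratio is the desired number of majority samples per rare sample.
    """
    rng = random.Random(seed)
    pairs = list(zip(samples, labels))
    rare = [pair for pair in pairs if pair[1] == rare_label]
    majority = [pair for pair in pairs if pair[1] != rare_label]
    n_keep = min(len(majority), int(len(rare) * ratio))
    balanced = rare + rng.sample(majority, n_keep)
    rng.shuffle(balanced)
    return [s for s, _ in balanced], [y for _, y in balanced]

# Hypothetical data: 10,000 samples, 0.5% of them in the rare class.
samples = list(range(10_000))
labels = [1 if i % 200 == 0 else 0 for i in samples]

x_balanced, y_balanced = undersample_majority(samples, labels)
print(len(x_balanced), "samples,", sum(y_balanced), "rare")
```

All 50 rare samples are kept and the majority class is cut down to match, at the price of discarding most of the normal data.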
What are the costs of such events?
Cost Functions not easily Defined
Anomalies vs. High Value Rare Events
By definition, anomalies are the exception, but not necessarily rare and/or of high value.
• Anomaly? Yes
• Rare Event? No
Extremely Rare, High Value Events
Case Study: Terrorism, specifically explosive detection
Finding Commercial Explosives
Data could be collected and/or simulated, allowing for rare-class augmentation.
Finding Explosives
Data cannot be collected and/or simulated.
Why incompatible?
• No Quality Control
Suicide Bomb Trainer in Iraq Accidentally Blows Up His Class
Terrorist ‘lab’ (redacted)
The Quanttus Vision
Takeaway
• We are drowning in data, yet starving for knowledge
• In the case of rare events, data may not be enough; the source of the data needs to be well understood
• To detect rare events: Sometimes it’s just more effective to generate heuristics
• Heuristics cannot predict, while machine learning assumes the future will resemble the past, and extremely rare events are not part of the past
• What to do?
Outliers revisited
1. Retain outliers in the data set for analysis.
2. Exclude only those that are known to be due to defective measurements or transcription errors.
  • Need to understand data origin to accurately separate rare events from measurement errors
3. Do not assume a normal distribution.
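One way to act on these three rules is sketched below. It flags candidate outliers with Tukey's interquartile-range fences (a swapped-in technique that does not assume normality, not something named on the slide) and drops only the readings listed as known measurement errors. The readings and the error list are hypothetical.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return indices of values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

# Hypothetical resting heart-rate readings (beats per minute).
readings = [62, 64, 61, 63, 65, 60, 66, 63, 140, 62, 0, 64]

# Assumption: index 10 is known to be a sensor dropout, learned from the data's origin.
known_measurement_errors = {10}

flagged = tukey_outliers(readings)

# Rule 2: exclude only the known errors. Rule 1: every other outlier stays in.
cleaned = [v for i, v in enumerate(readings) if i not in known_measurement_errors]
retained_outliers = [readings[i] for i in flagged if i not in known_measurement_errors]

print("flagged indices:", flagged)
print("outliers kept for analysis:", retained_outliers)
```

The reading of 140 is flagged but kept, because nothing about the data's origin says it is an error; it may be the rare event worth studying.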
Data Mining Techniques
• Supervised
• Pro: Human readable models
• Con: Requires labeled data
• Unsupervised
• Pro: Deviation detection, no labeling needed
• Con: Requires similarity measures, high false alarm (due to benign yet previously unseen data)
Unsupervised Techniques
• Outlier datum defined as different from the rest of the data
• Rare event: Same definition
• Detection Approaches
  • Statistics Based
  • Distance Based
  • Model Based
Unsupervised: Statistics
• Data modeled using stochastic distribution
• Advantages: no a priori knowledge required
• Disadvantages:
  • Fails with high dimensions (curse of dimensionality)
  • Does not identify patterns of rare events
• Sample implementations:
  • Finite Mixtures Schemes, e.g. SmartSifter. Use histogram density to represent probability distribution
  • Blocked Adaptive Computationally Efficient Outlier Nominator, BACON
  • Probability Distributions
  • Entropy Measures
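The histogram-density idea can be illustrated in one dimension. The sketch below is a deliberately simplified stand-in, not SmartSifter or BACON: it scores each value by the share of the data in its histogram bin and flags values in sparse bins. The bin width and the 0.1 threshold are assumptions chosen for the example.

```python
from collections import Counter

def histogram_density(values, bin_width):
    """Score each value by the fraction of the data that shares its histogram bin."""
    bins = [int(v // bin_width) for v in values]
    counts = Counter(bins)
    return [counts[b] / len(values) for b in bins]

# Hypothetical one-dimensional measurements.
values = [9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 10.1, 17.5]
density = histogram_density(values, bin_width=1.0)

# Flag values that fall in sparsely populated bins (threshold is an assumption).
outliers = [v for v, d in zip(values, density) if d <= 0.1]
print(outliers)
```

No prior knowledge of the data is needed, but the number of bins grows exponentially with the number of dimensions, which is the curse of dimensionality listed above.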
Unsupervised: Distance
• Distance computed between neighbors, and data points sorted
• Advantages: no a priori knowledge required
• Disadvantages:
  • Not suitable for rare classes
• Sample implementations:
  • k-Nearest Neighbor
  • Mahalanobis Distance for skewed distributions
  • Local Outlier Factor (LOF) for variable-density clusters (the average distance between points differs from cluster to cluster)
  • Specialized Clustering, Canopy, FindOut
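A minimal sketch of the k-nearest-neighbor variant: each point is scored by the distance to its k-th nearest neighbor, and the points are then sorted by that score. The points and the choice k=3 are illustrative assumptions; the brute-force loop is quadratic and only suitable for small data.

```python
import math

def knn_outlier_scores(points, k=3):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical 2-D data: a tight cluster plus one isolated point.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (0.8, 0.9), (6.0, 6.5)]

scores = knn_outlier_scores(points, k=3)

# Sort points from most to least isolated.
ranked = sorted(zip(scores, points), reverse=True)
most_isolated = ranked[0][1]
print(most_isolated)
```

A single isolated point stands out clearly. A small cluster of rare events would not, since its members are each other's near neighbors, which is the "not suitable for rare classes" limitation.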
Unsupervised: Model
• Predict normal behavior via model
• Capture deviations
• Detection Approaches
  • Neural Networks, 4 layers, input = output
  • Unsupervised support vector machines (SVM)
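The predict-then-compare pattern does not depend on the model family. The sketch below swaps in a least-squares line for the neural network or SVM named above, purely to keep the example short: it learns "normal" behavior from training data and flags new points whose residual exceeds three standard deviations of the training residuals. The data and the 3-sigma cut-off are assumptions.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical "normal" behavior: y is roughly 2x + 1.
train_x = [0, 1, 2, 3, 4, 5, 6, 7]
train_y = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.9, 15.1]

slope, intercept = fit_line(train_x, train_y)

# Spread of the model's errors on normal data sets the deviation threshold.
residuals = [y - (slope * x + intercept) for x, y in zip(train_x, train_y)]
threshold = 3 * statistics.pstdev(residuals)

def is_deviation(x, y):
    """True when the observation is far from what the model predicts."""
    return abs(y - (slope * x + intercept)) > threshold

new_points = [(8, 17.0), (9, 19.2), (10, 30.0)]
deviations = [p for p in new_points if is_deviation(*p)]
print(deviations)
```

Only the last point departs from the learned behavior. An autoencoder-style network (input = output) follows the same logic, with reconstruction error in place of the residual.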
Supervised Techniques
• Classification methods typically not suitable:
  • Problem: Lack of labels
  • Possible Solution: Balance the class size
• Duplicate rare events or down-size normal events
• Generate anomalies inversely proportional with data density
• Synthetically generate minority over-sampled events (e.g. SMOTE)
• Classify regions as ‘positive’ without having enough data in them
• Shrink: look for presence of positive labels, not majority
• PN-rule: Find regions of high recall (Pd), then prune false positives, then classify (avoid over-fitting)
• Decision Tree methods: Ripple Down Rules, CREDOS, Boosting Classifiers, Random Forest
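The synthetic over-sampling idea behind SMOTE can be sketched briefly. This is a simplified illustration of the interpolation step only, not the reference algorithm or a library implementation; the function name, parameters, and points are assumptions.

```python
import math
import random

def smote_like(rare_points, n_new, k=2, seed=0):
    """Create synthetic rare samples by interpolating between a rare point
    and one of its k nearest rare neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rare_points)
        candidates = [p for p in rare_points if p != base]
        neighbors = sorted(candidates, key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbors)
        t = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical rare-class samples in two dimensions.
rare = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 1.8)]
new_points = smote_like(rare, n_new=6)

for point in new_points:
    print(tuple(round(c, 2) for c in point))
```

Every synthetic point lies on a segment between two real rare samples, so the method can fill in a known rare region but cannot invent rare events of a kind that was never observed.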
Cost Functions
• In any classification problem, one needs to minimize the cost function
• Selecting an appropriate cost function is key
• Weighting is also important, not all data points are created equal.
• Bayesian Thinking is necessary
• Temporal (time-series) Analysis requires different approaches.
  • Is current data ‘surprising’ based on historical data created with the same underlying process?
  • Opportunity for Insight, error, or rare event capture.
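Cost weighting and Bayesian thinking combine naturally in a decision rule. The sketch below is one possible formulation, with hypothetical numbers throughout: Bayes' rule turns a detector signal into a probability, and the action is taken only when the expected cost of ignoring the signal exceeds the expected cost of acting on it.

```python
def posterior(prior, sensitivity, specificity):
    """P(event | positive signal) by Bayes' rule."""
    true_pos = prior * sensitivity
    false_pos = (1 - prior) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

def should_act(p_event, cost_false_negative, cost_false_positive):
    """Act when ignoring the signal has the higher expected cost."""
    expected_cost_ignore = p_event * cost_false_negative
    expected_cost_act = (1 - p_event) * cost_false_positive
    return expected_cost_ignore > expected_cost_act

# Hypothetical numbers: a 1-in-1,000 event and a detector that is 99% sensitive
# and 99% specific.
p = posterior(prior=0.001, sensitivity=0.99, specificity=0.99)

# A miss that costs 500 times as much as a false alarm justifies acting;
# with equal costs, the same signal does not.
act_if_miss_is_costly = should_act(p, cost_false_negative=500.0, cost_false_positive=1.0)
act_if_costs_equal = should_act(p, cost_false_negative=1.0, cost_false_positive=1.0)

print(f"P(event | signal) = {p:.1%}")
print(act_if_miss_is_costly, act_if_costs_equal)
```

Even a 99%-accurate detector yields a posterior of only about 9% here, so the decision is driven by the cost ratio rather than by the detector's headline accuracy.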
The Weak Link
No matter how good your automated system is, the final decision to act or not is often made by a human!
Present only relevant data to make the right decision
Actionable information
Present data in human readable format.
Visualization! Be creative, different.
Detecting Extremely Rare Events
Data Collection
• Capture high SNR representative data
Pre-process
• Clean the data from known noise and artifacts
Feature Extraction
• Reduce data to meaningful features with no loss of desired signal
Classifier
• Divide data into training/testing and use appropriate classifiers
• Always use feature confidences
Identify Outliers
• Data that is not ‘normal’
• Determine if physically appropriate or measurement error. Delete errors.
Explain Data
• Any Insights? What does it mean, sub-category classification
Present Data
• Visualization, UI, UX
Conclusion
• Experiment, test. Iterate.
• Do your homework, learn the physical origin of your data.
• Pre-process data.
• Develop your own method. All methods have strengths and weaknesses; learn to combine them.
• Know your customer. Stay focused on the ‘Question’.
• Simplify. It needs to run in the real world.
• Avoid bias: biased samples lead to biased outcomes.
• Never use the test and validation data in the training phase, not even for scaling purposes.
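The scaling point in the last bullet is easy to get wrong, so here is a minimal sketch of the leak-free order of operations: split first, learn the scaling statistics from the training portion only, then apply those same statistics to the test portion. The data, the split sizes, and the helper names are assumptions for illustration.

```python
import random
import statistics

def fit_scaler(values):
    """Learn the mean and standard deviation from training data only."""
    return statistics.fmean(values), statistics.pstdev(values)

def apply_scaler(values, mean, std):
    """Standardize values using previously learned statistics."""
    return [(v - mean) / std for v in values]

rng = random.Random(0)
data = [rng.gauss(50, 10) for _ in range(100)]  # hypothetical measurements

# Split first, so the test data cannot influence anything learned later.
train, test = data[:80], data[80:]

mean, std = fit_scaler(train)                  # statistics come from train alone
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)    # reuses the training statistics

print(f"train mean after scaling: {statistics.fmean(train_scaled):.3f}")
print(f"test mean after scaling:  {statistics.fmean(test_scaled):.3f}")
```

The scaled test set will generally not have a mean of exactly zero, and that is the point: it is being treated the way unseen data would be treated in deployment.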
Thank you.
www.Quanttus.com
@Quanttus
www.facebook.com/Quanttus