JUMPING TO CONCLUSIONS (Generating Improbable Insights)
Richard Robehr Bijjani, Ph.D.
Open Data Science Conference, Boston 2015
@opendatasci

Page 1: Jumping to Conclusions

Richard Bijjani

JUMPING TO CONCLUSIONS (Generating Improbable Insights)

Richard Robehr Bijjani, Ph.D.

OPEN DATA SCIENCE CONFERENCE

BOSTON 2015

@opendatasci

Page 2: Jumping to Conclusions
Page 3: Jumping to Conclusions

• IMAGE: Chart of controlled vs uncontrolled?

• How best to change behaviors

1/3 of all deaths globally are from cardiovascular disease

SOURCE: WORLD HEALTH ORGANIZATION

Page 4: Jumping to Conclusions

• IMAGE: Chart of controlled vs uncontrolled?

• How best to change behaviors

#1 risk factor is high blood pressure

SOURCE: Mayo Clinic

Page 5: Jumping to Conclusions

1,000,000,000

SOURCE: WORLD HEALTH ORGANIZATION

Page 6: Jumping to Conclusions

CONTINUOUS PASSIVE

CLINICALLY MEANINGFUL

BEHAVIORAL INSIGHTS

Page 7: Jumping to Conclusions

CONTEXTUALIZED CARDIOVASCULAR HEALTH

Page 8: Jumping to Conclusions

CONTEXTUALIZED HUMAN HEALTH

Page 9: Jumping to Conclusions

Quanttus is always on!

We capture > 50 million data points and > 400,000 vital sign measurements / person / day.

Page 10: Jumping to Conclusions

Data Science @ Quanttus

Data Science ≡ Extraction of Actionable Knowledge from Data

Actionable Knowledge

Better Decisions

Meaningful Insights

Knowledge is actionable iff it has predictive power (not just an ability to explain the past)

Page 11: Jumping to Conclusions

JUMPING TO CONCLUSIONS

Illusion of Knowledge

fatal

The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge.

-Stephen Hawking

Page 12: Jumping to Conclusions

Illusion of Knowledge

Courtesy of National Geographic

Page 13: Jumping to Conclusions

First a Joke!

A police officer approaches a man intently searching the ground under a lamppost

• Policeman: What are you doing?

• Man: Looking for my car keys

The officer helps for a few minutes without success

• Policeman: Are you certain you dropped your keys near here?

• Man: No! I remember dropping them across the street.

• Policeman (very irritated): Why are you looking for them here, then?

• Man: The light is much better here!

Page 14: Jumping to Conclusions

Why Scientific Studies are so often Wrong

• Researchers tend to look for answers where the looking is good, rather than where the answers are likely to be hiding. - David Freedman

• 15/45 most prominent studies published in the top medical journals were ultimately refuted.

• 2/3 of all medical studies are wrong.

• 9/10 of leading-edge studies (like those linking a disease to a specific gene) are wrong.

- John Ioannidis, University of Ioannina

Page 15: Jumping to Conclusions

Why Scientific Studies are so often Wrong

10% to 20% of cases: delayed, missed, and incorrect diagnosis

Garber et al., JAMA, 2005

Page 16: Jumping to Conclusions

Why Scientific Studies are so often Wrong

40,000+ patients in US ICUs may die with a misdiagnosis annually

Winters et al., BMJ Quality & Safety, 2012

Page 17: Jumping to Conclusions

Why Scientific Studies are so often Wrong

50% of MDs are below average. - Vinod Khosla

Page 18: Jumping to Conclusions

Are you Immune to the Streetlight Effect?

• Think of the data you are working with: is it the ideal data, or just the conveniently available data?

• When was the last time you worked with ideal data?

• Have you ever?

Page 19: Jumping to Conclusions

Expert Consensus

[Figure: seventeen experts’ estimates of the effect of screening on colon cancer deaths. Axis: proportion of colon cancer deaths prevented, 0% to 100%.]

Page 20: Jumping to Conclusions

Should you trust your Dr.?

• Depends.

• If your ailment is common, your Dr. will do a decent job.

• If you’re suffering from a relatively uncommon disease, not so well.

“If you don’t find it often, you often don’t find it.” - Jeremy Wolfe

Page 21: Jumping to Conclusions

Weak Link, Humans!

1. Signals with low predictive value are not very useful

   1 in 1000 does not hold one’s attention for long

2. Attention directed at one thing is attention drawn away from something else

   Lost research/testing/treatment opportunities

Page 22: Jumping to Conclusions

Data Scientists’ Tools of Choice

• Some scientists use only techniques they feel comfortable with

• Others latch on to new ones without fully understanding them.

• Some just rely on available methods built into their software.

Page 23: Jumping to Conclusions

The Nonsense Asymmetry Principle

The amount of energy needed to refute ‘Nonsense’ is an order of magnitude bigger than to produce it. - Alberto Brandolini

Page 24: Jumping to Conclusions

Data Scientific Method

• Validation can ONLY occur by measuring the predictive power of the insights, in addition to their ability to explain the past

• Data Science is Science and hence follows the Scientific Method

[Flowchart]

1. Ask Relevant Questions

2. Research. Gather Data.

3. Validate Data. Construct Hypothesis.

4. Design Experiment. Test Hypothesis.

5. Analyze Results

6. Hypothesis is True / Hypothesis not Valid

7. Report Results

Page 25: Jumping to Conclusions

Ask the Important Question

• Deadly Virus

• Infects 1 in 1 million

• Diagnostic Test developed with 99.9% Sensitivity and Specificity

• Treatment developed
  • 99% Curative
  • 1% Deadly side effect

• Question: Would you recommend Diagnosis?
  • Would you recommend Treatment?

Page 26: Jumping to Conclusions

Efficacy of Treatment

300M test subjects

            Predicted Negative   Predicted Positive
Normal      299.7M               300,000
Infected    <1                   >299

• Do Nothing:
  • 300 People Infected
  • 300 People will Die

• Diagnose and Treat:
  • Infected Population:
    • 296 Cured
    • 3 + 1 Die
  • Non-Infected Population:
    • 297,000 Unaffected (except for the scare)
    • 3,000 Die
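
The arithmetic behind the table and the two scenarios can be reproduced with a short sketch. The function name `screening_outcomes` and its parameter names are illustrative and not from the slides. It assumes, as the slide appears to, that every untreated infection is fatal and that the 1% side-effect mortality applies to everyone who is treated.

```python
def screening_outcomes(population=300_000_000, prevalence=1e-6,
                       sensitivity=0.999, specificity=0.999,
                       side_effect_death_rate=0.01):
    """Expected counts for a test-and-treat policy (illustrative sketch)."""
    infected = population * prevalence        # about 300 people
    healthy = population - infected
    true_pos = infected * sensitivity         # infected and flagged by the test
    false_neg = infected - true_pos           # infected but missed by the test
    false_pos = healthy * (1 - specificity)   # healthy but flagged by the test
    # Positive predictive value: chance that a positive result is a real infection.
    ppv = true_pos / (true_pos + false_pos)
    # Assumption: an untreated infection is always fatal.
    deaths_do_nothing = infected
    # Assumption: everyone who tests positive is treated, and 1% of them die.
    treated = true_pos + false_pos
    deaths_treat_all = false_neg + treated * side_effect_death_rate
    return {
        "false_pos": false_pos,
        "ppv": ppv,
        "deaths_do_nothing": deaths_do_nothing,
        "deaths_treat_all": deaths_treat_all,
    }


if __name__ == "__main__":
    for name, value in screening_outcomes().items():
        print(f"{name}: {value:,.4f}")
```

With the slide's numbers the positive predictive value is about 0.1%, so treating every positive result costs roughly 3,000 expected deaths against 300 for doing nothing.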

Page 27: Jumping to Conclusions

Good Practices

• Understand where the data comes from.

• Pre-process / Clean your data, but keep validated outliers.

• Own the tools and adapt them to your own requirements.

• Follow the Scientific Method.

• Analyze data to answer the Question posed.

• Save a list of other interesting questions for later.

• Share your hypotheses with the team.

• Simple is better; at least make sure it’s deployable.

• Test, Validate, re-test.

• Communicate results correctly and set the right expectations.

"If you torture the data long enough, it will confess to anything." - Hal Varian

Page 28: Jumping to Conclusions

Page 29: Jumping to Conclusions

Pitfalls of data mining

• The hope: data miners pore over large, diffuse sets of raw data trying to discern patterns that would otherwise go undetected.

• The dark side of data mining is to pick and choose from a large set of data to try to explain a small one.

• “Given enough time, enough attempts and enough imagination, almost any conclusion can be teased out of any set of data.”
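
The "pick and choose" pitfall can be illustrated with pure noise. The sketch below is an assumption-laden toy, not something from the talk: it draws a small random target, then searches many equally random candidate features for the one that correlates best with it. The names `pearson` and `best_spurious_correlation` and the sample sizes are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def best_spurious_correlation(n_samples=10, n_features=1000, seed=0):
    """Largest |correlation| found between a noise target and noise features."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, target)))
    return best


if __name__ == "__main__":
    # Every series is independent noise, yet the best match looks convincing.
    print(f"best |r| among noise features: {best_spurious_correlation():.2f}")
```

Because nothing here is related to anything else, any strong correlation the search reports comes purely from the number of attempts, which is why a pattern found this way should be tested on fresh data before it is believed.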

Page 30: Jumping to Conclusions

Limitations of Common Data Mining Techniques

• Automated feature selection methods cannot apply to rare (or unforeseen) events

• Normal events are similar, rare events are by definition unique

• Accuracy measurements are not appropriate

• Real-time detection of rare events is necessary, but machine learning techniques construct models based on the past

• If you haven’t seen it yet, you cannot detect it!
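
A small sketch of why accuracy is a poor yardstick for rare events: a detector that never raises an alarm scores very high accuracy while finding nothing. The 1-in-1000 event rate and the helper name `accuracy_and_recall` are assumptions made for illustration.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall) for binary labels where 1 marks the rare event."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_pos = sum(y_true)
    accuracy = correct / len(y_true)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# One rare event in 1000 cases, and a "detector" that always predicts normal.
y_true = [1] + [0] * 999
y_pred = [0] * 1000
accuracy, recall = accuracy_and_recall(y_true, y_pred)

if __name__ == "__main__":
    print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
```

Recall, precision, or a cost-weighted measure says far more here than accuracy does.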

Page 31: Jumping to Conclusions

What are rare/high-value events?

• Rare or Outliers
  • Occurs less than 1%
  • For large datasets, many samples exist. Balance could be achieved and traditional Data Mining Techniques could be applied
    • Preferential sampling of rare class
    • Under-Sampling of majority class

• Extremely Rare or Anomalies
  • Statistical chance of detection is zero
  • Most databases don’t ‘naturally’ contain any samples
  • Properties of target samples are not known
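
For the merely rare case, under-sampling the majority class might look like the sketch below. It is a minimal illustration assuming binary labels held in plain Python lists. The function name `undersample_majority` and its `ratio` parameter are hypothetical.

```python
import random


def undersample_majority(samples, labels, rare_label=1, ratio=1.0, seed=0):
    """Keep every rare sample and a random subset of the majority class.

    `ratio` is the number of majority samples kept per rare sample.
    """
    rng = random.Random(seed)
    rare = [(s, y) for s, y in zip(samples, labels) if y == rare_label]
    common = [(s, y) for s, y in zip(samples, labels) if y != rare_label]
    n_keep = min(len(common), int(len(rare) * ratio))
    balanced = rare + rng.sample(common, n_keep)
    rng.shuffle(balanced)  # avoid leaving all rare samples at the front
    return balanced


if __name__ == "__main__":
    samples = list(range(1000))
    labels = [1 if i < 10 else 0 for i in samples]  # 1% rare class
    balanced = undersample_majority(samples, labels)
    print(len(balanced), sum(y for _, y in balanced))
```

This only works when the rare class has enough real samples to begin with, so it does not help in the extremely rare case where the data contains none.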

Page 32: Jumping to Conclusions

What are the costs of such events?

Cost Functions not easily Defined

Page 33: Jumping to Conclusions

Anomalies vs. High Value Rare Events

By definition, anomalies are the exception, but not necessarily rare and/or of high value.

• Anomaly? Yes

• Rare Event? No

Page 34: Jumping to Conclusions

Extremely Rare, High Value Events

Case Study: Terrorism, specifically explosive detection

Page 35: Jumping to Conclusions

Finding Commercial Explosives

Data could be collected and/or simulated, allowing for rare-class augmentation.

Page 36: Jumping to Conclusions

Finding Explosives

Data cannot be collected and/or simulated.

Page 37: Jumping to Conclusions

Why incompatible?

• No Quality Control

Suicide Bomb Trainer in Iraq Accidentally Blows Up His Class

Terrorist ‘lab’ (redacted)

Page 38: Jumping to Conclusions

The Quanttus Vision

Page 39: Jumping to Conclusions

The Quanttus Vision

Page 40: Jumping to Conclusions

Takeaway

• We are drowning in data, yet starving for knowledge

• In the case of rare events, data may not be enough; the source of the data needs to be well understood

• To detect rare events, sometimes it’s just more effective to generate heuristics

• Heuristics cannot predict, while machine learning assumes the future will resemble the past, and extremely rare events are not part of the past

• What to do?

Page 41: Jumping to Conclusions

Outliers revisited

1. Retain outliers in data set for analysis.

2. Exclude only those that are known to be due to defective measurements or transcription errors

   a. Need to understand data origin to accurately separate rare events from measurement errors

3. Do not assume normal distribution
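
One way to act on these points is to flag outliers with a rule that does not rely on a normal distribution, and to keep the flagged points in the data set until their origin has been checked. The sketch below uses Tukey's interquartile-range fences. That choice, the name `flag_outliers_iqr`, and the multiplier `k=1.5` are assumptions for illustration, not the speaker's method.

```python
import statistics


def flag_outliers_iqr(values, k=1.5):
    """Pair each value with True if it lies outside the Tukey IQR fences.

    Nothing is removed: flagged points stay in the data for later review.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(v, v < low or v > high) for v in values]


if __name__ == "__main__":
    readings = [10, 11, 12, 11, 10, 12, 11, 95]
    for value, is_outlier in flag_outliers_iqr(readings):
        print(value, "OUTLIER" if is_outlier else "")
```

Whether a flagged reading is a measurement error or a genuine rare event still has to be decided from knowledge of how the data was produced.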

Page 42: Jumping to Conclusions

Data Mining Techniques

• Supervised

• Pro: Human readable models

• Con: Requires labeled data

• Unsupervised

• Pro: Deviation detection, no labeling needed

• Con: Requires similarity measures, high false alarm (due to benign yet previously unseen data)

Page 43: Jumping to Conclusions

Unsupervised Techniques

• Outlier datum defined as different from the rest of the data

• Rare event: Same definition

• Detection Approaches
  • Statistics Based
  • Distance Based
  • Model Based

Page 44: Jumping to Conclusions

Unsupervised: Statistics

• Data modeled using stochastic distribution

• Advantages: no a priori knowledge required

• Disadvantages:
  • Fails with high dimensions (curse of dimensionality)
  • Does not identify patterns of rare events

• Sample implementations:
  • Finite Mixtures Schemes, e.g. SmartSifter. Use histogram density to represent probability distribution
  • Blocked Adaptive Computationally Efficient Outlier Nominator, BACON
  • Probability Distributions
  • Entropy Measures
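
The histogram-density idea can be sketched in a few lines: estimate how often each bin was seen, then score a new value by how improbable its bin is. This is a deliberately simplified fixed-bin, one-dimensional illustration and not the SmartSifter algorithm itself. The name `histogram_surprise`, the bin width, and the add-one smoothing are all assumptions.

```python
import math
from collections import Counter


def histogram_surprise(train, bin_width=1.0):
    """Build a scorer returning -log(estimated probability) of a value's bin."""
    counts = Counter(math.floor(x / bin_width) for x in train)
    total = len(train)
    n_bins = len(counts)

    def score(x):
        seen = counts.get(math.floor(x / bin_width), 0)
        # Add-one smoothing so that never-seen bins get a small, nonzero probability.
        p = (seen + 1) / (total + n_bins + 1)
        return -math.log(p)

    return score


if __name__ == "__main__":
    score = histogram_surprise([10.1, 10.4, 10.7, 11.2, 11.5, 10.9, 11.1, 10.3])
    print(f"typical value: {score(10.5):.2f}  unseen value: {score(25.0):.2f}")
```

The number of bins grows exponentially with the number of dimensions, which is the curse-of-dimensionality disadvantage listed on the slide.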

Page 45: Jumping to Conclusions

Unsupervised: Distance

• Distance computed between neighbors, and data points sorted

• Advantages: no a priori knowledge required

• Disadvantages:
  • Not suitable for rare classes

• Sample implementations:
  • k-Nearest Neighbor
  • Mahalanobis Distance for skewed distributions
  • Local Outlier Factor (LOF) for variable-density clusters (average distance between points is different in different clusters)
  • Specialized Clustering, Canopy, FindOut
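
A minimal sketch of the k-nearest-neighbor variant: score each point by the distance to its k-th nearest neighbor, then sort so the most isolated points come first. The brute-force search, the function name `knn_outlier_scores`, and the default `k=3` are illustrative assumptions.

```python
import math


def knn_outlier_scores(points, k=3):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
    scores = knn_outlier_scores(points)
    # Sort points from most to least isolated.
    for score, point in sorted(zip(scores, points), reverse=True):
        print(f"{point}: {score:.2f}")
```

A single global distance score like this one struggles when clusters have different densities, which is the situation LOF is listed for.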

Page 46: Jumping to Conclusions

Unsupervised: Model

• Predict normal behavior via model

• Capture deviations

• Detection Approaches
  • Neural Networks, 4 layers, input = output
  • Unsupervised support vector machines (SVM)
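
The "predict normal behavior, capture deviations" pattern can be shown with a far simpler model than the ones listed. The sketch below swaps in a least-squares line for the neural network or SVM, purely to keep the example short. The names `fit_line` and `deviation_detector` and the `tolerance` factor are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def deviation_detector(xs, ys, tolerance=3.0):
    """Learn normal behavior from (xs, ys) and return a deviation test.

    A new point is a deviation when its prediction error exceeds
    `tolerance` times the largest error seen on the normal data.
    """
    slope, intercept = fit_line(xs, ys)
    errors = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    threshold = tolerance * max(errors)

    def is_deviation(x, y):
        return abs(y - (slope * x + intercept)) > threshold

    return is_deviation


if __name__ == "__main__":
    is_deviation = deviation_detector([0, 1, 2, 3, 4, 5],
                                      [0.1, 0.9, 2.1, 2.9, 4.1, 4.9])
    print(is_deviation(6, 6.0), is_deviation(6, 12.0))
```

An input = output neural network follows the same pattern, with reconstruction error playing the role of the prediction error used here.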

Page 47: Jumping to Conclusions

Supervised Techniques

• Classification methods typically not suitable:
  • Problem: Lack of labels
  • Possible Solution: Balance the class size

• Duplicate rare events or down-size normal events

• Generate anomalies inversely proportional with data density

• Synthetically generate minority over-sampled events (e.g. SMOTE)

• Classify regions as ‘positive’ without having enough data in them

• Shrink: look for presence of positive labels, not majority

• PN-rule: Find regions of high recall (Pd), then prune false positives, then classify (avoid over-fitting)

• Decision Tree methods: Ripple Down Rules, CREDOS, Boosting Classifiers, Random Forest
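
The synthetic over-sampling idea can be sketched as interpolation between a minority sample and one of its nearest minority neighbors. This is a simplified, SMOTE-style illustration rather than a faithful implementation of the published algorithm. The name `smote_like` and its parameters are hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority samples and their neighbors.

    `minority` is a list of equal-length numeric tuples with at least 2 entries.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbors)
        t = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic


if __name__ == "__main__":
    rare = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    for point in smote_like(rare, n_new=5):
        print(point)
```

Synthetic points only fill in the region already covered by known rare samples, so they cannot stand in for rare events of a kind that has never been observed.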

Page 48: Jumping to Conclusions

Cost Functions

• In any classification problem, one needs to minimize the cost function

• Selecting an appropriate cost function is key

• Weighting is also important, not all data points are created equal.

• Bayesian Thinking is necessary

• Temporal (time-series) Analysis requires different approaches.
  • Is current data ‘surprising’ based on historical data created with the same underlying process?
  • Opportunity for Insight, error, or rare event capture.
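
To make the role of the cost function concrete, the sketch below picks the action with the lower expected cost given an estimated probability of the rare event. The two-action framing, the function name `min_expected_cost_action`, and the cost values in the usage lines are assumptions chosen for illustration.

```python
def min_expected_cost_action(p_event, cost_false_alarm, cost_miss):
    """Choose 'alert' or 'ignore' by comparing expected costs.

    p_event is the estimated (posterior) probability that the event is real.
    """
    expected_cost_alert = (1 - p_event) * cost_false_alarm
    expected_cost_ignore = p_event * cost_miss
    return "alert" if expected_cost_alert < expected_cost_ignore else "ignore"


if __name__ == "__main__":
    # Same 1% probability, different consequences of missing the event.
    print(min_expected_cost_action(0.01, cost_false_alarm=1, cost_miss=1000))
    print(min_expected_cost_action(0.01, cost_false_alarm=1, cost_miss=1))
```

With unequal costs the decision threshold moves away from 50%, so a low-probability event can still justify action when missing it is expensive enough.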

Page 49: Jumping to Conclusions

The Weak Link

No matter how good your automated system is, the final decision to act or not is often made by a human!

Present only relevant data to make the right decision

Actionable information

Present data in human readable format.

Visualization! Be creative, different.

Page 50: Jumping to Conclusions

Detecting Extremely Rare Events

Data Collection
• Capture high SNR representative data

Pre-process
• Clean the data from known noise and artifacts

Feature Extraction
• Reduce data to meaningful features with no loss of desired signal

Classifier
• Divide data into training/testing and use appropriate classifiers
• Always use feature confidences

Identify Outliers
• Data that is not ‘normal’
• Determine if physically appropriate or measurement error. Delete errors.

Explain Data
• Any Insights? What does it mean, sub-category classification

Present Data
• Visualization, UI, UX,

Page 51: Jumping to Conclusions

Conclusion

• Experiment, test. Iterate.

• Do your homework, learn the physical origin of your data.

• Pre-process data.

• Develop your own method; all methods have weaknesses and strengths, learn to combine them.

• Know your customer. Stay focused on the ‘Question’.

• Simplify. It needs to run in the real world.

• Avoid Bias: Biased Samples, Biased Outcome

• Never use the test and validation data in the training phase, not even for scaling purposes.
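
The last point, about scaling, is easy to get wrong, so here is a minimal sketch: the mean and standard deviation are computed from the training data only and then reused unchanged on the test data. The helper name `fit_scaler` and the sample values are hypothetical.

```python
import statistics


def fit_scaler(train):
    """Return a scaling function whose parameters come from `train` alone."""
    mean = statistics.fmean(train)
    std = statistics.pstdev(train) or 1.0  # guard against constant features

    def scale(values):
        return [(v - mean) / std for v in values]

    return scale


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [6.0, 100.0]

scale = fit_scaler(train)      # parameters never see the test data
train_scaled = scale(train)
test_scaled = scale(test)      # test data is transformed, never fitted

if __name__ == "__main__":
    print(train_scaled)
    print(test_scaled)
```

Fitting the scaler on the combined data would let the extreme test value shift the mean and spread, leaking information about the test set into training.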

Page 52: Jumping to Conclusions

Thank you.

www.Quanttus.com

@Quanttus

www.facebook.com/Quanttus