Infovision 2011 Data to Decisions Shailesh Kumar, Google http://informationexcellence.wordpress.com/category/knowledge-share-sessions/ http://informationexcellence.wordpress.com/2011/10/28/infovision2011-presentations/
From Data to Decisions: Learnings from Real-World Data Mining
Dr. Shailesh Kumar Google, Inc.
InfoVision 2011
Welcome to the Information Age … drowning in data and starving for knowledge
ATATTAGGTTTTTACCTACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTAGCTGTCGCTCGGCTGCATGCCTAGTGCACCTACGCAGTATAAACAATAATAAATTTTACTGTCGTTGACAAGAAACGAGTAACTCGTCCCTCTTCTGCAGACTGCTTATTACGCGACCGTAAGCTAC…
This data explosion is enabled by…
Better “Sensors” – Higher Resolution, More Spectral Bands, Quick Experimental Turnaround, Crowd Sourcing…
Higher Bandwidth Communication – Faster Networks and Routers, Better Compression technologies…
Larger Warehouses – Cheaper Storage, Multi-Level Caching, Scalable Database/Data warehousing technologies…
Massive Crunching Power – Faster Multi-core processors, Parallel Distributed Computing, MapReduce paradigms…
Advances in Machine Learning and Data Mining – Sophisticated Learning frameworks, Distributed Data Mining…
From “Data” to “Decision”
[Figure: pipeline from Data through Insights, Features, Models, and Predictions to Decision, with Domain Knowledge, Business Objectives, and Business Constraints as inputs and a Feedback loop.]
Observation, Prediction, Decision
Credit Card Fraud. Input: Past card usage behavior. Predict: Fraudulent transaction? Decide: Approve Transaction?
Credit Scoring. Input: Past payment behavior. Predict: Probability of Default. Decide: Approve Loan?
Retail Cross Sell. Input: Past purchase behavior. Predict: Response to a coupon. Decide: Send Coupon?
Building Machine Learning Models The Process, the Art, and the Science
Collect Raw (Input) Data
Collect Target (Output) Labels (“ground truth”): Can be Costly!!
Choose: “Model Type” & “Model Complexity”. Too Simple: Under-Learn. Too Complex: Over-Learn. (Bias-Variance Tradeoff)
Engineer and Select “Predictive” features: • Use Domain Knowledge • Keep variability that matters • Remove Redundancy
“Train” a model using Feature-Label training data set
“Evaluate” the trained model on “validation” data and iterate until satisfied
“Deploy” the model: Predict class label of all the “un-labeled” data
Lessons from Real-world Data Mining
Insights
Features
Labels
Models
Decisions
Looking for a Needle in a Haystack?
What is the nature of my haystack (data)? What process generated the data? What assumptions am I making about the data?
Is it the right needle (insight) to look for? Is it “actionable”? Is it “useful”? Is it “novel”? Does it tell me something I didn’t know?
Insight Discovery ≠ Hypothesis Testing
The Traditional Market Basket Analysis Wrong needle in a mysterious haystack!
FREQUENT ITEM-SETS (Size = 1) → CANDIDATE ITEM-SETS (Size = 2) → FREQUENT ITEM-SETS (Size = 2) → CANDIDATE ITEM-SETS (Size = 3) → FREQUENT ITEM-SETS (Size = 3)
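The level-wise search in the diagram above can be sketched in a few lines. This is a minimal Apriori-style illustration, not the speaker's implementation: the function name `frequent_itemsets`, the `min_support` threshold, and the toy baskets are all assumptions made for the example.

```python
from itertools import combinations

def frequent_itemsets(baskets, min_support):
    """Level-wise search: frequent k-item-sets seed the (k+1)-item candidates."""
    baskets = [frozenset(b) for b in baskets]

    def support(itemset):
        # Number of baskets that contain every item in the set
        return sum(1 for b in baskets if itemset <= b)

    items = sorted({item for b in baskets for item in b})
    frequent = [frozenset([i]) for i in items
                if support(frozenset([i])) >= min_support]
    result = {}
    size = 1
    while frequent:
        for s in frequent:
            result[s] = support(s)
        size += 1
        # Candidates: unions of two frequent sets that have exactly `size` items
        candidates = {a | b for a, b in combinations(frequent, 2)
                      if len(a | b) == size}
        # Keep a candidate only if all its smaller subsets were frequent
        # and it meets the support threshold itself
        previous = set(frequent)
        frequent = [c for c in candidates
                    if all(frozenset(sub) in previous
                           for sub in combinations(c, size - 1))
                    and support(c) >= min_support]
    return result

# Hypothetical baskets, for illustration only
baskets = [
    {"tent", "airbed", "lantern"},
    {"tent", "airbed"},
    {"tent", "lantern"},
    {"airbed", "grill"},
]
found = frequent_itemsets(baskets, min_support=2)
```

Here the pair airbed + lantern appears in only one basket, so the three-item camping set is never even counted. That is the weakness the next slides point at: a "logical" group that is rarely bought together is invisible to a frequency threshold.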
Lesson: Know your data (Haystack) What process generated the data?
Each basket is a mixture of projections of latent customer intentions. Customers may:
already have other products
buy them from another retailer
buy them at a different time
have got them as gifts
….
Few buy a complete “logical” product group in the same basket
Lesson: Extract the essence, let go of data Pair-wise Co-occurrence Statistics
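One way to read "extract the essence" is to reduce the raw baskets to item counts and pair counts, then work only with those. The sketch below does that and scores a pair with point-wise mutual information. The choice of PMI as the consistency measure is an assumption for illustration; the slide only says "pair-wise co-occurrence statistics". The baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log

def cooccurrence_stats(baskets):
    """Count how often each item, and each unordered pair of items, occurs."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # sorted so each pair has one canonical key
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return item_counts, pair_counts

def pmi(pair, item_counts, pair_counts, n_baskets):
    """Point-wise mutual information of a pair (assumed measure, see lead-in)."""
    if pair_counts[pair] == 0:
        return float("-inf")
    a, b = pair
    p_ab = pair_counts[pair] / n_baskets
    p_a = item_counts[a] / n_baskets
    p_b = item_counts[b] / n_baskets
    return log(p_ab / (p_a * p_b))

# Hypothetical baskets, for illustration only
baskets = [
    {"tent", "airbed", "lantern"},
    {"tent", "airbed"},
    {"tent", "lantern"},
    {"airbed", "grill"},
]
item_counts, pair_counts = cooccurrence_stats(baskets)
score = pmi(("airbed", "tent"), item_counts, pair_counts, len(baskets))
```

Once the pair statistics are computed, the baskets themselves are no longer needed, which is the sense in which the data can be "let go".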
Lesson: Look for the right Insight “Frequent” vs. “Logical” Itemset
Novel – Not obvious from the data (support = 0)
Useful – product bundling, recommendations, layout
Exhaustive – “No insight left behind!” – however “rare”
[Figure: example logical itemsets. Labels shown: Airbeds, Lighting, Folding Furniture, Camping Accessories, Grill Accessories, Inflatables, Water Sports, Lighting, Patio Accessories, Furniture; and Projection TV, Flat Panel TV, Home Theatre Services, Digital Cable TV, Home Components, Speakers.]
Lessons from Real-world Data Mining
Insights
Features
Labels
Models
Decisions
Two Mindsets to Modeling
Model-Centric • Throw all features in! • Have enough data • Build Complex models
Feature-centric • Carefully craft features • Use Domain Knowledge • Build Simpler Models
Simple Features + Complex Model vs. Complex Features + Simple Model
The Law of Conservation of Complexity
Lesson: Distribute Complexity well Simplify Models with complex features
Lesson: Overcome model limitations
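[Figure: decision tree on raw features, with splits Age < 60, Income < Rs. 32, and Education < 20, beside a plot of Income against Age.]
[Figure: decision tree on a derived feature, with splits Education < 20 and log(Income) - B x Age < 12, beside a plot of log(Income) against Age.]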
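A tree that splits on one raw feature at a time can only draw axis-parallel boundaries, so a slanted boundary in the Income/Age plane needs many splits. Adding the combined feature lets a single threshold do the job. The sketch below computes that feature. The value of `B` is hypothetical (the slide leaves it unspecified), and the record field names are assumptions.

```python
from math import log

B = 0.05  # hypothetical coefficient; the slide does not give a value for B

def add_derived_feature(record, b=B):
    """Return a copy of the record with log(income) - b * age added."""
    derived = dict(record)
    derived["log_income_minus_b_age"] = log(record["income"]) - b * record["age"]
    return derived

def below_threshold(record, threshold=12.0):
    """One split on the derived feature replaces several splits on raw features."""
    return record["log_income_minus_b_age"] < threshold

# Hypothetical applicant, for illustration only
applicant = add_derived_feature({"income": 1_000_000, "age": 40, "education": 16})
flagged = below_threshold(applicant)
```

The complexity has moved from the model (a deep tree) into the feature (a log transform and a linear combination), which is the trade the "conservation of complexity" slide describes.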
Lessons from Real-world Data Mining
Insights Text
Features
Labels
Models
Decisions
Lesson: Things are not what they appear What is a word in “Bag-of-Words”?
Segmentation: What is a word? Is “New York Stock Exchange” 4 words, 2 phrases (“New York”, “Stock Exchange”), or 1 phrase?
Disambiguation: What does a word mean? ‘rock band’, ‘rock climbing’, ‘rocking chair’, ‘the rock’
Equivalencing: How “similar” are two terms? Comparing Apples to Oranges… Orange Juice, Orange Flag, Orange Blog, Apple store, Apple pie, The Big Apple
Equivalencing: “we filed a suit charging dell of illegal behavior” / “they submitted a case accusing apple of unauthorized conduct”
Disambiguation: “i was right to avoid a suit against apple” / “on my right was a man in a suit drinking apple juice”
You shall know a word by the company it keeps -- Firth, J. R. 1957:11
SIMILARITY = 0.995
SIMILARITY = 0.171
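Firth's idea can be made concrete by describing each occurrence of a word by the words around it and comparing those descriptions. The sketch below uses raw counts in a two-word window and cosine similarity. Both choices are assumptions for illustration, and the toy does not reproduce the similarity figures on the slide, which presumably come from statistics over a much larger corpus.

```python
from collections import Counter
from math import sqrt

def context_vector(tokens, target, window=2):
    """Count the words within `window` positions of each occurrence of target."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            vec.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# The two "suit" sentences from the disambiguation slide
legal = "i was right to avoid a suit against apple".split()
clothing = "on my right was a man in a suit drinking apple juice".split()

suit_legal = context_vector(legal, "suit")
suit_clothing = context_vector(clothing, "suit")
similarity = cosine(suit_legal, suit_clothing)
```

With only two sentences the contexts overlap on "a" and "apple", so the score is dominated by noise. The same computation over millions of sentences is what separates the legal sense of "suit" from the clothing sense.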
Lessons from Real-world Data Mining
Insights
Features
Labels
Models
Decisions
Labels are precious – use them well Labeled data vs. Unlabeled data
Lots of input data! (e.g. web pages) Small fraction is labeled! (e.g. spam/not)
Labels can be:
Costly – human judgments, costly experiments, rare events
Noisy – web clicks, crowd sourced,…
How do we use unlabeled data with labeled data? Semi-supervised Learning
Which unlabeled data point to get labeled next? Active Learning
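For the "which point next?" question, one common heuristic is uncertainty sampling: ask for the label of the point the current model is least sure about. The sketch below is only an illustration of that heuristic, not necessarily the strategy the talk had in mind. `toy_model` is a hypothetical stand-in for a trained binary classifier.

```python
from math import exp

def toy_model(x):
    """Stand-in classifier (hypothetical): probability of the positive class."""
    return 1.0 / (1.0 + exp(-x))

def most_uncertain(pool, predict_proba):
    """Pick the unlabeled point whose predicted probability is closest to 0.5."""
    return min(pool, key=lambda x: abs(predict_proba(x) - 0.5))

# Hypothetical pool of unlabeled one-dimensional points
pool = [-3.0, -0.2, 1.5, 4.0]
next_query = most_uncertain(pool, toy_model)
```

In a full loop the chosen point would be sent for labeling, added to the training set, and the model retrained before the next query, so each costly label is spent where it changes the model most.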
Lessons from Real-world Data Mining
Insights
Features
Labels
Models
Decisions
Lesson: Don’t beat data into submission Model Complexity no more than necessary
How many hidden units in a neural network? How deep a decision tree? How much cost for “misclassification elasticity” in SVM? How many clusters? or modes in mixture of density?
Model is too simple → under-learn
Model is too complex → memorize
Model is just right → generalize
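The three regimes above show up in a small experiment with a k-nearest-neighbour regressor, where k is the complexity knob. Everything in the sketch is an assumption chosen for illustration: the toy data, the candidate values of k, and the use of validation error to pick among them.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mean_squared_error(train, data, k):
    """Mean squared error of the k-NN predictions on `data`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

# Toy training data (hypothetical): y = x plus alternating +1 / -1 noise
train = [(float(i), i + (1.0 if i % 2 == 0 else -1.0)) for i in range(10)]
# Held-out points drawn from the noise-free trend y = x
validation = [(i + 0.4, i + 0.4) for i in range(9)]

candidates = [1, 2, 10]
train_error = {k: mean_squared_error(train, train, k) for k in candidates}
validation_error = {k: mean_squared_error(train, validation, k) for k in candidates}

# Choose complexity by held-out error, never by training error
best_k = min(candidates, key=lambda k: validation_error[k])
```

With k = 1 the model reproduces every training point exactly, noise included, and does worse on the held-out points. With k = 10 it predicts the same global average everywhere. The intermediate setting averages the noise away while still following the trend.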
Lesson: Divide and Conquer Many simple models > Single complex model
[Figure: the 26 letters of the alphabet partitioned into groups: M W N U F P; V Y S Z B E I J; A; K R; H Q O G; L D; T; X C.]
• Better “localized features” • Simpler “local models” • More interpretable features and models • Higher Accuracy • Faster Modeling Time • Lower Resource Requirements
Lessons from Real-world Data Mining
Insights
Features
Labels
Models
Decisions
Lesson: Interpret Predictions What is the score? Why is score that way?
Concept Space Prediction Score Overlay
*This is not what we mean by the “art of data mining”
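When the score comes from a linear model, "why is the score that way?" has a simple answer: each feature's contribution is its weight times its value, and the largest contributions are the reasons. The sketch below assumes such a linear score. The talk does not say how its scores were explained, and the weights and feature names here are hypothetical.

```python
def explain_score(weights, features, bias=0.0):
    """Return the linear score and per-feature contributions, largest first."""
    contributions = {name: weights[name] * features[name] for name in weights}
    score = bias + sum(contributions.values())
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return score, ranked

# Hypothetical risk model and account, for illustration only
weights = {"late_payments": 0.8, "utilization": 0.5, "years_on_file": -0.1}
account = {"late_payments": 2, "utilization": 0.9, "years_on_file": 10}

score, reasons = explain_score(weights, account)
top_reason = reasons[0][0]
```

Two accounts with the same score can have different top reasons, which is why the later slide argues that the reason, not only the score, should drive the decision.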
Lesson: Learn Globally, Decide Locally
“The Ford-Firestone dispute blew up in August 2000 and is still going strong. In response to claims that their 15-inch Wilderness AT, radial ATX and ATX II tire treads were separating from the tire core leading to grisly, spectacular crashes. Bridgestone/Firestone recalled 6.5 million tires….” -- Forbes
Accidents description Density Overlay
Lesson: Prediction is not enough! Different Reasons, Different Decisions
Probability of defaulting Collection Notes
Summary
Decisions driven more by data than by “gut feeling”
Converting data to decisions is Art + Science + Engineering
Insights: Right needles in a well understood Haystack
Features: Garbage In, Garbage Out
Models: Generalize, don’t Memorize
Labels: Explore thoroughly, Exploit efficiently
Decisions: Right decision for the right reason
Feedback: Adapt features, models, scores, decisions
In theory, theory and practice are same.
In practice, they are not.
-- Lawrence Peter Berra
Questions?