Infovision 2011 Data to Decisions Shailesh Kumar, Google

From Data to Decisions: Learnings from Real-World Data Mining

Dr. Shailesh Kumar Google, Inc.

InfoVision 2011

Welcome to the Information Age … drowning in data and starving for knowledge

ATATTAGGTTTTTACCTACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTAGCTGTCGCTCGGCTGCATGCCTAGTGCACCTACGCAGTATAAACAATAATAAATTTTACTGTCGTTGACAAGAAACGAGTAACTCGTCCCTCTTCTGCAGACTGCTTATTACGCGACCGTAAGCTAC…

This data explosion is enabled by…

  Better “Sensors” – Higher Resolution, More Spectral Bands, Quick Experimental Turnaround, Crowd Sourcing…

  Higher Bandwidth Communication – Faster Networks and Routers, Better Compression technologies…

  Larger Warehouses – Cheaper Storage, Multi-Level Caching, Scalable Database/Data warehousing technologies…

  Massive Crunching Power – Faster Multi-core processors, Parallel Distributed Computing, MapReduce paradigms…

  Advances in Machine Learning and Data Mining – Sophisticated Learning frameworks, Distributed Data Mining…

From “Data” to “Decision”

(Diagram elements: Insights, Features, Models, Predictions, Decision, Feedback; inputs: Domain Knowledge, Business Objectives, Business Constraints)

Observation | Prediction | Decision

Credit Card Fraud. Input: Past card usage behavior | Predict: Fraudulent transaction? | Approve Transaction?

Credit Scoring. Input: Past payment behavior | Predict: Probability of Default | Approve Loan?

Retail Cross Sell. Input: Past purchase behavior | Predict: Response to a coupon | Send Coupon?

Building Machine Learning Models: The Process, the Art, and the Science

Collect Raw (Input) Data

Collect Target (Output) Labels (“ground truth”): Can be Costly!!

Choose: “Model Type” & “Model Complexity”
  Too Simple: Under-Learn
  Too Complex: Over-Learn
  Bias-Variance Tradeoff

Engineer and Select “Predictive” features
  •  Use Domain Knowledge
  •  Keep variability that matters
  •  Remove Redundancy

“Train” a model using Feature-Label training data set

“Evaluate” the trained model on “validation” data and iterate until satisfied

“Deploy” the model: Predict class label of all the “un-labeled” data

Lessons from Real-world Data Mining

Insights | Features | Labels | Models | Decisions

Looking for a Needle in a Haystack?

  What is the nature of my haystack (data)?
    What process generated the data?
    What assumptions am I making about the data?

  Is it the right needle (insight) to look for?
    Is it “actionable”? Is it “useful”? Is it “novel”?
    Does it tell me something I didn’t know?

Insight Discovery ≠ Hypothesis Testing

The Traditional Market Basket Analysis: Wrong needle in a mysterious haystack!

FREQUENT ITEM-SETS (Size = 1) → CANDIDATE ITEM-SETS (Size = 2) → FREQUENT ITEM-SETS (Size = 2) → CANDIDATE ITEM-SETS (Size = 3) → FREQUENT ITEM-SETS (Size = 3)
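The frequent/candidate alternation above can be sketched in a few lines. This is a minimal Apriori-style illustration, not code from the talk; the function name `frequent_itemsets`, the toy baskets, and the `min_support` value are all assumptions made for the example.

```python
from itertools import combinations

def frequent_itemsets(baskets, min_support):
    """Level-wise search: frequent sets of size k seed the candidates of size k + 1."""
    baskets = [frozenset(b) for b in baskets]

    def support(itemset):
        # Number of baskets that contain every item of the itemset.
        return sum(1 for b in baskets if itemset <= b)

    items = sorted({item for b in baskets for item in b})
    frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    result = {}
    size = 1
    while frequent:
        result[size] = {fs: support(fs) for fs in frequent}
        current = set(frequent)
        # Candidates: unions of two frequent sets that grow the size by exactly one,
        # kept only if every subset of the current size is itself frequent.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == size + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, size))}
        frequent = [c for c in candidates if support(c) >= min_support]
        size += 1
    return result

# Hypothetical baskets, invented for illustration only.
baskets = [
    {"tent", "airbed", "lantern"},
    {"tent", "airbed"},
    {"tent", "lantern"},
    {"airbed", "lantern"},
    {"tent", "airbed", "grill"},
]
levels = frequent_itemsets(baskets, min_support=3)
```

On this toy data the three-item group tent/airbed/lantern sits in only one basket, so it never clears the support threshold even though its items clearly belong together.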

Lesson: Know your data (Haystack)

What process generated the data?

Each basket is a mixture of projections of latent intentions. Customers rarely buy everything an intention calls for, because they:

  already have other products

  buy them from another retailer

  buy them at a different time

  got them as gifts

  ….

Few buy a complete “logical” product group in the same basket

Lesson: Extract the essence, let go of data

Pair-wise Co-occurrence Statistics
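A small sketch of what pair-wise co-occurrence statistics might look like in practice. It only counts how many baskets contain each pair of items; the helper names `pair_counts` and `all_pairs_cooccur` and the toy baskets are assumptions for illustration, and the talk's actual statistics may well be more elaborate.

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    """Count, for every unordered item pair, the number of baskets containing both."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical (a, b) key with a < b.
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
    return counts

def all_pairs_cooccur(items, counts):
    """True if every pair within `items` was seen together in at least one basket."""
    return all(counts[pair] > 0 for pair in combinations(sorted(items), 2))

# Hypothetical baskets: no single basket holds tent, airbed and lantern together.
baskets = [
    {"tent", "airbed"},
    {"tent", "lantern"},
    {"airbed", "lantern"},
    {"tent", "airbed", "grill"},
]
counts = pair_counts(baskets)
group_is_connected = all_pairs_cooccur({"tent", "airbed", "lantern"}, counts)
```

Here the full three-item group never occurs in one basket, yet every pair inside it does, which is the kind of structure a frequency threshold on whole itemsets would miss.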

Lesson: Look for the right Insight

“Frequent” vs. “Logical” Itemset

  Novel – Not obvious from the data (support = 0)
  Useful – product bundling, recommendations, layout
  Exhaustive – “No insight left behind!” – however “rare”

Example logical itemsets (figure):

  Airbeds, Lighting, Folding Furniture, Camping Accessories, Grill Accessories, Inflatables, Water Sports, Lighting, Patio Accessories, Furniture

  Projection TV, Flat Panel TV, Home Theatre Services, Digital Cable TV, Home Components, Speakers

Insights | Features | Labels | Models | Decisions

Two Mindsets to Modeling

Model-Centric
  •  Throw all features in!
  •  Have enough data
  •  Build Complex models

Feature-centric
  •  Carefully craft features
  •  Use Domain Knowledge
  •  Build Simpler Models

Simple Features + Complex Model vs. Complex Features + Simple Model

The Law of Conservation of Complexity

Lesson: Distribute Complexity well

Simplify Models with complex features


Lesson: Overcome model limitations

Age < 60

Income < Rs. 32

Education < 20

log (Income) - B x Age < 12
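One way to read the thresholds above: a model that can only test one raw variable at a time struggles with a boundary that mixes two of them, while a single test on a combined feature captures it directly. The sketch below illustrates that reading under stated assumptions; the coefficient `B`, the cutoff, the data grid, and the helper `best_stump_accuracy` are all hypothetical.

```python
import math

def best_stump_accuracy(values, labels):
    """Best accuracy of a single-threshold rule (value < t), in either polarity."""
    best = 0.0
    n = len(values)
    for t in sorted(set(values)) + [math.inf]:
        hits = sum((v < t) == y for v, y in zip(values, labels))
        # (n - hits) is the accuracy of the opposite rule (value >= t).
        best = max(best, hits / n, (n - hits) / n)
    return best

B, CUTOFF = 0.05, 9.0  # hypothetical coefficient and cutoff, not taken from the slide
people = [(age, income) for age in (20, 40, 60) for income in (10_000, 50_000, 250_000)]
combined = [math.log(income) - B * age for age, income in people]
labels = [score < CUTOFF for score in combined]  # ground truth defined by the combined rule

acc_age = best_stump_accuracy([age for age, _ in people], labels)
acc_income = best_stump_accuracy([income for _, income in people], labels)
acc_combined = best_stump_accuracy(combined, labels)
```

On this grid no single threshold on age or income alone separates the labels perfectly, whereas one threshold on the engineered feature does.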

Insights | Features | Labels | Models | Decisions (Text)

Lesson: Things are not what they appear

What is a word in “Bag-of-Words”?

  Segmentation: What is a word?
    New York Stock Exchange: 4 words?
    “New York” “Stock Exchange”: 2 phrases?
    “New York Stock Exchange”: 1 phrase?

  Disambiguation: What does a word mean?
    ‘rock band’, ‘rock climbing’, ‘rocking chair’, ‘the rock’

  Equivalencing: How “similar” are two terms?
    Comparing Apples to Oranges…
    Orange Juice, Orange Flag, Orange Blog,
    Apple store, Apple pie, The Big Apple

Equivalencing
  we filed a suit charging dell of illegal behavior
  they submitted a case accusing apple of unauthorized conduct

Disambiguation
  i was right to avoid a suit against apple
  on my right was a man in a suit drinking apple juice

You shall know a word by the company it keeps -- Firth, J. R. 1957:11

SIMILARITY = 0.995

SIMILARITY = 0.171
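Firth's idea can be sketched by describing each word through the words that appear near it and comparing those descriptions. The code below is a minimal illustration only: the tiny corpus, the window size, and the function names are assumptions, and it does not reproduce the similarity figures shown on the slide.

```python
import math
from collections import Counter

def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            neighbours = words[lo:i] + words[i + 1:hi]
            vectors.setdefault(word, Counter()).update(neighbours)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical mini-corpus, loosely modelled on the example sentences above.
corpus = [
    "we filed a suit against dell",
    "we filed a case against apple",
    "he wore a suit and tie",
    "she drank apple juice",
]
vecs = context_vectors(corpus)
suit_vs_case = cosine(vecs["suit"], vecs["case"])
suit_vs_juice = cosine(vecs["suit"], vecs["juice"])
```

In this corpus "suit" and "case" share legal neighbours such as "filed" and "against", so they score higher than "suit" and "juice", which share none.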

Insights | Features | Labels | Models | Decisions

Labels are precious – use them well

  Labeled data vs. Unlabeled data
    Lots of input data! (e.g. web pages)
    Small fraction is labeled! (e.g. spam/not)

  Labels can be
    Costly – human judgments, costly experiments, rare events
    Noisy – web clicks, crowd sourced,…

  How do we use unlabeled data with labeled data?
    Semi-supervised Learning

  Which unlabeled data point to get labeled next?
    Active Learning
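One common answer to "which point to label next" is uncertainty sampling: ask for a label where the current model is least sure. The sketch below assumes that strategy, which the slide does not name; the scorer `toy_spam_probability` and the example pool are hypothetical stand-ins for a real model and real pages.

```python
def most_uncertain(unlabeled, predict_proba):
    """Return the unlabeled item whose predicted probability is closest to 0.5."""
    return min(unlabeled, key=lambda x: abs(predict_proba(x) - 0.5))

def toy_spam_probability(text):
    """Hypothetical scorer: the fraction of words that look spammy."""
    spam_words = {"free", "winner", "prize"}
    words = text.lower().split()
    if not words:
        return 0.5  # no evidence either way
    return sum(w in spam_words for w in words) / len(words)

pool = [
    "free prize winner",         # scored 1.0: confidently spam
    "meeting notes attached",    # scored 0.0: confidently not spam
    "free lunch seminar today",  # scored 0.25: the least certain of the three
]
next_to_label = most_uncertain(pool, toy_spam_probability)
```

Spending the labeling budget on the borderline page rather than on the two confident ones is the intuition behind treating labels as precious.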

Insights | Features | Labels | Models | Decisions

Lesson: Don’t beat data into submission

Model Complexity no more than necessary

  How many hidden units in a neural network?
  How deep a decision tree?
  How much cost for “misclassification elasticity” in SVM?
  How many clusters? or modes in mixture of density?

Model is too simple: under-learn

Model is too complex: memorize

Model is just right: generalize
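The three regimes can be seen on a toy problem by sweeping one complexity knob and scoring each setting on held-out data. The sketch below uses a 1-D k-nearest-neighbour classifier, where a small k is the more complex model; the data, the two mislabelled points in `NOISY`, and the candidate values of k are all assumptions chosen for illustration.

```python
def knn_predict(train, query, k):
    """Majority vote among the k training points nearest to `query` (1-D inputs)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, data, k):
    """Fraction of (x, y) pairs in `data` that the k-NN model gets right."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

NOISY = {3, 16}  # hypothetical mislabelled training points
# True rule: label is 1 when x >= 10; the two noisy points have their label flipped.
train = [(x, int(x >= 10) ^ int(x in NOISY)) for x in range(21)]
validation = [(x + 0.25, int(x + 0.25 >= 10)) for x in range(20)]

candidates = (1, 5, 21)
scores = {k: (accuracy(train, train, k), accuracy(train, validation, k)) for k in candidates}
best_k = max(candidates, key=lambda k: scores[k][1])
```

With k = 1 the model reproduces the training labels exactly, noise included, and pays for it on validation; with k = 21 every prediction is the overall majority class; the middle setting scores best on the validation set.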

Lesson: Divide and Conquer

Many simple models > Single complex model

(Figure: letters grouped as M W N U F P; V Y S Z B E I J; H Q O G)

  •  Better “localized features”
  •  Simpler “local models”
  •  More interpretable features and models
  •  Higher Accuracy
  •  Faster Modeling Time
  •  Lower Resource Requirements

Insights | Features | Labels | Models | Decisions

Lesson: Interpret Predictions

What is the score? Why is the score that way?

Concept Space Prediction Score Overlay

*This is not what we mean by the “art of data mining”

Lesson: Learn Globally, Decide Locally

“The Ford-Firestone dispute blew up in August 2000 and is still going strong. In response to claims that their 15-inch Wilderness AT, radial ATX and ATX II tire treads were separating from the tire core leading to grisly, spectacular crashes, Bridgestone/Firestone recalled 6.5 million tires….” -- Forbes

Accidents description Density Overlay

Lesson: Prediction is not enough!

Different Reasons, Different Decisions

Probability of defaulting Collection Notes

Summary

  Decisions driven more by data than by “gut feeling”

  Converting data to decisions is Art + Science + Engineering

  Insights: Right needles in a well understood Haystack

  Features: Garbage In, Garbage Out

  Models: Generalize, don’t Memorize

  Labels: Explore thoroughly, Exploit efficiently

  Decisions: Right decision for the right reason

  Feedback: Adapt features, models, scores, decisions

In theory, theory and practice are same.

In practice, they are not.

-- Lawrence Peter Berra

Questions?
