
Before Kaggle From a business goal to a ML problem

Pierre Gutierrez  @prrgutierrez

•  A Data Science competitions platform (there are others : DataScience.net in France)

•  332,000 Data Scientists

•  today : 192 competitions, 18 active + 516 In class, 12 active •  Prestigious clients : Axa, Cern, Caterpillar, Facebook, GM, Microsoft, Yandex…

What is Kaggle ?

•  Prize pool? •  $325,000 to be won as of August 31st •  Good luck with that! •  Not a good hourly wage

•  today : 192 competitions, 18 active. Understand : •  Lots of datasets about approximately every DS topic •  Lots of winners’ solutions, tips and tricks, etc. •  Lots of “beat the benchmark” scripts for beginners

I discovered/tested there : GBT, xgboost, Keras, word2vec, BeautifulSoup, hyperopt, ...

Why should I join ?

Most of the time:

•  You have a train set with labels and a test set without labels. •  You need to learn a model using the train features and predict the test set labels •  Your prediction is evaluated using a specific metric •  The best prediction wins

What is a Data Science Competition ?
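A minimal sketch of this train / predict / submit loop (the file names, the id and target columns, and the choice of model are illustrative, not tied to any particular competition):

```python
# Sketch of a typical competition loop; file and column names ("id", "target")
# are hypothetical and depend on the actual competition.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

train = pd.read_csv("train.csv")   # features + labels
test = pd.read_csv("test.csv")     # features only, labels are hidden

features = [c for c in train.columns if c not in ("id", "target")]

model = GradientBoostingClassifier()
model.fit(train[features], train["target"])

# The submission is scored against the hidden test labels
# with the competition's metric (AUC, log loss, ...).
submission = pd.DataFrame({
    "id": test["id"],
    "target": model.predict_proba(test[features])[:, 1],
})
submission.to_csv("submission.csv", index=False)
```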


Why AUC? F1 score? Log loss?

Could that depend on my train/test split?

Where do they come from? Do you always have some?

Why is the split this way? Random? Time?

What you don’t learn on Kaggle (or in class?): •  How to model a business question into a ML problem. •  How to manage/create labels. (proxy / missing…) •  How to evaluate a model:

•  How to choose your metric •  How to design your train/test split •  How to account for this in feature engineering

Understanding this actually helps you in Kaggle competitions : •  How to design your cross-validation scheme (and not overfit) •  How to create relevant features •  Hacks and tricks (leak exploitation :) )

What is a Data Science Competition?

Scikit learn cheat sheet

Christophe Bourguignat DS cheat sheet

@chris_bour    

Today  

•  Introduction •  Labels? •  Train and test split? •  Feature Engineering? •  Evaluation Metric?

Introduction


Introduction : The newcomer disillusion

The production bad surprise

The business obfuscation

•  Senior Data Scientist at Dataiku (worked on churn prediction, fraud detection, bot detection, recommender systems, graph analytics, smart cities,…)

•  (More than) Occasional Kaggle competitor •  Twitter @prrgutierrez

Who I am


•  Fraud is everywhere : e-business, telco, Medicare,…

•  Easily defined as a classification problem

•  Is the target well defined ? •  E-business : yes, with a lag •  Elsewhere : checks are needed, labels are expensive

Fraud Detection

•  Wikipedia: “Churn rate (sometimes called attrition rate), in its broadest sense, is a measure of the number of individuals or items moving out of a collective group over a specific period of time” = Customer leaving

Churn

•  Subscription models: •  Telco •  E-gaming (WoW) •  Ex : Coyote -> 1-year subscription

-> you know when someone leaves •  Non-subscription models:

•  E-business (Amazon, Price Minister, Vente Privée) •  E-gaming (Candy Crush, free MMORPGs)

-> you approximate when someone leaves. Candy Crush: days / weeks; MMORPG: 2 months (holidays); Price Minister: months

Two types of Churn

•  Predict if a vehicle / machine / part is going to fail •  Classification Problem:

•  Given a future horizon and a failure type : will this happen for a given vehicle ? -> 2 parameters describe the target •  Varying the target a lot -> spurious correlations •  Choose it from the exact business need

Predictive Maintenance

•  Target is “will like” or “will buy” •  Target is often a proxy of the real interest (implicit feedback)

Recommender System

•  Can you model the problem as an ML problem? •  Ex : predictive maintenance •  Ask the right question from a business point of view, not the one you already know how to answer.

•  Is your target a proxy? •  Recommendation system •  May need bandit algorithm

•  Is it easy to get labels? •  Ex : Fraud detection •  Can be expensive •  Mechanical Turk can be the answer

Summary on Labels

•  Random Split •  Just like in school

Train / test split  

•  When and why?
-> When each line is independent from the rest (not that common!) : image or document classification, sentiment analysis (“but aha is the new lol”)
-> When you want to quickly iterate / benchmark : “is it even possible?”
-> When you want to sell something to your boss
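A minimal sketch of the plain random split with scikit-learn, on toy data standing in for independent rows:

```python
# Plain random split, a reasonable default when rows are independent
# (image or document classification, sentiment analysis, ...).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # toy stand-in data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```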

•  Column / group based •  Ex : Caterpillar challenge •  Predict a price •  for each tube id •  Tube ids in train and test are different

Objective : being able to generalize to other tubes! (see the sketch below)

Train / test split
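A sketch of a group-based split in the spirit of the Caterpillar setup, using scikit-learn's GroupShuffleSplit on made-up data (the tube-id groups here are synthetic):

```python
# Group-based split: a tube id never appears in both train and test,
# so the evaluation measures generalization to unseen tubes.
# The data below is synthetic; only the splitting logic matters.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_rows = 1000
tube_ids = rng.integers(0, 200, size=n_rows)   # group identifier per row
X = rng.normal(size=(n_rows, 5))               # tube features
y = rng.normal(size=n_rows)                    # e.g. the price to predict

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=tube_ids))

assert set(tube_ids[train_idx]).isdisjoint(tube_ids[test_idx])
```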

•  Time based •  Simply separate train and test on a time variable

•  When and Why? -> When you want a model that “predicts the future” -> When things evolve with time! (most problems!) -> Examples :

Ad click prediction, Churn prediction, E-business fraud detection, Predictive maintenance,…

Train / test split
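A minimal sketch of a time-based split with pandas (the table and the cutoff date are made up):

```python
# Time-based split: train on the past, evaluate on "the future".
import pandas as pd

df = pd.DataFrame({                      # hypothetical event table
    "date": pd.date_range("2015-01-01", periods=365, freq="D"),
    "feature": range(365),
    "target": [i % 2 for i in range(365)],
})

cutoff = pd.Timestamp("2015-10-01")      # everything after this is "the future"
train = df[df["date"] < cutoff]
test = df[df["date"] >= cutoff]
```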

•  Non-subscription example •  Target : 4 months without buying •  Features ?

Train / test split : Churn example

Ex : Train and predict scheme

(Timeline diagram) On a time axis ending at T (present time), with a mark at T – 4 months : data before T – 4 months is used for feature generation; data from the last 4 months (T – 4 months to T) is used for target creation (activity during the last 4 months). Train the model using these features and target, then use the model to predict future churn.

Ex : Train, Evaluation and Predict Scheme

(Timeline diagram) Same axis with marks at T – 8 months, T – 4 months and T (present time). Training : data before T – 8 months is used for feature generation, data from T – 8 months to T – 4 months for target creation (activity during those 4 months). Validation set : data before T – 4 months is used for feature generation, data from the last 4 months (T – 4 months to T) for target creation. Evaluate on the target of the validation set, then use the model to predict future churn.
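A minimal sketch of this snapshot construction with pandas, assuming a hypothetical purchases table with customer_id, date and amount columns (the names, the toy data and the 4-month horizon are illustrative, not the speaker's actual code):

```python
import pandas as pd

# Hypothetical transactions table (the real one would be much larger).
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "date": pd.to_datetime(["2015-02-01", "2015-09-15", "2015-03-10",
                            "2015-01-05", "2015-06-20", "2015-11-30"]),
    "amount": [20.0, 35.0, 15.0, 50.0, 10.0, 25.0],
})

def make_snapshot(purchases, cutoff, horizon_months=4):
    """One (features, target) snapshot: features from data strictly before
    `cutoff`, target = "no purchase during the `horizon_months` after it"."""
    horizon_end = cutoff + pd.DateOffset(months=horizon_months)
    past = purchases[purchases["date"] < cutoff]
    future = purchases[(purchases["date"] >= cutoff)
                       & (purchases["date"] < horizon_end)]

    features = past.groupby("customer_id").agg(
        n_orders=("amount", "size"),
        total_spent=("amount", "sum"),
        last_purchase=("date", "max"),
    )
    features["days_since_last"] = (cutoff - features["last_purchase"]).dt.days
    features = features.drop(columns="last_purchase")

    # Churn = no activity at all during the next 4 months.
    churned = ~features.index.isin(future["customer_id"].unique())
    return features, pd.Series(churned, index=features.index, name="churn")

T = pd.Timestamp("2016-01-01")  # "present time" in this toy example

# Training snapshot at T - 8 months, validation snapshot at T - 4 months,
# mirroring the timeline above.
X_train, y_train = make_snapshot(purchases, T - pd.DateOffset(months=8))
X_valid, y_valid = make_snapshot(purchases, T - pd.DateOffset(months=4))
```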

•  More complex designs •  Graph sampling (fraud rings?) •  Random sampling in a client’s / machine’s life •  Mix of column-based and time-based splits …

•  The rule :
1)  What is the problem ?
2)  To what would I like to generalize my model ? The future ? Other individuals ? …
3)  => Train / test split

Train / test split

•  Predictive Maintenance problem

•  Objective : predict failure in the next 3 days.

•  Metric is proportional to accuracy (and 0.57 is the best score !) •  Link to data : https://www.phmsociety.org/events/conference/phm/14/data-challenge

EX PHM Society (Fail example)

•  Failures

EX PHM Society

•  Usage

EX PHM Society

•  Part Replacements

EX PHM Society

•  How to design the evaluation scheme? •  What is the probability that an asset fails in the next 3 days from now? -> classification problem -> time-based split -> but how do I create a train and a test set? •  Choose a date and evaluate what happens 3 days later? -> problem : not enough failures happening •  Choose several dates for each asset? -> beware of asset over-fitting •  In the challenge : random selection of (asset, date) pairs in the future + oversampling of failures (see the sketch below).

EX PHM Society
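A rough sketch of that sampling and labelling step, on hypothetical failure-log data (column names, sizes and dates are made up; the real challenge data differs):

```python
import numpy as np
import pandas as pd

# Hypothetical failure log: one row per (asset, failure date).
failures = pd.DataFrame({
    "asset_id": [1, 2, 2, 5],
    "failure_date": pd.to_datetime(["2014-03-10", "2014-04-02",
                                    "2014-07-19", "2014-05-30"]),
})

rng = np.random.default_rng(0)
assets = np.arange(1, 21)
dates = pd.date_range("2014-01-01", "2014-12-31", freq="D").to_numpy()

# Random (asset, date) evaluation points, as in the challenge design.
sample = pd.DataFrame({
    "asset_id": rng.choice(assets, size=500),
    "date": rng.choice(dates, size=500),
})

# Label each point: does the asset fail within the next 3 days?
merged = sample.merge(failures, on="asset_id", how="left")
days_to_failure = (merged["failure_date"] - merged["date"]).dt.days
merged["fails_in_3_days"] = days_to_failure.between(0, 3)
labels = (merged.groupby(["asset_id", "date"])["fails_in_3_days"]
                .any()
                .reset_index())

# Failures are rare, so positive rows would then be oversampled
# (e.g. duplicated) before training.
```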

•  Basic Feature engineering

EX PHM Society

•  Random Sampling

EX PHM Society

This is decent! “With some more work I could have a model that beats randomness enough to be useful”

•  Time based split

EX PHM Society

Wait  what?    

•  TIME LEAK

EX PHM Society


Tree  cuts  

•  Beware of the distribution of your features! •  Is there a time dependency?

•  Ex : count, sum, … that will only increase with time •  -> Calculate counts and sums rescaled by time / in moving windows instead (see the sketch below). •  Can be found in Churn, Fraud detection, Ad click prediction,…

•  A categorical variable dependency? •  Ex : email flag in fraud detection

•  Is there a Network dependency? •  Ex : Fraud / Bot detection (network features can be useful but leaky)

Feature Engineering
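A small pandas sketch of such time-rescaled and moving-window counts, on a hypothetical per-customer event log (column names and dates are illustrative):

```python
import pandas as pd

# Hypothetical per-customer event log.
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2015-01-03", "2015-02-10", "2015-06-01",
                            "2015-03-05", "2015-03-20"]),
    "amount": [10.0, 25.0, 5.0, 40.0, 12.0],
})
snapshot = pd.Timestamp("2015-07-01")   # the date at which features are computed

# Raw cumulative counts and sums only grow with time, so their distribution
# drifts between train and test; rescale by observed lifetime or restrict
# to a recent window instead.
agg = events.groupby("customer_id").agg(
    n_events=("amount", "size"),
    total_amount=("amount", "sum"),
    first_seen=("date", "min"),
)
lifetime_days = (snapshot - agg["first_seen"]).dt.days.clip(lower=1)
agg["events_per_day"] = agg["n_events"] / lifetime_days

# Moving-window variant: events in the 90 days before the snapshot.
recent = events[events["date"] >= snapshot - pd.Timedelta(days=90)]
agg["n_events_last_90d"] = (recent.groupby("customer_id").size()
                                  .reindex(agg.index, fill_value=0))
```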

•  Final trick (sketched below) :
-  Stack train and test and add an is_test boolean
-  Try to predict is_test
-  Check whether the model is able to predict it
-  If so :
-  check the feature importances
-  Remove / modify the feature and iterate

Feature Engineering
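A minimal sketch of this trick (often called adversarial validation), assuming train_df and test_df hold the same numeric feature columns with the label already dropped from train:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def adversarial_validation(train_df, test_df):
    """Stack train and test and try to predict which rows come from test.

    `train_df` and `test_df` are assumed to hold the same numeric feature
    columns, with the label already dropped from train.
    """
    stacked = pd.concat([train_df, test_df], ignore_index=True)
    is_test = np.r_[np.zeros(len(train_df)), np.ones(len(test_df))]

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    auc = cross_val_score(clf, stacked, is_test, cv=5, scoring="roc_auc").mean()

    # AUC well above 0.5 means some features separate train from test;
    # their importances point at the likely leak or drift.
    clf.fit(stacked, is_test)
    importances = pd.Series(clf.feature_importances_, index=stacked.columns)
    return auc, importances.sort_values(ascending=False)

# Usage (hypothetical): auc, imp = adversarial_validation(train[feats], test[feats])
```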

•  Final trick:

•  Back to the PHM example:

Feature Engineering

Huge time leak!

•  “Threshold dependent” :
•  Accuracy (not good if the target is unbalanced)
•  Precision and Recall
•  F1 score (an accuracy alternative)

•  “Threshold independent” :
•  AUC (when you have an ordering problem)
•  Log Loss (when you are going stochastic)
•  Others (Mean Average Precision)…

•  Custom (when you need to stick to the business)

Evaluation metric : Classification

•  Custom metrics •  Cost based

•  Ex Fraud: •  Mean loss of 50 $ / fraud (FN) •  Mean loss of 20 $ / wrongly cancelled transaction (FP)

•  F1 score is often used in papers •  in practice, you often have a business cost (see the sketch below)

Evaluation metric : Classification

(Confusion matrix diagram : TP / FN / FP / TN quadrants)
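A minimal sketch of such a cost-based metric using the figures above (50 $ per missed fraud, 20 $ per wrongly cancelled transaction); the function name and signature are illustrative:

```python
import numpy as np

def fraud_cost(y_true, y_pred, fn_cost=50.0, fp_cost=20.0):
    """Total business cost of a set of decisions.

    y_true: 1 if the transaction really is a fraud, 0 otherwise.
    y_pred: 1 if we cancel/block the transaction, 0 if we accept it.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fn = np.sum((y_true == 1) & (y_pred == 0))   # missed frauds
    fp = np.sum((y_true == 0) & (y_pred == 1))   # good transactions cancelled
    return fn * fn_cost + fp * fp_cost

# Hypothetical use: pick the score threshold that minimizes the expected cost.
# scores = model.predict_proba(X_valid)[:, 1]
# best_t = min(np.linspace(0, 1, 101),
#              key=lambda t: fraud_cost(y_valid, scores >= t))
```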

•  Custom metrics •  Fraud Example 1:

•  “I have fraudsters on my e-business website” •  I generate a score for each transaction •  I handle this by manually reviewing transactions with a score higher than a threshold •  I have 1 person doing this full time, able to deal with 100 transactions / day •  The rest is automatically accepted

-> AUC is not bad -> Recall within the top 100 transactions / day -> Total money blocked in the top 100 transactions / day. In practice AUC is more stable… but the money metric can also be used for communication (see the sketch below).

Evaluation metric : Classification
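A sketch of the "recall in the top 100 transactions per day" idea, assuming a hypothetical scored-transactions table with date, score and is_fraud columns (the toy data and k value are made up):

```python
import pandas as pd

def daily_top_k_recall(df, k=100):
    """Share of frauds caught if an analyst reviews only the k highest-scored
    transactions each day. Columns date / score / is_fraud are assumed."""
    reviewed = (df.sort_values("score", ascending=False)
                  .groupby("date")
                  .head(k))
    total_frauds = df["is_fraud"].sum()
    return reviewed["is_fraud"].sum() / total_frauds if total_frauds else 0.0

# Tiny made-up example: 2 reviews per day catch 2 of the 3 frauds.
scored = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01"] * 4 + ["2015-01-02"] * 4),
    "score": [0.9, 0.2, 0.8, 0.1, 0.7, 0.6, 0.05, 0.3],
    "is_fraud": [1, 0, 0, 0, 1, 0, 1, 0],
})
print(daily_top_k_recall(scored, k=2))
```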

•  Custom metrics •  Fraud Example 2:

•  “I have fraudsters on my e-business website” •  I generate a score for each transaction •  I handle this automatically by blocking all transactions with a score higher than a threshold

-> AUC is not bad… but it doesn’t give a threshold value. -> F1-score? -> Cost-based is better

Evaluation metric : Classification

•  My cheat sheet

Evaluation metric : Classification

Metric        | Optimized by the ML model? | Threshold dependent | Application example
Accuracy      | YES                        | YES                 | image classification, NLP…
F1-score      | NO                         | YES                 | ? papers ?
AUC           | NO                         | NO                  | fraud detection, churn, healthcare…
Log-Loss      | YES                        | NO                  | ad click prediction
Custom metric | NO                         | ?                   | all ?

•  The Business Question dictates the Evaluation Scheme! •  test set design •  evaluation metric •  It indirectly impacts feature engineering •  It indirectly impacts label quality

•  Think (not too much) before coding

•  Don’t try to optimize the wrong problem!

Conclusion

Thank you for your attention!