• Data Science competitions platform (There are others : DataScience.net in France)
• 332,000 Data Scientists
• Today: 192 competitions, 18 active + 516 in-class, 12 active • Prestigious clients: AXA, CERN, Caterpillar, Facebook, GM, Microsoft, Yandex…
What is Kaggle?
• Prize pool? • $325,000 up for grabs as of August 31st • Good luck with that! • Not a good hourly wage
• Today: 192 competitions, 18 active. Understand: • Lots of datasets about approximately every DS topic • Lots of winner solutions, tips and tricks, etc. • Lots of “beat the benchmark” scripts for beginners
I discovered/tested there : GBT, xgboost, Keras, word2vec, BeautifulSoup, hyperopt, ...
Why should I join ?
Most of the time:
• You have a train set with labels and a test set without labels. • You need to learn a model using the train features and predict the test set labels • Your prediction is evaluated using a specific metric • The best prediction wins
What is a Data Science Competition?
Why AUC? F1 score? Log loss?
Could that depend on my train/test split?
Where do they come from ? Do you always have some?
Why is the split this way? Random? Time?
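To make that loop concrete, here is a minimal sketch of the usual competition workflow; the data is synthetic and AUC is picked purely for illustration, since every competition fixes its own metric:

```python
# Minimal sketch of the competition loop: fit on labelled train data,
# predict the (hidden-label) test set, get scored by a fixed metric.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
# "Train" has labels; the test labels are hidden from competitors.
X_train, X_test, y_train, y_test_hidden = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)
submission = model.predict_proba(X_test)[:, 1]  # predictions on the test set

# The platform scores the submission against the hidden labels.
print("AUC:", roc_auc_score(y_test_hidden, submission))
```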
What you don’t learn on Kaggle (or in class?): • How to model a business question as an ML problem. • How to manage/create labels (proxy / missing…) • How to evaluate a model:
• How to choose your metric • How to design your train/test split • How to account for this in feature engineering
Understanding this actually helps you in Kaggle competitions: • How to design your cross-validation scheme (and not overfit) • How to create relevant features • Hacks and tricks (leak exploitation :) )
What is a Data Science Competition?
• Introduction • Labels? • Train and test split? • Feature Engineering? • Evaluation Metric?
Introduction
Introduction The newcomer disillusion
The production bad surprise
The business obfuscation
• Senior Data Scientist at Dataiku (worked on churn prediction, fraud detection, bot detection, recommender systems, graph analytics, smart cities,…)
• (More than) Occasional Kaggle competitor • Twitter @prrgutierrez
Who I am
• Fraud is everywhere: e-business, telco, Medicare,…
• Easily defined as a classification problem
• Is the target well defined? • E-business: yes, with a lag • Elsewhere: checks are needed, labels are expensive
Fraud Detection
• Wikipedia: “Churn rate (sometimes called attrition rate), in its broadest sense, is a measure of the number of individuals or items moving out of a collective group over a specific period of time” = Customer leaving
Churn
• Subscription models: • Telco • E-gaming (WoW) • Ex: Coyote -> 1-year subscription
-> you know when someone leaves • Non-subscription models:
• E-business (Amazon, Price Minister, Vente Privée) • E-gaming (Candy Crush, free MMORPGs)
-> you approximate when someone has left: Candy Crush: days / weeks; MMORPG: 2 months (holidays); Price Minister: months
Two types of Churn
• Predict if a vehicle / machine / part is going to fail • Classification Problem:
• Given a future horizon and a failure type: will this failure happen for a given vehicle? -> 2 parameters describe the target • Varying the target too much -> spurious correlations • Just choose it to match the exact business need
Predictive Maintenance
• Target is “will like” or “will buy” • Target is often a proxy of real interest (implicit feedback)
Recommender System
• Can you model the problem as an ML problem? • Ex: predictive maintenance • Ask the right question from a business point of view, not the one you already know how to answer.
• Is your target a proxy? • Recommendation system • May need bandit algorithm
• Is it easy to get labels? • Ex : Fraud detection • Can be expensive • Mechanical Turk can be the answer
Summary on Labels
• Random Split • Just like in school
Train / test split
• When and why? -> When each line is independent from the rest (not that common!)
image / document classification, sentiment analysis (“but aha is the new lol”) -> When you want to quickly iterate / benchmark: “is it even possible?” -> When you want to sell something to your boss
• Column / group based • Ex: Caterpillar challenge • Predict a price for each tube id • Tube ids in train and test are different
Objective: be able to generalize to other tubes! (see the split sketch below)
Train / test split
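A minimal sketch of such a group-based split, using scikit-learn's GroupShuffleSplit; the column names (tube_assembly_id, quantity, cost) only mirror the spirit of the Caterpillar data and are assumptions here:

```python
# Group-based split: all rows of a given tube id end up on the same side,
# so evaluation measures generalization to unseen tubes.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "tube_assembly_id": ["TA-1", "TA-1", "TA-2", "TA-2", "TA-3", "TA-4"],
    "quantity":         [1, 5, 1, 10, 2, 1],
    "cost":             [21.9, 12.3, 8.1, 5.4, 30.0, 17.2],
})

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["tube_assembly_id"]))

train, test = df.iloc[train_idx], df.iloc[test_idx]
# No tube id appears on both sides of the split.
assert set(train["tube_assembly_id"]).isdisjoint(test["tube_assembly_id"])
```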
• Time based • Simply separate train and test on a time variable
• When and Why? -> When you want a model that “predict the future” -> When things evolve with time! (most problems!) -> Examples :
Ad click prediction, churn prediction, e-business fraud detection, predictive maintenance,… (see the cutoff sketch below)
Train / test split
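A minimal sketch of a time-based split, assuming an event table with a timestamp column (names and dates are illustrative):

```python
# Time-based split: train on the past, test on the future.
import pandas as pd

events = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2015-01-03", "2015-02-10", "2015-03-22", "2015-04-05", "2015-05-19"]),
    "label": [0, 1, 0, 0, 1],
})

cutoff = pd.Timestamp("2015-04-01")
train = events[events["timestamp"] < cutoff]   # everything before the cutoff
test = events[events["timestamp"] >= cutoff]   # everything after it
```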
• Non-subscription example • Target: 4 months without buying • Features?
Train / test split : Churn example
Ex : Train and predict scheme (timeline figure)
Training: data up to T – 4 months is used for feature generation; the last 4 months, up to T (the present), are used for target creation (activity during the last 4 months). Train the model using these features and this target.
Prediction: recompute the features at T and use the model to predict future churn.
Ex : Train, Evaluation and Predict Scheme (timeline figure)
Training: data up to T – 8 months is used for feature generation; the window from T – 8 to T – 4 months is used for target creation (activity during those 4 months).
Validation set: data up to T – 4 months is used for feature generation; the last 4 months, up to T (the present), are used for target creation. Evaluate on the target of the validation set.
Prediction: recompute the features at T and use the model to predict future churn. (A minimal sketch of this scheme follows.)
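A minimal sketch of the scheme above, assuming a purchases table with customer_id, date and amount columns; the column names and the 4-month horizon mirror the example but are assumptions:

```python
# Churn scheme sketch: features come only from data before the cutoff,
# the target from the window after it; the validation cutoff is shifted
# 4 months later than the training cutoff.
import pandas as pd

def make_dataset(purchases: pd.DataFrame, cutoff: pd.Timestamp, horizon_months: int = 4):
    past = purchases[purchases["date"] < cutoff]
    window_end = cutoff + pd.DateOffset(months=horizon_months)
    window = purchases[(purchases["date"] >= cutoff) & (purchases["date"] < window_end)]

    features = past.groupby("customer_id").agg(
        n_purchases=("amount", "size"),
        total_spent=("amount", "sum"),
        days_since_last=("date", lambda d: (cutoff - d.max()).days),
    )
    # Churn target: no purchase at all during the horizon after the cutoff.
    churned = ~features.index.isin(window["customer_id"].unique())
    return features, pd.Series(churned, index=features.index, name="churn")

present = pd.Timestamp("2015-09-01")
# X_train, y_train = make_dataset(purchases, present - pd.DateOffset(months=8))
# X_valid, y_valid = make_dataset(purchases, present - pd.DateOffset(months=4))
```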
• More complex designs • Graph sampling (fraud rings?) • Random sampling within a client's / machine's lifetime • Mix of column-based and time-based splits …
• The rule: 1) What is the problem? 2) To what would I like my model to generalize?
The future? Other individuals? … 3) => That defines the train / test split
Train / test split
• Predictive Maintenance problem
• Objective: predict failure in the next 3 days.
• Metric is proportional to accuracy (and 0.57 is the best score!) • Link to data: https://www.phmsociety.org/events/conference/phm/14/data-challenge
EX PHM Society (Fail example)
• How to design the evaluation scheme? • What is the probability that an asset fails in the next 3 days from now? -> classification problem -> time-based split -> but how do I create a train and a test set? • Choose a date and evaluate what happens 3 days later? -> problem: not enough failures happening • Choose several dates for each asset? -> beware of asset over-fitting • In the challenge: random selection of (asset, date) pairs in the future + over-sampling of failures (see the sampling sketch below).
EX PHM Society
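A possible sketch of that sampling strategy, assuming a failures table with asset_id and failure_date columns; this is an illustration, not the actual challenge code:

```python
# Sample (asset, date) pairs and label whether a failure occurs within the
# next 3 days; positives are then over-sampled to rebalance the set.
import numpy as np
import pandas as pd

def sample_examples(failures, assets, dates, n_samples, horizon_days=3, seed=0):
    rng = np.random.default_rng(seed)
    rows = []
    for _ in range(n_samples):
        asset = assets[rng.integers(len(assets))]
        date = dates[rng.integers(len(dates))]
        asset_failures = failures.loc[failures["asset_id"] == asset, "failure_date"]
        label = ((asset_failures >= date) &
                 (asset_failures < date + pd.Timedelta(days=horizon_days))).any()
        rows.append({"asset_id": asset, "date": date, "fails_in_3_days": bool(label)})
    df = pd.DataFrame(rows)
    # Over-sample the (rare) positive examples.
    positives = df[df["fails_in_3_days"]]
    return pd.concat([df, positives, positives], ignore_index=True)
```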
• Random Sampling
EX PHM Society
This is decent! “With some more work I could have a model that beats randomness enough to be useful”
• Beware of the distribution of your features! • Is there a time dependency?
• Ex: counts, sums, … that will only increase with time • -> Calculate counts and sums rescaled by time / in moving windows instead (see the rolling-window sketch below). • Can be found in churn, fraud detection, ad click prediction,…
• A categorical variable dependency? • Ex : email flag in fraud detection
• Is there a Network dependency? • Ex : Fraud / Bot detection (network features can be useful but leaky)
Feature Engineering
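A minimal sketch of trading an ever-growing count for a moving-window count, assuming a per-user event table (column names are illustrative):

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 1],
    "date": pd.to_datetime(["2015-01-01", "2015-02-15", "2015-03-10",
                            "2015-03-20", "2015-06-01"]),
}).sort_values(["user_id", "date"])

# Cumulative count: only grows with time, so its distribution drifts
# between the (earlier) train period and the (later) test period.
events["clicks_total"] = events.groupby("user_id").cumcount() + 1

# 30-day rolling count: comparable wherever the train/test cutoff falls.
rolled = (events.set_index("date")
                .groupby("user_id")["user_id"]
                .rolling("30D").count())
events["clicks_30d"] = rolled.values  # rows already ordered by (user_id, date)
```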
• Final trick: - Stack train and test and add an is_test boolean - Try to predict is_test - Check whether the model is able to predict it - If so:
- check the feature importances - remove / modify the offending features and iterate (see the sketch below)
Feature Engineering
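A minimal sketch of that final trick (often called adversarial validation), assuming train and test are pandas DataFrames with the same numeric feature columns:

```python
# If a classifier separates train from test well, the most important
# features are the ones whose distribution drifts between the two.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def check_train_test_drift(train: pd.DataFrame, test: pd.DataFrame):
    stacked = pd.concat([train.assign(is_test=0), test.assign(is_test=1)],
                        ignore_index=True)
    X = stacked.drop(columns="is_test")
    y = stacked["is_test"]

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    # AUC ~ 0.5 -> train and test look alike; AUC >> 0.5 -> drifting features.
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

    clf.fit(X, y)
    importances = pd.Series(clf.feature_importances_, index=X.columns)
    return auc, importances.sort_values(ascending=False)
```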
• “Threshold-dependent”: • Accuracy • Precision and Recall • F1 score
• “Threshold-independent”: • AUC • Log Loss • Others (Mean Average Precision)…
• Custom metrics
Evaluation metric : Classification
Callouts on the metrics above: Not good if unbalanced target / When you have an ordering problem / When you are going stochastic / When you need to stick to business / Accuracy alternative
• Custom metrics • Cost based
• Ex: Fraud: • Mean loss of $50 per missed fraud (FN) • Mean loss of $20 per wrongly cancelled transaction (FP)
• F1 score is often used in papers • In practice, you often have a business cost (see the cost sketch below)
Evaluation metric : Classification
(Confusion matrix quadrants: TP, FN, FP, TN)
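A minimal sketch of such a cost-based metric with the $50 / $20 figures above; the 0.5 threshold is only an illustrative assumption:

```python
# Business cost: each missed fraud (FN) costs $50 on average, each wrongly
# cancelled transaction (FP) costs $20.
import numpy as np
from sklearn.metrics import confusion_matrix

def business_cost(y_true, y_score, threshold=0.5, cost_fn=50.0, cost_fp=20.0):
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fn * cost_fn + fp * cost_fp

# Pick the threshold that minimizes the expected cost on validation data:
# best_t = min(np.linspace(0, 1, 101), key=lambda t: business_cost(y_val, scores, t))
```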
• Custom metrics • Fraud Example 1:
• “I have fraudsters on my e-business website” • I generate a score for each transaction • Transactions with a score higher than a threshold are reviewed manually • 1 person does this full-time and can deal with 100 transactions / day • The rest is automatically accepted
-> AUC is not bad -> Recall within the 100 reviewed transactions / day -> Total money blocked in the 100 transactions / day. In practice AUC is more stable… but the money metric can also be used for communication (see the sketch below).
Evaluation metric : Classification
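A minimal sketch of the “recall in the 100 reviewed transactions per day” idea, assuming a scored-transactions table (column names are illustrative):

```python
# Each day, only the 100 highest-scoring transactions are checked manually,
# so we measure how many frauds fall inside that daily top-100.
import pandas as pd

def daily_top_k_recall(df: pd.DataFrame, k: int = 100,
                       score_col="score", label_col="is_fraud", day_col="day"):
    reviewed = (df.sort_values(score_col, ascending=False)
                  .groupby(day_col)
                  .head(k))                      # top-k per day by score
    caught = reviewed[label_col].sum()           # frauds inside the daily top-k
    total = df[label_col].sum()                  # all frauds
    return caught / total if total else float("nan")
```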
• Custom metrics • Fraud Example 2:
• “I have fraudsters on my e-business website” • I generate a score for each transaction • I handle this automatically by blocking all transactions with score higher than threshold
-> AUC is not bad… but it does not give you the threshold value. -> F1 score? -> Cost-based is better
Evaluation metric : Classification
• My cheat sheet
Evaluation metric : Classification
Metric        | Optimized by ML models? | Threshold-dependent? | Application examples
Accuracy      | yes                     | yes                  | image classification, NLP, …
F1 score      | no                      | yes                  | papers?
AUC           | no                      | no                   | fraud detection, churn, healthcare, …
Log loss      | yes                     | no                   | ad click prediction
Custom metric | no                      | ?                    | all?
• The business question dictates the evaluation scheme! • test set design • evaluation metric • It indirectly impacts feature engineering • It indirectly impacts label quality
• Think (not too much) before coding
• Don’t try to optimize the wrong problem!
Conclusion