Winning Data Science Competitions
3. 29. 2017
Jeong-Yoon Lee, Ph.D.
Chief Data Scientist, Conversion Logic
70+ Competitions
6-Time Prize Winner (KDD Cup 2012 & 2015)
8 Top 10 Finishes (Deloitte, AARP, Liberty Mutual)
Top 10, Kaggle 2015
Father of 4 boys
About Conversion Logic
Advanced Marketing Attribution For Diverse Customers
Why Data Science Competition
Why Compete
For fun
For experience
For learning
For networking
Fun
Competing with others
Continuous improvement
Experience
Learning
Networking
Data Science Competitions
Since 1997
2006 - 2009
Since 2010
Competition Structure
Training Data: Feature and Label, both provided
Test Data: Feature provided, Label withheld
Submission (predictions for Test Data) -> Public LB Score, Private LB Score
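The structure above can be sketched in a few lines of Python. This is only an illustration, not how any particular platform scores submissions: the accuracy metric, the 30% public fraction, the toy labels, and the function names `accuracy` and `leaderboard_scores` are all assumptions made for the example.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def leaderboard_scores(y_true, y_pred, public_fraction=0.3, seed=0):
    """Score one submission on a fixed public/private split of the test set.

    The split and the metric are assumptions for this sketch.
    """
    indices = list(range(len(y_true)))
    # A fixed seed gives every submission the same public/private split.
    random.Random(seed).shuffle(indices)
    n_public = int(len(indices) * public_fraction)
    public_idx, private_idx = indices[:n_public], indices[n_public:]
    public = accuracy([y_true[i] for i in public_idx],
                      [y_pred[i] for i in public_idx])
    private = accuracy([y_true[i] for i in private_idx],
                       [y_pred[i] for i in private_idx])
    return public, private

# Toy data: test labels known only to the organizer, and one submission.
hidden_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
submission = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

public_score, private_score = leaderboard_scores(hidden_labels, submission)
print("public LB:", public_score, "private LB:", private_score)
```

Participants see only the public score while the competition runs, so a model tuned too closely to it can drop on the private score.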
Kaggle
250+ competitions since 2010
900K users
50K+ competitors
$3MM+ prize paid out
Misconceptions on Competitions
No ETL
No EDA
Not worth it
Not for production
No ETL? - Deloitte Western Australia Rental Prices
No ETL? - Outbrain Click Prediction
2B page views. 16.9MM clicks. 700MM users. 560 sites
No ETL? - YouTube-8M Video Understanding Challenge
1.7TB feature-level data. 31GB video-level data.
No EDA?
Most competitions provide actual labels - typical EDA
Anonymized data - more creative EDA
o People decode age, states, time intervals, income, etc.
Not worth it?
Performance matters
You walk easier when you can run
Not for Production?
Kaggle Kernel
o Max execution time: 10 minutes
o Max file output: 500MB
o Memory limit: 8GB
Ensemble Pipeline at Conversion Logic
Best Practices
Feature Engineering
Diverse Algorithms
Cross Validation
Ensemble
Collaboration
Feature Engineering
Types                     Note
Numerical                 Log, Log2(1 + x), Box-Cox, Normalization, Binning
Categorical               One-hot-encoding, Label-encoding, Count, Weight-of-Evidence
Text                      Bag-of-Words, TF-IDF, N-gram, Character-n-gram, K-skip-n-gram
Timeseries/Sensor data    Descriptive Statistics, Derivatives, FFT, MFCC, ERP
Network Graph             Degree, Closeness, Betweenness, PageRank
Numerical/Timeseries      Convert to categorical features using RF/GBM
Dimensionality Reduction  PCA, SVD, Autoencoder, Hashing Trick
Interaction               Addition/subtraction/multiplication/division, Hashing Trick
* More comprehensive overview on feature engineering by HJ van Veen: https://www.slideshare.net/HJvanVeen/feature-engineering-72376750
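As a rough illustration of a few entries for numerical and categorical features, here is a minimal sketch in plain Python. The function names and the toy `prices` and `cities` data are hypothetical, and the natural-log form of log(1 + x) is an assumption; in practice a library such as Scikit-Learn or pandas would normally do this work.

```python
import math
from collections import Counter

def log1p_transform(values):
    """Numerical: log(1 + x), which compresses long right tails (x > -1)."""
    return [math.log1p(v) for v in values]

def label_encode(values):
    """Categorical: map each distinct category to an integer, in sorted order."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Categorical: one 0/1 column per distinct category, in sorted order."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return rows, categories

def count_encode(values):
    """Categorical: replace each category with how often it occurs."""
    counts = Counter(values)
    return [counts[v] for v in values]

# Hypothetical toy columns.
prices = [0, 9, 99]
cities = ["perth", "sydney", "perth", "darwin"]

log_prices = log1p_transform(prices)
labels, mapping = label_encode(cities)
one_hot, categories = one_hot_encode(cities)
counts = count_encode(cities)
```

Label and count encoding keep a single column, which tends to suit tree-based models, while one-hot encoding adds one column per category and tends to suit linear models.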
Diverse Algorithms
Algorithm                          Tool                         Note
Gradient Boosting Machine          XGBoost, LightGBM            The most popular algorithm in competitions
Random Forests                     Scikit-Learn, randomForest   Used to be popular before GBM
Extremely Randomized Trees         Scikit-Learn
Neural Networks/Deep Learning      Keras, MXNet, Torch, CNTK    Blends well with GBM. Best at image and speech recognition competitions
Logistic/Linear Regression         Scikit-Learn, Vowpal Wabbit  Fastest. Good for ensemble.
Support Vector Machine             Scikit-Learn
FTRL                               Vowpal Wabbit                Competitive solution for CTR estimation competitions
Factorization Machine              libFM, fastFM                Winning solution for KDD Cup 2012
Field-aware Factorization Machine  libFFM                       Winning solution for CTR estimation competitions (Criteo, Avazu)
Cross Validation
Training data are split into five folds where the sample size and dropout rate are preserved (stratified).
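The stratified split described above can be sketched in plain Python. This is a minimal illustration, not the code used in the competition: the function name `stratified_kfold`, the seed, and the toy labels (20% positive, standing in for dropout labels) are assumptions.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, n_folds=5, seed=42):
    """Assign each sample a fold so every fold keeps roughly the same label ratio."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds
    return fold_of

# Toy data: 100 samples, 20 of them positive.
labels = [1] * 20 + [0] * 80
fold_of = stratified_kfold(labels)

for fold in range(5):
    valid_idx = [i for i, f in enumerate(fold_of) if f == fold]
    train_idx = [i for i, f in enumerate(fold_of) if f != fold]
    positive_rate = sum(labels[i] for i in valid_idx) / len(valid_idx)
    print(fold, len(train_idx), len(valid_idx), positive_rate)
```

Fixing the seed keeps the fold assignment reproducible, so scores from different models are measured on the same validation samples.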
Ensemble - Stacking
* For other types of ensemble, see http://mlwave.com/kaggle-ensembling-guide/
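A minimal sketch of the stacking idea, in plain Python: each base model predicts every training sample from a fit that never saw that sample (out-of-fold predictions), and a second-stage model is trained on those predictions. The two base models, the single blend weight used as the second stage, the toy data, and all function names are assumptions for illustration, not the setup of any actual solution.

```python
import random

def fit_line(xs, ys):
    """Base model 1: ordinary least-squares line; returns a predict function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return lambda x: mean_y + slope * (x - mean_x)

def fit_knn(xs, ys, k=3):
    """Base model 2: average target of the k nearest training points."""
    points = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

def out_of_fold_predictions(fit, xs, ys, fold_of, n_folds):
    """Predict each sample with a model that never saw it during training."""
    oof = [0.0] * len(xs)
    for fold in range(n_folds):
        train = [i for i, f in enumerate(fold_of) if f != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i, f in enumerate(fold_of):
            if f == fold:
                oof[i] = model(xs[i])
    return oof

def fit_blend_weight(p1, p2, ys):
    """Second stage: least-squares weight w for the blend w * p1 + (1 - w) * p2."""
    num = sum((a - b) * (y - b) for a, b, y in zip(p1, p2, ys))
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    return num / den if den else 0.5

def mse(preds, ys):
    """Mean squared error."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

# Hypothetical toy regression data: a noisy curve.
rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [x * x + rng.gauss(0, 0.5) for x in xs]
n_folds = 5
fold_of = [i % n_folds for i in range(len(xs))]

# Stage 1: out-of-fold predictions from each base model.
oof_line = out_of_fold_predictions(fit_line, xs, ys, fold_of, n_folds)
oof_knn = out_of_fold_predictions(fit_knn, xs, ys, fold_of, n_folds)

# Stage 2: train the blender on the out-of-fold predictions.
weight = fit_blend_weight(oof_line, oof_knn, ys)
blend = [weight * a + (1 - weight) * b for a, b in zip(oof_line, oof_knn)]

# For new data: refit the base models on all training data, then blend.
final_line, final_knn = fit_line(xs, ys), fit_knn(xs, ys)

def stacked_predict(x):
    return weight * final_line(x) + (1 - weight) * final_knn(x)
```

Training the second stage on out-of-fold predictions rather than in-sample predictions keeps it from favoring whichever base model overfits the training data most.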
KDDCup 2015 Solution
Collaboration
Collaboration – Git Repo + S3/Dropbox
Collaboration – Common Validation
Collaboration – Internal Leaderboard
Why Competition
For fun
For experience
For learning
For networking
Best Practices
Feature Engineering
Diverse Algorithms
Cross Validation
Ensemble
Collaboration
Things That Help
Keep competition journals and repos – both during and after competitions
Build and improve the automated pipeline and library for competitions
• https://github.com/jeongyoonlee/Kaggler
• https://gitlab.com/jeongyoonlee/allstate-claims-severity/tree/master
• http://kaggler.com/kagglers-toolbox-setup/
Be humble, and ready to try and learn something new
Make a commitment and work on competitions regularly, no matter what
Resources
No Free Hunch by Kaggle
Winning Tips on Machine Learning Competitions by Marios Michailidis (KazAnova)
Feature Engineering, mlwave.com by HJ van Veen (Triskelion)
fastml.com by Zygmunt Zając (Foxtrot)
kaggler.com, facebook.com/Kaggler by Jeong-Yoon Lee @ CL and Hang Li @ Hulu
Tianqi Chen @ UW – Won KDDCup 2012, DSB 2015. Author of XGBoost, MXNet
Gilberto Titericz Junior in San Francisco - #1 at Kaggle
Active Competitions
Kaggle – 6 Featured, 1 Job Competitions
KDD Cup 2017
RecSys Challenge 2017
CIKM AnalytiCup 2017
Thank You