Winning Data Science Competitions 3. 29. 2017 Jeong-Yoon Lee, Ph.D.

Page 1: Winning Data Science Competitions

Winning Data Science Competitions

3. 29. 2017

Jeong-Yoon Lee, Ph.D.

Page 2: Winning Data Science Competitions

Chief Data Scientist, Conversion Logic

70+ Competitions

6-Time Prize Winner (KDD Cup 2012 & 2015)

8 Top 10 Finishes (Deloitte, AARP, Liberty Mutual)

Top 10, Kaggle 2015

Father of 4 boys

Jeong-Yoon Lee, Ph.D.

Page 3: Winning Data Science Competitions

About Conversion Logic

Advanced Marketing Attribution For Diverse Customers

Page 4: Winning Data Science Competitions

Why Data Science Competition

Page 5: Winning Data Science Competitions

Why Compete

For fun

For experience

For learning

For networking

Page 6: Winning Data Science Competitions

Fun

Competing with others

Continuous improvement

Page 7: Winning Data Science Competitions

Experience

Page 8: Winning Data Science Competitions

Learning

Page 9: Winning Data Science Competitions

Learning

Page 10: Winning Data Science Competitions

Networking

Page 11: Winning Data Science Competitions

Page 12: Winning Data Science Competitions

Data Science Competitions

Page 13: Winning Data Science Competitions

Data Science Competitions

Since 1997

2006 - 2009

Since 2010

Page 14: Winning Data Science Competitions

Competition Structure

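Training Data - Feature and Label are both Provided

Test Data - Feature is Provided; the Label is predicted in the Submission, which is scored as the Public LB Score and the Private LB Score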

Page 15: Winning Data Science Competitions

Kaggle

250+ competitions since 2010

900K users

50K+ competitors

$3MM+ in prizes paid out

Page 16: Winning Data Science Competitions

Kaggle

Page 17: Winning Data Science Competitions

Kaggle

Page 18: Winning Data Science Competitions

Misconceptions on Competitions

Page 19: Winning Data Science Competitions

Misconceptions on Competitions

No ETL

No EDA

Not worth it

Not for production

Page 20: Winning Data Science Competitions

No ETL? - Deloitte Western Australia Rental Prices

Page 21: Winning Data Science Competitions

No ETL? - Outbrain Click Prediction

2B page views. 16.9MM clicks. 700MM users. 560 sites

Page 22: Winning Data Science Competitions

No ETL? - YouTube-8M Video Understanding Challenge

1.7TB feature-level data. 31GB video-level data.

Page 23: Winning Data Science Competitions

No ETL?

Page 24: Winning Data Science Competitions

No EDA?

Most competitions provide actual labels - typical EDA

Anonymized data - more creative EDA

o People decode age, states, time intervals, income, etc.

Page 25: Winning Data Science Competitions

No EDA?

Anonymized data - more creative EDA

Page 26: Winning Data Science Competitions

Not worth it?

Performance matters

You walk easier when you can run

Page 27: Winning Data Science Competitions

Not for Production?

Kaggle Kernel

o Max execution time: 10 minutes

o Max file output: 500MB

o Memory limit: 8GB

Page 28: Winning Data Science Competitions

Ensemble Pipeline at Conversion Logic

Page 29: Winning Data Science Competitions

Best Practices

Page 30: Winning Data Science Competitions

Best Practices

Feature Engineering

Diverse Algorithms

Cross Validation

Ensemble

Collaboration

Page 31: Winning Data Science Competitions

Feature Engineering

Types: Note

Numerical: Log, Log2(1 + x), Box-Cox, Normalization, Binning

Categorical: One-hot-encoding, Label-encoding, Count, Weight-of-Evidence

Text: Bag-of-Words, TF-IDF, N-gram, Character-n-gram, K-skip-n-gram

Timeseries/Sensor data: Descriptive Statistics, Derivatives, FFT, MFCC, ERP

Network Graph: Degree, Closeness, Betweenness, PageRank

Numerical/Timeseries: Convert to categorical features using RF/GBM

Dimensionality Reduction: PCA, SVD, Autoencoder, Hashing Trick

Interaction: Addition/subtraction/multiplication/division, Hashing Trick

* More comprehensive overview on feature engineering by HJ van Veen: https://www.slideshare.net/HJvanVeen/feature-engineering-72376750
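
A few of the numerical and categorical transforms listed above can be sketched in plain Python. This is a minimal illustration, not the speaker's code: the helper names and the toy `prices` and `cities` columns are made up for the example, and the numerical transform shown is the natural-log form log(1 + x).

```python
import math
from collections import Counter

def log1p_transform(values):
    """Numerical: log(1 + x) compresses long right tails (x must be > -1)."""
    return [math.log1p(v) for v in values]

def label_encode(values):
    """Categorical: map each category to an integer in order of first appearance."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [mapping[v] for v in values]

def count_encode(values):
    """Categorical: replace each category with its frequency in the column."""
    counts = Counter(values)
    return [counts[v] for v in values]

def one_hot_encode(values):
    """Categorical: one binary column per category, columns sorted by name."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical toy columns, for illustration only.
prices = [0, 9, 99]
cities = ["perth", "sydney", "perth", "darwin"]

log_prices = log1p_transform(prices)
city_labels = label_encode(cities)
city_counts = count_encode(cities)
city_one_hot = one_hot_encode(cities)
```

In a real competition the encodings would normally be fit on training data and then applied to test data, so that categories unseen in training are handled explicitly.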

Page 32: Winning Data Science Competitions

Diverse Algorithms

Algorithm (Tool): Note

Gradient Boosting Machine (XGBoost, LightGBM): The most popular algorithm in competitions

Random Forests (Scikit-Learn, randomForest): Used to be popular before GBM

Extremely Randomized Trees (Scikit-Learn)

Neural Networks/Deep Learning (Keras, MXNet, Torch, CNTK): Blends well with GBM. Best at image and speech recognition competitions

Logistic/Linear Regression (Scikit-Learn, Vowpal Wabbit): Fastest. Good for ensemble.

Support Vector Machine (Scikit-Learn)

FTRL (Vowpal Wabbit): Competitive solution for CTR estimation competitions

Factorization Machine (libFM, fastFM): Winning solution for KDD Cup 2012

Field-aware Factorization Machine (libFFM): Winning solution for CTR estimation competitions (Criteo, Avazu)

Page 33: Winning Data Science Competitions

Cross Validation

Training data are split into five folds where the sample size and dropout rate are preserved (stratified).
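
A stratified five-fold split of this kind can be sketched in plain Python as follows. The function name, the seed, and the toy labels (a 20% positive rate standing in for the dropout rate) are assumptions for illustration; libraries such as scikit-learn provide an equivalent `StratifiedKFold`.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, n_folds=5, seed=42):
    """Return a fold index (0..n_folds-1) per sample, keeping each class's
    share roughly equal across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [0] * len(labels)
    for y in sorted(by_class):
        indices = by_class[y]
        rng.shuffle(indices)
        # Deal the shuffled samples of this class round-robin into the folds.
        for pos, i in enumerate(indices):
            folds[i] = pos % n_folds
    return folds

# Hypothetical toy labels: 20 positives (e.g. dropout = 1) out of 100 samples.
labels = [1] * 20 + [0] * 80
folds = stratified_kfold(labels)

for k in range(5):
    valid_idx = [i for i, f in enumerate(folds) if f == k]
    train_idx = [i for i, f in enumerate(folds) if f != k]
    rate = sum(labels[i] for i in valid_idx) / len(valid_idx)
    print(f"fold {k}: train={len(train_idx)} valid={len(valid_idx)} positive rate={rate:.2f}")
```

Saving the fold assignment once and reusing it for every model keeps validation scores comparable across models and team members.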

Page 34: Winning Data Science Competitions
Page 35: Winning Data Science Competitions

Ensemble - Stacking

* For other types of ensemble, see http://mlwave.com/kaggle-ensembling-guide/
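
As a rough sketch of the general stacking idea (not the pipeline shown in the talk): stage-1 models produce out-of-fold predictions on the training data, and a stage-2 model is then trained on those predictions. Everything below is an assumption for illustration: the two single-feature threshold models, the constrained blend weight used as the stage-2 learner, and the synthetic data.

```python
import random

def fit_threshold_model(X, y, feature):
    """Stage-1 learner: split one feature at its training median and
    predict the mean label on each side."""
    values = sorted(row[feature] for row in X)
    threshold = values[len(values) // 2]
    overall = sum(y) / len(y)
    low = [t for row, t in zip(X, y) if row[feature] < threshold]
    high = [t for row, t in zip(X, y) if row[feature] >= threshold]
    low_mean = sum(low) / len(low) if low else overall
    high_mean = sum(high) / len(high) if high else overall
    return lambda row: high_mean if row[feature] >= threshold else low_mean

def out_of_fold_predictions(X, y, feature, folds, n_folds):
    """Predict each sample with a model that never saw it during training."""
    oof = [0.0] * len(X)
    for k in range(n_folds):
        train = [i for i, f in enumerate(folds) if f != k]
        model = fit_threshold_model([X[i] for i in train], [y[i] for i in train], feature)
        for i, f in enumerate(folds):
            if f == k:
                oof[i] = model(X[i])
    return oof

def fit_blend_weight(pred_a, pred_b, y, steps=100):
    """Stage-2 learner: pick w in [0, 1] minimising the squared error of
    w * pred_a + (1 - w) * pred_b on the out-of-fold predictions."""
    def sse(w):
        return sum((w * a + (1 - w) * b - t) ** 2 for a, b, t in zip(pred_a, pred_b, y))
    return min((s / steps for s in range(steps + 1)), key=sse)

# Synthetic data for illustration only.
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [1 if 0.7 * x0 + 0.3 * x1 > 0.5 else 0 for x0, x1 in X]
n_folds = 5
folds = [i % n_folds for i in range(len(X))]  # same folds for every stage-1 model

oof_a = out_of_fold_predictions(X, y, 0, folds, n_folds)
oof_b = out_of_fold_predictions(X, y, 1, folds, n_folds)
weight = fit_blend_weight(oof_a, oof_b, y)
print(f"stage-2 weight on model A: {weight:.2f}")
```

Using out-of-fold predictions matters because the stage-2 model would otherwise be trained on stage-1 outputs that have already seen the labels. In practice the stage-2 learner is usually a full model, such as a logistic regression or a GBM, rather than a single blend weight.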

Page 36: Winning Data Science Competitions

KDDCup 2015 Solution

Page 37: Winning Data Science Competitions

Collaboration

Page 38: Winning Data Science Competitions

Collaboration – Git Repo + S3/Dropbox

Page 39: Winning Data Science Competitions

Collaboration – Common Validation

Page 40: Winning Data Science Competitions

Collaboration – Internal Leaderboard

Page 41: Winning Data Science Competitions

Why Competition

For fun

For experience

For learning

For networking

Best Practices

Feature Engineering

Diverse Algorithms

Cross Validation

Ensemble

Collaboration

Page 42: Winning Data Science Competitions

Things That Help

Keep competition journals and repos – both during and after competitions

Build and improve the automated pipeline and library for competitions

• https://github.com/jeongyoonlee/Kaggler

• https://gitlab.com/jeongyoonlee/allstate-claims-severity/tree/master

• http://kaggler.com/kagglers-toolbox-setup/

Be humble, and ready to try and learn something new

Make a commitment and work on competitions regularly, no matter what

Page 43: Winning Data Science Competitions

Resources

No Free Hunch by Kaggle

Winning Tips on Machine Learning Competitions by Marios Michailidis (KazAnova)

Feature Engineering, mlwave.com by HJ van Veen (Triskelion)

fastml.com by Zygmunt Zając (Foxtrot)

kaggler.com, facebook.com/Kaggler by Jeong-Yoon Lee @ CL and Hang Li @ Hulu

Tianqi Chen @ UW – Won KDDCup 2012, DSB 2015. Author of XGBoost, MXNet

Gilberto Titericz Junior in San Francisco - #1 at Kaggle

Page 44: Winning Data Science Competitions

Active Competitions

Kaggle – 6 Featured, 1 Job Competitions

KDD Cup 2017

RecSys Challenge 2017

CIKM AnalytiCup 2017

Page 45: Winning Data Science Competitions

Thank You