Stories Behind Kaggle Competitions with Wendy Kan from Kaggle

stories behind kaggle competitions

wendy kan, data scientist

@wendykan

5/19/2015 @

kaggle runs public machine learning competitions

we worked with clients/hosts on various types of problems and data of different sizes

my job as a data scientist at kaggle

“data science is not just kaggle competitions”

whyyyy???

machine learning processes

● Business Problem
● Collect Data
● Transform Data
● Dataset Splitting
● Evaluation Metric
● Feature Extraction
● Feature Selection
● Model Training
● Model Ensembling
● Methodology Selection
● Production System
● Ongoing Optimization

not every problem can be turned into a kaggle competition

size matters! where bigger is better (most of the time)

data cleaning/formatting:

● easy to make a quick submission
● boosts participation
● (too) clean data kills creativity

data privacy/anonymization

metric: how do you measure success?

● Classification - AUC/Logarithmic Loss/Accuracy

● Regression - RMSE/MAE

● Ranking - MAP/NDCG

● Other / Custom

https://www.kaggle.com/wiki/Metrics
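
As a rough illustration of the metrics named above, here is a minimal pure-Python sketch of accuracy, binary logarithmic loss, AUC, RMSE and MAE. The function names and the clipping constant `eps` are assumptions made for this example, not Kaggle's scoring code; the ranking metrics (MAP/NDCG) are left out to keep it short.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of labels predicted exactly right."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary logarithmic loss. Probabilities are clipped to avoid log(0);
    eps is an illustrative choice."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)

def auc(y_true, scores):
    """AUC as the chance that a random positive is scored above a random
    negative (ties count half). Quadratic time, fine for small examples."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

RMSE squares each error, so one large miss costs more than under MAE; that kind of difference is one reason the choice of metric matters to a host.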

the design of a competition shapes how people are going to solve a problem

Splitting dataset

● training/test

● public/private
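
The two splits above can be sketched as follows: first hold out a test set from the training data, then divide the test set into a public part (scored on the live leaderboard) and a private part (used for final standings). The function name `split_rows`, the fractions and the seed are hypothetical values chosen for illustration, not the proportions used in any particular competition.

```python
import random

def split_rows(rows, test_fraction=0.3, public_fraction=0.3, seed=0):
    """Shuffle rows, hold out a test set, then divide that test set into
    public and private leaderboard parts. All fractions are illustrative."""
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the caller's list is left untouched
    rng.shuffle(shuffled)

    n_test = round(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    n_public = round(len(test) * public_fraction)
    public, private = test[:n_public], test[n_public:]
    return train, public, private

rows = list(range(100))
train, public, private = split_rows(rows)
print(len(train), len(public), len(private))
```

A random shuffle like this assumes rows are independent of each other; grouped or time-ordered data usually needs a different splitting rule.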

Time series data
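
The slide gives only the heading, so the following is an assumption about what it implies for splitting: with time-ordered data, a common approach is to train on records before a cutoff and test on records at or after it, so the model never sees the future. The record layout (`timestamp` key) and the function name are hypothetical.

```python
from datetime import date

def chronological_split(records, cutoff):
    """Train on records stamped before the cutoff, test on the rest.
    Assumes each record is a dict with a 'timestamp' key (hypothetical layout)."""
    ordered = sorted(records, key=lambda r: r["timestamp"])
    train = [r for r in ordered if r["timestamp"] < cutoff]
    test = [r for r in ordered if r["timestamp"] >= cutoff]
    return train, test

records = [
    {"timestamp": date(2015, 3, 1), "value": 12},
    {"timestamp": date(2015, 1, 1), "value": 10},
    {"timestamp": date(2015, 4, 1), "value": 13},
    {"timestamp": date(2015, 2, 1), "value": 11},
]
train, test = chronological_split(records, cutoff=date(2015, 3, 1))
```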

data leakage

“Deemed ‘one of the top ten data mining mistakes’, leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from”

“the concept of identifying and harnessing leakage has been openly addressed as one of three key aspects for winning data mining competitions”

“Leakage in Data Mining: Formulation, Detection, and Avoidance”, S. Kaufman et al.
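
One simple way to look for leakage before releasing a dataset is to check whether any single column predicts the target almost perfectly on its own. The sketch below is a heuristic written for illustration, not the method from the Kaufman et al. paper; the function names, the threshold of 0.95 and the toy columns are all assumptions.

```python
def single_feature_auc(labels, values):
    """AUC of one raw feature used directly as a score for a binary target."""
    pos = [v for y, v in zip(labels, values) if y == 1]
    neg = [v for y, v in zip(labels, values) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def flag_suspicious_features(labels, features, threshold=0.95):
    """Return names of features whose AUC is near 1 (or near 0, which is
    equally informative once the sign is flipped)."""
    flagged = []
    for name, values in features.items():
        score = single_feature_auc(labels, values)
        if max(score, 1 - score) >= threshold:
            flagged.append(name)
    return flagged

# Toy data: row_id happens to be sorted so that positives come last,
# which is the kind of accidental signal a host would want to catch.
labels = [0, 0, 1, 1, 0, 1]
features = {
    "row_id": [1, 2, 10, 11, 3, 12],
    "age": [30, 45, 28, 50, 41, 33],
}
print(flag_suspicious_features(labels, features))
```

A flagged column is only a prompt for a closer look: a feature can be strongly predictive for legitimate reasons, and leaks that involve several columns at once will not show up in a one-column check.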

do you have thousands of people reviewing your performance at work 24/7?

I do.

1. people make mistakes. honesty is the best policy.

2. crowdsourcing is powerful. anything that can go wrong will go wrong.
