
Page 1: How to Win Machine Learning Competitions ?

How to Win Machine Learning competitions By Marios Michailidis

It’s not the destination…it’s the journey!

Page 2: How to Win Machine Learning Competitions ?

What is Kaggle
• The world's biggest predictive modelling competition platform
• Half a million members
• Companies host data challenges
• Usual tasks include:
  – Predict topic or sentiment from text
  – Predict species/type from image
  – Predict store/product/area sales
  – Predict marketing response

Page 3: How to Win Machine Learning Competitions ?

Inspired by horse races…!
• At the University of Southampton, an entrepreneur talked to us about how he was able to predict horse races with regression!

Page 4: How to Win Machine Learning Competitions ?

Was curious, wanted to learn more
• Learned statistical tools (like SAS, SPSS, R)
• Became more passionate!
• Picked up programming skills

Page 5: How to Win Machine Learning Competitions ?

Built KazAnova
• Generated a couple of algorithms and data techniques and decided to make them public so that others could benefit from them.
• Released it at http://www.kazanovaforanalytics.com/
• Named after ANOVA (statistics) and KAZANI, my mom's last name.

Page 6: How to Win Machine Learning Competitions ?

Joined Kaggle!
• Was curious about Kaggle.
• Joined a few contests and learned a lot.
• The community was very open to sharing and collaboration.

Page 7: How to Win Machine Learning Competitions ?

Other interesting tasks!
• Predict the correct answer on a science test, using Wikipedia.
• Predict which virus has infected a file.
• Predict the NCAA tournament!
• Predict the Higgs Boson.
• Out of many different numerical series, predict which one is the cause.

Page 8: How to Win Machine Learning Competitions ?

3 Years of modelling competitions
• Over 90 competitions
• Participated with 45 different teams
• 22 top-10 finishes
• 12 times prize winner
• 3 different modelling platforms
• Ranked 1st out of 480,000 data scientists

Page 9: How to Win Machine Learning Competitions ?

What's next
• PhD (UCL) on recommender systems
• More kaggling (but less intense)!

Page 10: How to Win Machine Learning Competitions ?

So… what wins competitions?

In short:
• Understand the problem
• Discipline
• Try problem-specific things or new approaches
• The hours you put in
• The right tools
• Collaboration
• Ensembling

Page 11: How to Win Machine Learning Competitions ?

More specifically…

● Understand the problem and the function to optimize
● Choose what seems to be a good algorithm for the problem and iterate:
  • Clean the data
  • Scale the data
  • Make feature transformations
  • Make feature derivations
  • Use cross validation to:
    – Make feature selections
    – Tune hyper parameters of the algorithm
● Do this for many other algorithms, always exploiting their benefits
● Find the best way to combine or ensemble the different algorithms
● Different types of models and when to use them

Page 12: How to Win Machine Learning Competitions ?

Understand the metric to optimize

● For example, in the competition it was AUC (Area Under the ROC Curve).
● This is a ranking metric.
● It shows how consistently your good cases have a higher score than your bad cases.
● Not all algorithms optimize the metric you want.
● Common metrics:
  – AUC
  – Classification accuracy
  – Precision
  – NDCG
  – RMSE
  – MAE
  – Deviance
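A minimal sketch (not from the original slides) of checking AUC on held-out predictions with scikit-learn; the labels and scores are made-up toy values:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                 # actual labels of a validation set (toy values)
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # model scores; higher means more likely positive

# AUC only cares about the ranking: 1.0 = perfect ordering, 0.5 = random
print("AUC:", roc_auc_score(y_true, y_score))
```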

Page 13: How to Win Machine Learning Competitions ?

Choose the algorithm

● In most cases, many algorithms are tried before finding the right one(s).

● Those who try more models/parameters have a higher chance of winning contests than others.

● Any algorithm that makes sense to use should be used.

Page 14: How to Win Machine Learning Competitions ?

In the algorithm: Clean the Data

● The data cleaning step is not independent of the chosen algorithm. For different algorithms, different cleaning filters should be applied.

● Treating missing values is really important; certain algorithms are more forgiving than others.

● On other occasions it may make more sense to carefully replace a missing value with a sensible one (like the average of the feature), treat it as a separate category, or even remove the whole observation.

● Similarly, we search for outliers; again, some models are more forgiving, while in others the impact of outliers is detrimental.

● How to decide the best method?
  – Try them all, or
  – Experience and literature, but mostly the first (bold)
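As a rough illustration of the missing-value options above (assuming pandas; the column names are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "city": ["London", None, "Paris", "London"]})

# Replace a missing numeric value with the feature average
df["age"] = df["age"].fillna(df["age"].mean())

# Or treat a missing categorical value as its own category
df["city"] = df["city"].fillna("MISSING")

print(df)
```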

Page 15: How to Win Machine Learning Competitions ?

In the algorithm: scaling the data

● Certain algorithms cannot deal with unscaled data.
● Scaling techniques:
  – Max scaler: divide each feature by its highest absolute value
  – Normalization: subtract the mean and divide by the standard deviation
  – Conditional scaling: scale only under certain conditions (e.g. in medicine we tend to scale per subject to make their features comparable)
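A small sketch of the first two scaling techniques using scikit-learn (toy matrix, not real data):

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [4.0, 800.0]])

X_max = MaxAbsScaler().fit_transform(X)    # divide each feature by its largest absolute value
X_std = StandardScaler().fit_transform(X)  # subtract the mean and divide by the standard deviation

print(X_max)
print(X_std)
```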

Page 16: How to Win Machine Learning Competitions ?

In the algorithm: feature transformations

● For certain algorithms there is benefit in transforming the features, because it helps them converge faster and better.
● Common transformations include:
  1. LOG, SQRT(variable): smoothens variables
  2. Dummies for categorical variables
  3. Sparse matrices, to be able to compress the data
  4. 1st derivatives: to smoothen data
  5. Weights of evidence (transforming variables using information from the target variable)
  6. Unsupervised methods that reduce dimensionality (SVD, PCA, ISOMAP, KD-tree, clustering)
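A small sketch of a few of these transformations (log, dummies, SVD) on a made-up table, assuming pandas and scikit-learn:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

df = pd.DataFrame({"income": [20000, 45000, 120000],
                   "region": ["north", "south", "north"]})

df["log_income"] = np.log1p(df["income"])    # LOG transform smoothens a skewed variable
df = pd.get_dummies(df, columns=["region"])  # dummies for the categorical variable

# Unsupervised dimensionality reduction (SVD); works on sparse matrices as well
svd = TruncatedSVD(n_components=2, random_state=1)
reduced = svd.fit_transform(df[["log_income", "region_north", "region_south"]])
print(reduced.shape)
```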

Page 17: How to Win Machine Learning Competitions ?

In the algorithm: feature derivations

● In many problems this is the most important thing. For example:
  – Text classification: generate the corpus of words and make TF-IDF features
  – Sounds: convert sounds to frequencies through Fourier transformations
  – Images: apply convolutions, e.g. break down an image into pixels and extract different parts of the image
  – Interactions: really important for some models (for our algorithms too!), e.g. have variables that show if an item is popular AND the customer likes it
  – Anything else that makes sense: similarity features, dimensionality reduction features, or even predictions from other models as features
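For the text classification case, a minimal TF-IDF sketch with scikit-learn (two made-up documents):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["machine learning competitions are fun",
        "winning competitions needs feature engineering"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document, one column per word

print(vectorizer.get_feature_names_out())
print(X.toarray())
```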

Page 18: How to Win Machine Learning Competitions ?

In the algorithm: Cross-Validation

● This basically means that from my main set I RANDOMLY create 2 sets. I build (train) my algorithm on the first one (let's call it the training set) and score the other (let's call it the validation set). I repeat this process multiple times and always check how my model performs on the validation set in respect to the metric I want to optimize.

● The process may look like:
  1. For 10 (you choose how many, X) times:
     1. Split the set into training (50%-90% of the original data)
     2. and validation (50%-10% of the original data)
     3. Then fit the algorithm on the training set
     4. Score the validation set
     5. Save the result of that scoring in respect to the chosen metric
  2. Calculate the average of these 10 (X) scores. That is how much you can expect to score in real life, and it is generally a good estimate.

● Remember to use a SEED to be able to replicate these X splits.
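A minimal sketch of this repeated random-split procedure with scikit-learn; the data and model here are placeholders, and the fixed seed makes the splits reproducible:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ShuffleSplit

X, y = make_classification(n_samples=1000, random_state=1)        # toy data for illustration
model = RandomForestClassifier(n_estimators=100, random_state=1)

scores = []
splitter = ShuffleSplit(n_splits=10, test_size=0.3, random_state=42)  # the SEED
for train_idx, valid_idx in splitter.split(X):
    model.fit(X[train_idx], y[train_idx])                 # fit on the training part
    preds = model.predict_proba(X[valid_idx])[:, 1]       # score the validation part
    scores.append(roc_auc_score(y[valid_idx], preds))     # save the chosen metric

print("mean validation AUC:", np.mean(scores))            # what to expect "in real life"
```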

Page 19: How to Win Machine Learning Competitions ?

In Cross-Validation: Do feature selection

● It is unlikely that all features are useful, and some of them may be damaging your model, because…
  – They do not provide new information (collinearity)
  – They are just not predictive (just noise)

● There are many ways to do feature selection:
  – Run the algorithm and use an internal measure to retrieve the most important features (no cross validation)
  – Forward selection, with or without cross validation
  – Backward selection, with or without cross validation
  – Noise injection
  – Hybrid of all methods

● Normally a forward method is chosen with CV. That is, we add a feature, then we split the data into the exact same X splits as before and check whether our metric improved (see the sketch below):
  – If yes, the feature remains
  – Else, Wiedersehen!
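A rough sketch of forward selection with CV; `evaluate` and `forward_selection` are illustrative helper names, not code from the talk:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def evaluate(df, y, features):
    # mean AUC over fixed CV splits for the candidate feature set
    model = RandomForestClassifier(n_estimators=100, random_state=1)
    return cross_val_score(model, df[features], y, cv=5, scoring="roc_auc").mean()

def forward_selection(df, y, candidates):
    selected, best_score = [], 0.0
    for feature in candidates:
        score = evaluate(df, y, selected + [feature])
        if score > best_score:      # the metric improved: the feature remains
            selected.append(feature)
            best_score = score
        # else: Wiedersehen!
    return selected, best_score
```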

Page 20: How to Win Machine Learning Competitions ?

In Cross-Validation: Hyper Parameter Optimization

● This generally takes a lot of time, depending on the algorithm. For example, in a random forest the best model would need to be determined based on parameters such as number of trees, max depth, minimum cases in a split, features to consider at each split, etc.

● One way to find the best parameters is to manually change one (e.g. max_depth=10 instead of 9) while keeping everything else constant. I found that using this method helps you understand more about what would work on a specific dataset.

● Another way is to try many possible combinations of hyper parameters. We normally do that with grid search, where we provide an array of all possible values to be trialled with cross validation (e.g. try max_depth {8,9,10,11} and number of trees {90,120,200,500}).
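A small sketch of the grid-search idea with scikit-learn's GridSearchCV, reusing the example values from the slide (the data here is a toy placeholder):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=1)  # toy data for illustration

param_grid = {"max_depth": [8, 9, 10, 11],
              "n_estimators": [90, 120, 200, 500]}

search = GridSearchCV(RandomForestClassifier(random_state=1),
                      param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```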

Page 21: How to Win Machine Learning Competitions ?

Train Many Algorithms

● Try to exploit their strengths.
● For example, focus on using linear features with linear regression (e.g. the higher the age, the higher the income) and non-linear ones with random forests.
● Make it so that each model tries to capture something new, or even focuses on a different part of the data.

Page 22: How to Win Machine Learning Competitions ?

Ensemble

● A key part (in winning competitions at least) is combining the various models you have made.

● Remember, even a crappy model can be useful to some small extent.

● Possible ways to ensemble (a small sketch of the first two follows below):
  – Simple average: (model1 prediction + model2 prediction)/2
  – Average ranks for AUC (simple average after converting to ranks)
  – Manually tune weights with cross validation
  – Geometric-mean weighted average
  – Meta-modelling (also called stacked generalization or stacking)

● Check GitHub for a complete example of these methods using the Amazon competition hosted by Kaggle: https://github.com/kaz-Anova/ensemble_amazon (top 60 rank).
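A minimal sketch of simple averaging and rank averaging; the prediction vectors are toy values:

```python
import numpy as np
from scipy.stats import rankdata

model1 = np.array([0.20, 0.90, 0.55, 0.40])   # predictions from model 1 (toy values)
model2 = np.array([0.30, 0.70, 0.60, 0.10])   # predictions from model 2 (toy values)

simple_avg = (model1 + model2) / 2.0

# For AUC only the ordering matters, so average ranks instead of raw scores
rank_avg = (rankdata(model1) + rankdata(model2)) / 2.0

print(simple_avg)
print(rank_avg)
```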

Page 23: How to Win Machine Learning Competitions ?

Different models I have experimented vol 1

● Logistic/linear/discriminant regression: Fast, scalable, comprehensible, solid under high dimensionality, can be memory-light. Best when relationships are linear or all features are categorical. Good for text classification too.

● Random Forests: Probably the best one-off overall algorithm out there (in my experience). Fast, scalable, memory-medium. Best when all features are numeric-continuous and there are strong non-linear relationships. Does not cope well with high dimensionality.

● Gradient Boosting (Trees): Less memory-intense than forests (as individual predictors tend to be weaker). Fast, semi-scalable, memory-medium. Good when forests are good.

● Neural Nets (AKA deep learning): Good for tasks humans are good at: image recognition, sound recognition. Good with categorical variables too (as they replicate on-and-off signals). Medium-speed, scalable, memory-light. Generally good for linear and non-linear tasks. May take a long time to train depending on structure. Many parameters to tune. Very prone to over- and under-fitting.

Page 24: How to Win Machine Learning Competitions ?

Different models I have experimented vol 2

● Support Vector Machines (SVMs): Medium-speed, not scalable, memory-intense. Still good at capturing linear and non-linear relationships. Holding the kernel matrix takes too much memory; not advisable for datasets bigger than 20k rows.

● K Nearest Neighbours: Slow (depending on the size), not easily scalable, memory-heavy. Good when deciding whether a case is good or bad really comes down to how similar it looks to specific individuals. Also good when there are many target variables, as the similarity measures remain the same across different observations. Good for text classification too.

● Naïve Bayes: Quick, scalable, memory-ok. Good for quick classifications on big datasets. Not particularly predictive.

● Factorization Machines: A good gateway between linear and non-linear problems. They stand between regressions, KNNs and neural networks. Memory-medium, semi-scalable, medium-speed. Good for predicting the rating a customer will assign to a product.

Page 25: How to Win Machine Learning Competitions ?

Tools vol 1

● Languages : Python, R, Java

● Liblinear : for linear models http://www.csie.ntu.edu.tw/~cjlin/liblinear/

● LibSvm for Support Vector machines www.csie.ntu.edu.tw/~cjlin/libsvm/

● Scikit-learn package in Python for text classification, random forests and gradient boosting machines scikit-learn.org/stable/

● Xgboost for fast scalable gradient boosting https://github.com/tqchen/xgboost

● LightGBM https://github.com/Microsoft/LightGBM

● Vowpal Wabbit hunch.net/~vw/ for fast, memory-efficient linear models

● Encog for neural nets http://www.heatonresearch.com/encog

● H2O in R for many models

Page 26: How to Win Machine Learning Competitions ?

Tools vol 2

● LibFm www.libfm.org

● LibFFM : https://www.csie.ntu.edu.tw/~cjlin/libffm/

● Weka in Java (has everything) http://www.cs.waikato.ac.nz/ml/weka/

● GraphChi for factorizations: https://github.com/GraphChi

● GraphLab for lots of stuff. https://dato.com/products/create/open_source.html

● Cxxnet: One of the best implementations of convolutional neural nets out there. Difficult to install and requires a GPU with an NVIDIA graphics card. https://github.com/antinucleon/cxxnet

● RankLib: The best library out there, made in Java, suited for ranking algorithms (e.g. rank products for customers); supports optimization functions like NDCG. people.cs.umass.edu/~vdang/ranklib.html

● Keras (http://keras.io/) and Lasagne (https://github.com/Lasagne/Lasagne) for neural nets. These assume you have Theano (http://deeplearning.net/software/theano/) or TensorFlow (https://www.tensorflow.org/).


Page 27: How to Win Machine Learning Competitions ?

Where to go next to prepare for competitions

● Coursera: Andrew Ng's class, https://www.coursera.org/course/ml

● Kaggle.com: many competitions for learning. For instance: http://www.kaggle.com/c/titanic-gettingStarted. Look for the "knowledge" flag.

● Very good slides from the University of Utah: www.cs.utah.edu/~piyush/teaching/cs5350.html

● clopinet.com/challenges/: many past predictive modelling competitions with tutorials.

● Wikipedia: not to be underestimated. Still the best source of information out there (collectively).