
Recommender System Experiments with MyMediaLite

Or: Everything you always wanted to know about offline experiments* (*but were afraid to ask)

Zeno Gantner <[email protected]>

Nokia Location & Commerce, Berlin

HERE Maps by Nokia … in Berlin

● ca. 800 people
● HERE Maps platform
  – mobile apps
    ● HERE Drive
    ● HERE Maps
    ● HERE Transit (public transport)
  – customers
    ● Yahoo Maps
    ● Bing Maps
    ● major car companies: BMW, VW, Toyota, ...

HERE Maps by Nokia … in Berlin

Maps Search Team
● #bbuzz regulars
● 3 of us contributed to Lucene 4.3.0 ;-)

http://2011.berlinbuzzwords.de/content/improving-search-ranking-through-ab-tests-case-study
http://2012.berlinbuzzwords.de/sessions/efficient-scoring-lucene
http://2012.berlinbuzzwords.de/sessions/introducing-cascalog-functional-data-processing-hadoop
http://2012.berlinbuzzwords.de/sessions/relevance-optimization-check-candidate-lists
https://issues.apache.org/jira/browse/LUCENE-4930
https://issues.apache.org/jira/browse/LUCENE-4571

(C) Paul L. Dineen; license: CC by; source http://www.flickr.com/photos/pauldineen/4529216647/sizes/o/in/photostream/

+ = ?

Data + Software/Algorithms = ???

(c) Joon Han, license: CC by-sa 3.0, source: http://en.wikipedia.org/wiki/File:Groundhog_day_tip_top_bistro.jpg

(c) Diliff; license CC by-3.0

Real-world deployments

Data mining competitions

Research

+ = ?

RecSys Experiments with MyMediaLite

1. Interaction Data

2. Baseline Methods

3. Apples and Oranges

4. Metrics

5. Hyperparameter Tuning

6. Reproducibility

Running Example: MyMediaLite

● RecSys toolkit and evaluation framework
● written in C#/Mono
● C#, Python, Ruby, F#
● 2 Java ports (RapidMiner plugin)
● regular releases (every 2-3 months) since 2010

● simple
● choice
● free
● documented
● tested

http://mymedialite.net/
http://github.com/zenogantner/MyMediaLite

Running Example: MyMediaLite

command-line tools
● rating_prediction
● item_recommendation

Find all examples here:

http://github.com/zenogantner/mml-eval-examples

1. Interaction Data

Explicit feedback

Not always there.

Implicit feedback
● views
● clicks
● purchases

Often positive-only.

1. Interaction Data

User ID Item ID Timestamp

196 242 881250949

186 302 891717742

22 377 878887116

244 51 880606923

... ... ...

item_recommendation --training-file=F1 --test-file=F2

● IDs can be (almost) arbitrary strings.
● The timestamp column is optional.
● Separator: whitespace, tab, comma, or ::
● Alternative timestamp format: yyyy-mm-dd
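The file layout described above can be read with a few lines of Python. This is only a sketch under stated assumptions: `parse_interactions` is a hypothetical helper written for illustration, not MyMediaLite's own parser, and it assumes that IDs contain none of the separator characters.

```python
import re

# Hypothetical helper, not MyMediaLite code: fields are separated by "::",
# a comma, or any run of whitespace (spaces or tabs).
_SEPARATOR = re.compile(r"::|[,\s]+")

def parse_interactions(lines):
    """Return (user_id, item_id, timestamp_or_None) tuples; IDs stay strings."""
    interactions = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        fields = _SEPARATOR.split(line)
        user_id, item_id = fields[0], fields[1]
        # The third column is optional in this sketch.
        timestamp = fields[2] if len(fields) > 2 else None
        interactions.append((user_id, item_id, timestamp))
    return interactions

sample = [
    "196 242 881250949",
    "186\t302\t891717742",
    "22,377",
    "244::51::880606923",
]
parsed = parse_interactions(sample)
```

Keeping the IDs as strings mirrors the point that IDs need not be numbers.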

Random Splits

item_recommendation … --test-ratio=0.25

Shuffle and split:

Simple, but:
● Does not take temporal trends into account.
● Does not use all data for testing.
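A minimal Python sketch of the shuffle-and-split idea, for illustration only. The function name `random_split`, its parameters, and the toy data are assumptions; MyMediaLite's own implementation may differ in details such as rounding.

```python
import random

def random_split(interactions, test_ratio=0.25, seed=1):
    """Shuffle a copy of the data and hold out the last test_ratio share."""
    rng = random.Random(seed)        # fixed seed keeps the split repeatable
    shuffled = list(interactions)    # do not modify the caller's list
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    split_at = len(shuffled) - n_test
    return shuffled[:split_at], shuffled[split_at:]

# 20 toy (user, item) pairs
events = [(user, item) for user in range(4) for item in range(5)]
train, test = random_split(events, test_ratio=0.25)
```

Every interaction ends up in exactly one of the two parts, but only the held-out quarter is ever used for testing.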

k-fold Cross-Validation

item_recommendation … --cross-validation=4

Shuffle and split:

● Uses each data point for evaluation.
● Does not take temporal trends into account.
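The k-fold procedure can be sketched in Python as follows. This is an illustrative assumption about how such a split can be built (`k_fold_splits` is a hypothetical name), not a description of MyMediaLite's internals.

```python
import random

def k_fold_splits(interactions, k=4, seed=1):
    """Yield k (train, test) pairs; each data point is in exactly one test fold."""
    rng = random.Random(seed)
    shuffled = list(interactions)
    rng.shuffle(shuffled)
    # Deal the shuffled data into k folds, round-robin.
    folds = [shuffled[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [x for j, fold in enumerate(folds) if j != held_out for x in fold]
        yield train, test

events = list(range(10))  # toy stand-ins for interactions
splits = list(k_fold_splits(events, k=4))
```

Averaging a metric over the k test folds then uses every data point for evaluation once.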

Chronological Splits

rating_prediction … --chronological-split=0.25

rating_prediction … --chronological-split=01/01/2002

Sort chronologically and split:

● Use the past to predict the “future”.
● Takes trends in the data into account.
  – time of day, day of week
  – season
  – trending products
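A sketch of the sort-and-split idea in Python. It assumes that a ratio such as 0.25 means "hold out the most recent 25% of the interactions"; the function name and the toy events are illustrative, not taken from MyMediaLite.

```python
def chronological_split(interactions, test_ratio=0.25):
    """interactions: (user, item, timestamp) tuples; the newest share becomes test."""
    ordered = sorted(interactions, key=lambda event: event[2])  # oldest first
    n_test = int(round(len(ordered) * test_ratio))
    split_at = len(ordered) - n_test
    return ordered[:split_at], ordered[split_at:]

events = [
    ("196", "242", 881250949),
    ("186", "302", 891717742),
    ("22", "377", 878887116),
    ("244", "51", 880606923),
]
train, test = chronological_split(events, test_ratio=0.25)
```

No training event is newer than any test event, so the model only ever sees the past.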

(c) Serolillo, license: CC by 2.5

2. Baseline Methods

Why compare against baselines?
● Absolute numbers have no meaning.
  – … well, at least here.
● Relative numbers may also have no meaning.
  – … if you compare to the wrong things.

Good baselines:
● the strongest solution that is still simple
● the existing solution
● standard solutions
  – collaborative filtering: kNN, vanilla matrix factorization

2. Baseline Methods

item_recommendation … --recommender=Random

item_recommendation … --recommender=MostPopular

item_recommendation … --recommender=MostPopularByAttributes --item-attributes=ARTISTS

Item recommendation baselines:
● random
● popular items (by attribute/category)
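To show how little a popularity baseline needs, here is a Python sketch. It is a stand-in written for this explanation (`most_popular` and the toy data are hypothetical), not the code behind the MostPopular recommender.

```python
from collections import Counter

def most_popular(train, k=3, exclude=()):
    """Rank items by how often they occur in (user, item) training pairs."""
    counts = Counter(item for _, item in train)
    # Sort by descending count; break ties by item ID so results are deterministic.
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    # Optionally drop items the target user has already interacted with.
    return [item for item in ranked if item not in exclude][:k]

train = [("u1", "a"), ("u2", "a"), ("u3", "a"),
         ("u1", "b"), ("u2", "b"), ("u3", "c")]
top_for_anyone = most_popular(train, k=2)
top_for_u1 = most_popular(train, k=2, exclude={"a", "b"})
```

The same list is recommended to everyone, apart from the per-user exclusions, which is what makes it a useful sanity check for personalized methods.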

(c) Michael Collins; license: CC by-2.0




Always check if you measure on the same splits.

It happens quite often … e.g. this ICML 2013 paper:


3. Apples and Oranges

● On chronological splits of the Netflix dataset, matrix factorization (“SVD”) models usually do not reach an RMSE below 0.9.
● Chronological splits can be much harder than random splits!

Lessons:
● Baselines are important: they can also help us to “debug” experiments.
● Do not compare results on simple (random) splits with results on chronological splits.

(c) Pastorius; license: CC by 3.0; source: http://commons.wikimedia.org/wiki/File:Plastic_tape_measure.jpg

4. Metrics

What is the right metric?
● Know your goal.
  – It always depends on what you want to achieve.
  – What to measure?
● Criticize your metrics.
  – They may ignore important aspects of your problem.
  – They are just approximations of user behavior.
● Eyeball the results.
  – Your metrics may fail to catch WTF results.

http://thenoisychannel.com/2012/08/20/wtf-k-measuring-ineffectiveness/

4. Metrics

item_recommendation ... --measures="prec@5,NDCG"

Precision at k
● number of “correct” items in the top k results
● The choice of k is specific to your application.
● very simple
● easy to understand and explain

More ranking measures: NDCG, MAP, ERR

4. Metrics

Precision at k

Recommendation   Counted for precision at 4
bad              0
good             1
bad              0
bad              0
bad              --
good             --
bad              --

Precision at 4 = 1/4
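The example above can be checked with a few lines of Python. The item names are made up; the function divides by k even if fewer than k items are returned, which is one common convention and an assumption here.

```python
def precision_at_k(ranked_items, relevant_items, k):
    """Share of the top-k recommendations that are relevant ("good")."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    return hits / k

# Seven recommendations in ranked order; only r2 and r6 are "good".
ranked = ["r1", "r2", "r3", "r4", "r5", "r6", "r7"]
relevant = {"r2", "r6"}
p_at_4 = precision_at_k(ranked, relevant, k=4)  # one hit in the top 4: 0.25
```

The good item at position 6 does not count, because it lies outside the top 4.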

5. Hyperparameter Tuning

item_recommendation … --recommender=WRMF --recommender-options="reg=0.01 alpha=2"

● Hyperparameters, e.g.
  – regularization to control overfitting
  – learning rate (for gradient descent methods)
  – stopping criterion
● You have to do it. Also for your baselines.
● Don't get too fancy.
  – Grid search will do it in most cases.
● More advanced:
  – Nelder-Mead/Simplex
  – Particle swarm optimization

5. Hyperparameter Tuning

rating_prediction … --search-hp

Grid search
● simple
● brute force
● embarrassingly parallel

“A practical guide to SVM classification”

http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
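A generic grid search fits in a dozen lines of Python. The sketch below is not what `--search-hp` does internally; `grid_search` is a hypothetical helper, and `toy_validation_error` is a made-up stand-in for "train the model and measure the error on validation data".

```python
import itertools

def grid_search(evaluate, grid):
    """Try every combination in the grid; return the one with the lowest error."""
    names = sorted(grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # each call is independent: easy to parallelize
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_validation_error(params):
    # Stand-in objective with its minimum at reg=0.01, alpha=2 (illustration only).
    return (params["reg"] - 0.01) ** 2 + (params["alpha"] - 2) ** 2

# Exponentially spaced candidate values are a common choice for such grids.
grid = {"reg": [0.0001, 0.001, 0.01, 0.1, 1.0], "alpha": [1, 2, 4, 8]}
best_params, best_score = grid_search(toy_validation_error, grid)
```

The cost grows with the product of the grid sizes (here 5 × 4 = 20 evaluations), which is why it counts as brute force.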

6. Reproducible Experiments

item_recommendation … --random-seed=1

Random seed
● “random” splitting
● training initialization
● debugging
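The effect of a fixed seed on training initialization can be illustrated in Python. This is a toy sketch (`init_factors` is a hypothetical function), not how MyMediaLite initializes its models.

```python
import random

def init_factors(n_rows, n_factors, seed):
    """Draw small Gaussian starting values from a generator with a fixed seed."""
    rng = random.Random(seed)  # private generator, unaffected by other code
    return [[rng.gauss(0.0, 0.1) for _ in range(n_factors)]
            for _ in range(n_rows)]

run_a = init_factors(3, 2, seed=1)
run_b = init_factors(3, 2, seed=1)  # same seed: identical starting point
run_c = init_factors(3, 2, seed=2)  # different seed: different starting point
```

With the same seed, two runs start from exactly the same values, so any difference in the results must come from something else, which helps when debugging.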


item_recommendation … --random-seed=1

Besides random seed:
● Put everything in version control.
  – data, software
  – scripts and configuration
● Use build tools like make for automation.
  – make knows when to re-run your data preprocessing steps.

http://bitaesthetics.com/posts/make-for-data-scientists.html


item_recommendation … --recommender=ExternalItemRecommender --recommender-options="prediction_file=FILE"

Re-use evaluation code.

Create predictions using external software. Use MyMediaLite for evaluation.

Why re-use evaluation code?
● Evaluation protocols (splitting + candidate selection + metrics) are not easy to get right.
● Ensures comparability.
  – more configuration kept fixed => less risk of accidental differences
● Laziness!

(c) by Caucas; license: CC by-nc-nd 2.0; source: http://www.flickr.com/photos/thecaucas/2597813380/sizes/o/

Summary
1. Split your data appropriately.
2. Do not compare apples and oranges.
3. Compare against simple and strong baselines.
4. Precision at k is a metric that is easy to explain.
5. Grid search is a simple method for hyperparameter tuning.
6. Make your experiments reproducible.
7. MyMediaLite can help you with some of these things ;-). Try it out!

http://github.com/zenogantner/mml-eval-examples
http://mymedialite.net/
http://github.com/zenogantner/MyMediaLite

(c) Michael Sauers; license CC by-nc-sa 2.0
