Recommender System Experiments with MyMediaLite
Or: Everything you always wanted to know about offline experiments* (*but were afraid to ask)
Zeno Gantner <[email protected]>
Nokia Location & Commerce, Berlin
HERE Maps by Nokia … in Berlin
● ca. 800 people
● HERE Maps platform
– mobile apps
  ● HERE Drive
  ● HERE Maps
  ● HERE Transit (public transport)
– customers
  ● Yahoo Maps
  ● Bing Maps
  ● major car companies: BMW, VW, Toyota, ...
HERE Maps by Nokia … in Berlin
Maps Search Team
● #bbuzz regulars
● 3 of us contributed to Lucene 4.3.0 ;-)
http://2011.berlinbuzzwords.de/content/improving-search-ranking-through-ab-tests-case-study
http://2012.berlinbuzzwords.de/sessions/efficient-scoring-lucene
http://2012.berlinbuzzwords.de/sessions/introducing-cascalog-functional-data-processing-hadoop
http://2012.berlinbuzzwords.de/sessions/relevance-optimization-check-candidate-lists
https://issues.apache.org/jira/browse/LUCENE-4930
https://issues.apache.org/jira/browse/LUCENE-4571
(C) Paul L. Dineen; license: CC by; source http://www.flickr.com/photos/pauldineen/4529216647/sizes/o/in/photostream/
+ = ?
Data + Software/Algorithms = ???
(c) Joon Han, license: CC by-sa 3.0, source: http://en.wikipedia.org/wiki/File:Groundhog_day_tip_top_bistro.jpg
(c) Diliff; license CC by-3.0
Real-world deployments
Data mining competitions
Research
RecSys Experiments with MyMediaLite
1. Interaction Data
2. Baseline Methods
3. Apples and Oranges
4. Metrics
5. Hyperparameter Tuning
6. Reproducibility
Running Example: MyMediaLite
● RecSys toolkit and evaluation framework
● written in C#/Mono
● usable from C#, Python, Ruby, F#
● 2 Java ports (RapidMiner plugin)
● regular releases (every 2-3 months) since 2010
● simple
● choice
● free
● documented
● tested
http://mymedialite.net/
http://github.com/zenogantner/MyMediaLite
Running Example: MyMediaLite
command-line tools
● rating_prediction
● item_recommendation
Find all examples here:
http://github.com/zenogantner/mml-eval-examples
1. Interaction Data
Explicit feedback
Not always there.
Implicit feedback
● views
● clicks
● purchases
Often positive-only.
1. Interaction Data
User ID Item ID Timestamp
196 242 881250949
186 302 891717742
22 377 878887116
244 51 880606923
... ... ...
item_recommendation --training-file=F1 --test-file=F2
● User and item IDs can be (almost) arbitrary strings.
● The timestamp column is optional. Alternative timestamp format: yyyy-mm-dd
● Separator: whitespace, tab, comma, or ::
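The input format described above can be read with a few lines of Python. This is a minimal sketch for illustration, not MyMediaLite's own parser: the names `Interaction`, `parse_line` and `parse_interactions` are made up here, and the timestamp is kept as a raw string instead of being interpreted.

```python
import re
from typing import NamedTuple, Optional


class Interaction(NamedTuple):
    """One line of interaction data (hypothetical helper type)."""
    user: str
    item: str
    timestamp: Optional[str]  # raw string; None if the column is absent


# Separators named on the slide: whitespace, tab, comma, "::"
_SEPARATOR = re.compile(r"::|,|\s+")


def parse_line(line: str) -> Interaction:
    """Split one line into user ID, item ID and an optional timestamp."""
    fields = [f for f in _SEPARATOR.split(line.strip()) if f]
    if len(fields) < 2:
        raise ValueError(f"expected at least a user and an item ID: {line!r}")
    timestamp = fields[2] if len(fields) > 2 else None
    return Interaction(fields[0], fields[1], timestamp)


def parse_interactions(lines):
    """Parse all non-empty lines of an iterable of strings."""
    return [parse_line(line) for line in lines if line.strip()]
```

IDs stay strings on purpose, since the format allows (almost) arbitrary strings as IDs.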
Random Splits
item_recommendation … --test-ratio=0.25
Shuffle and split:
Simple, but:
● Does not take temporal trends into account.
● Does not use all data for testing.
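Shuffle-and-split can be sketched as follows. This illustrates the idea only, not MyMediaLite's code; the function name `random_split` and the rounding of the test set size are assumptions.

```python
import random


def random_split(interactions, test_ratio=0.25, seed=1):
    """Shuffle the data and cut off a fraction of it as the test set.

    Returns (train, test). The seed makes the shuffle repeatable.
    """
    rng = random.Random(seed)
    shuffled = list(interactions)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]
```

Because the order is shuffled, events from the "future" can end up in the training set, which is why such a split ignores temporal trends.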
k-fold Cross-Validation
item_recommendation … --cross-validation=4
Shuffle and split:
● Uses each data point for evaluation.
● Does not take temporal trends into account.
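A minimal sketch of k-fold cross-validation splitting, assuming a plain shuffle followed by round-robin assignment to folds. The name `k_fold_splits` is hypothetical, and MyMediaLite may assign folds differently.

```python
import random


def k_fold_splits(interactions, k=4, seed=1):
    """Yield k (train, test) pairs; every data point is in exactly one test set."""
    if k < 2:
        raise ValueError("k must be at least 2")
    rng = random.Random(seed)
    shuffled = list(interactions)
    rng.shuffle(shuffled)
    # Fold i gets every k-th element, starting at position i.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

The reported result is then typically the average of the metric over the k test folds.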
Chronological Splits
rating_prediction … --chronological-split=0.25
rating_prediction … --chronological-split=01/01/2002
Sort chronologically and split:
● Use the past to predict the “future”.
● Takes trends in the data into account.
– time of day, day of week
– season
– trending products
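Both variants of the chronological split (by ratio and by a fixed point in time) can be sketched like this. The function names are hypothetical, interactions are assumed to be `(user, item, timestamp)` tuples with comparable timestamps, and putting events that fall exactly on the split time into the test set is an assumption, not a documented MyMediaLite behavior.

```python
def chronological_split_by_ratio(interactions, test_ratio=0.25):
    """Sort by timestamp and use the most recent fraction as the test set."""
    ordered = sorted(interactions, key=lambda x: x[2])
    n_test = int(round(len(ordered) * test_ratio))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


def chronological_split_by_time(interactions, split_time):
    """Train on everything before split_time, test on everything from it on."""
    train = [x for x in interactions if x[2] < split_time]
    test = [x for x in interactions if x[2] >= split_time]
    return train, test
```

With the four example rows from the interaction data slide, a ratio of 0.25 puts only the latest event (user 186, item 302) into the test set.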
(c) Serolillo, license: CC by 2.5
2. Baseline Methods
Why compare against baselines?
● Absolute numbers have no meaning.
– … well, at least here.
● Relative numbers may also have no meaning.
– … if you compare to the wrong things.
Good baselines:
● the strongest solution that is still simple
● the existing solution
● standard solutions
– collaborative filtering: kNN, vanilla matrix factorization
2. Baseline Methods
item_recommendation … --recommender=Random
item_recommendation … --recommender=MostPopular
item_recommendation … --recommender=MostPopularByAttributes --item-attributes=ARTISTS
Item recommendation baselines:
● random
● popular items (by attribute/category)
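To show how little a popularity baseline needs, here is a minimal sketch. It is not the MostPopular implementation from MyMediaLite: the function name, the exclusion of items the user has already interacted with, and the tie-breaking by item ID are assumptions made for this example.

```python
from collections import Counter


def most_popular(train, user, n=5):
    """Recommend the n most frequent training items the user has not seen yet.

    train is an iterable of (user, item) pairs.
    """
    train = list(train)
    counts = Counter(item for _, item in train)
    seen = {item for u, item in train if u == user}
    # Most frequent first; ties are broken by item ID for determinism.
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return [item for item in ranked if item not in seen][:n]
```

Apart from filtering already seen items, the ranking is the same for every user, which makes the baseline cheap to compute and surprisingly hard to beat on some datasets.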
(c) Michael Collins; license: CC by-2.0
3. Apples and Oranges
Always check if you measure on the same splits.
It happens quite often … e.g. this ICML 2013 paper:
3. Apples and Oranges
● On chronological splits of the Netflix dataset, matrix factorization (“SVD”) models usually do not achieve an RMSE below 0.9 (lower is better).
● Chronological splits can be much harder than random splits!
Lessons:
● Baselines are important – they can also help us to “debug” experiments.
● Do not compare results on simple (random) splits with results on chronological splits.
(c) Pastorius; license: CC by 3.0; source: http://commons.wikimedia.org/wiki/File:Plastic_tape_measure.jpg
4. Metrics
What is the right metric?
● Know your goal.
– It always depends on what you want to achieve.
– What to measure?
● Criticize your metrics.
– They may ignore important aspects of your problem.
– They are just approximations of user behavior.
● Eyeball the results.
– Your metrics may fail to catch WTF results.
http://thenoisychannel.com/2012/08/20/wtf-k-measuring-ineffectiveness/
4. Metrics
item_recommendation ... --measures="prec@5,NDCG"
Precision at k
● fraction of “correct” items among the top k results
● The choice of k is specific to your application.
● very simple
● easy to understand and explain
More ranking measures: NDCG, MAP, ERR
4. Metrics
Precision at k
recommendations    precision at 4
bad                0
good               1
bad                0
bad                0
bad                --
good               --
bad                --
precision at 4 = 1/4
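The worked example above can be reproduced with a short function. This is a minimal sketch with a hypothetical name, `precision_at_k`; it divides by k even if fewer than k items were recommended, which is one common convention but not the only one.

```python
def precision_at_k(ranked_items, relevant_items, k):
    """Fraction of the top k recommended items that are relevant ("correct")."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_items)
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / k


# The example from the slide: "good" items at ranks 2 and 6.
ranked = ["i1", "i2", "i3", "i4", "i5", "i6", "i7"]
relevant = {"i2", "i6"}
example = precision_at_k(ranked, relevant, k=4)  # only rank 2 counts
```

The second "good" item at rank 6 is ignored because it lies outside the top 4, so the choice of k directly changes what the metric rewards.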
5. Hyperparameter Tuning
item_recommendation … --recommender=WRMF --recommender-options="reg=0.01 alpha=2"
● Hyperparameters, e.g.
– regularization to control overfitting
– learning rate (for gradient descent methods)
– stopping criterion
● You have to do it. Also for your baselines.
● Don't get too fancy.
– Grid search will do in most cases.
● More advanced:
– Nelder-Mead/Simplex
– Particle swarm optimization
5. Hyperparameter Tuning
rating_prediction … --search-hp
Grid search
● simple
● brute force
● embarrassingly parallel
“A practical guide to SVM classification”
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
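Grid search itself fits in a few lines. The sketch below is generic and does not show what `--search-hp` does internally; `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for "train with these hyperparameters and return a validation score where higher is better". The parameter names `reg` and `alpha` are borrowed from the WRMF example only for illustration.

```python
import itertools
import math


def grid_search(evaluate, grid):
    """Try every combination in grid and return (best_params, best_score).

    grid maps a hyperparameter name to a list of candidate values;
    evaluate takes a dict of hyperparameters and returns a score
    (higher is better).
    """
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Made-up objective with its optimum at reg=0.01, alpha=2."""
    return -(math.log10(params["reg"]) + 2) ** 2 - (params["alpha"] - 2) ** 2
```

Each call to `evaluate` is independent of the others, which is what makes grid search embarrassingly parallel. Regularization values are usually tried on an exponential scale (0.001, 0.01, 0.1, ...) rather than a linear one.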
6. Reproducible Experiments
item_recommendation … --random-seed=1
Random seed
● “random” splitting
● training initialization
● debugging
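Why a fixed seed matters for training initialization can be shown with a small sketch. The function `init_factors` is hypothetical and only mimics the random initialization of latent factors that matrix factorization models typically start from; it is not MyMediaLite code.

```python
import random


def init_factors(num_rows, num_factors, seed, stdev=0.1):
    """Create a num_rows x num_factors matrix of small Gaussian random values.

    Using a dedicated generator seeded with `seed` makes the result
    repeatable and independent of other random calls in the program.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, stdev) for _ in range(num_factors)]
            for _ in range(num_rows)]
```

Two runs with the same seed start from identical factors, so any difference in the results must come from the code or the data, which is what makes the seed useful for debugging.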
6. Reproducible Experiments
item_recommendation … --random-seed=1
Besides the random seed:
● Put everything in version control.
– data, software
– scripts and configuration
● Use build tools like make for automation.
– make knows when to re-run your data preprocessing steps.
http://bitaesthetics.com/posts/make-for-data-scientists.html
6. Reproducible Experiments
item_recommendation … --recommender=ExternalItemRecommender --recommender-options="prediction_file=FILE"
Re-use evaluation code.
Create predictions using external software. Use MyMediaLite for evaluation.
6. Reproducible Experiments
item_recommendation … --recommender=ExternalItemRecommender --recommender-options="prediction_file=FILE"
Why re-use evaluation code?
● Evaluation protocols (splitting + candidate selection + metrics) are not easy to get right.
● Ensures comparability.
– more configuration kept fixed => less risk of accidental differences
● Laziness!
(c) by Caucas; license: CC by-nc-nd 2.0; source: http://www.flickr.com/photos/thecaucas/2597813380/sizes/o/
Summary
1. Split your data appropriately.
2. Do not compare apples and oranges.
3. Compare against simple and strong baselines.
4. Precision at k is a metric that is easy to explain.
5. Grid search is a simple method for hyperparameter tuning.
6. Make your experiments reproducible.
7. MyMediaLite can help you with some of these things ;-). Try it out!
http://github.com/zenogantner/mml-eval-examples
http://mymedialite.net/
http://github.com/zenogantner/MyMediaLite
(c) Michael Sauers; license CC by-nc-sa 2.0