Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

Alan Said (@alansaid), TU Delft
Alejandro Bellogín (@abellogin), Universidad Autónoma de Madrid
ACM RecSys 2014, Foster City, CA, USA


Description

Video available at http://www.youtube.com/watch?v=1jHxGCl8RXc

Recommender systems research is often based on comparisons of predictive accuracy: the better the evaluation scores, the better the recommender. However, it is difficult to compare results from different recommender systems due to the many options in the design and implementation of an evaluation strategy. Additionally, algorithmic implementations can diverge from the standard formulation due to manual tuning and modifications that work better in some situations. In this work we compare common recommendation algorithms as implemented in three popular recommendation frameworks. To provide a fair comparison, we have complete control of the evaluation dimensions being benchmarked: dataset, data splitting, evaluation strategies, and metrics. We also include results obtained with the internal evaluation mechanisms of these frameworks. Our analysis points to large differences in recommendation accuracy across frameworks and strategies, i.e., the same baselines may perform orders of magnitude better or worse across frameworks. Our results show the necessity of clear guidelines when reporting the evaluation of recommender systems, to ensure reproducibility and comparability of results.


Page 1: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

Alan Said (@alansaid)

TU Delft

Alejandro Bellogín (@abellogin)

Universidad Autónoma de Madrid

ACM RecSys 2014, Foster City, CA, USA

Page 2: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

A RecSys paper outline
– We have a new model – it's great
– We used %DATASET% 100k to evaluate it
– It's 10% better than our baseline
– It's 12% better than [Authors, 2010]

Page 3: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks


Benchmarking

Page 4: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

• LibRec
• mrec
• Python-recsys

Page 5: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

What are the differences?
Some things just work differently:
• Data splitting
• Algorithm design (implementation)
• Algorithm optimization
• Parameter values
• Evaluation
• Relevance/ranking
• Software architecture
• etc.

Different design choices!!

How do these choices affect evaluation results?

Page 6: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

Evaluate evaluation
• Comparison of frameworks
• Comparison of implementations
• Comparison of results
• Objective benchmarking

Page 7: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

Algorithmic Implementation

Item-based kNN
• LensKit: ItemItemScorer, with CosineVectorSimilarity or PearsonCorrelation
• Mahout: GenericItemBasedRecommender, with UncenteredCosineSimilarity or PearsonCorrelationSimilarity
• MyMediaLite: ItemKNN, with Cosine or Pearson similarity

User-based kNN
• LensKit: UserUserItemScorer, with CosineVectorSimilarity or PearsonCorrelation; parameters: SimpleNeighborhoodFinder, NeighborhoodSize
• Mahout: GenericUserBasedRecommender, with UncenteredCosineSimilarity or PearsonCorrelationSimilarity; parameters: NearestNUserNeighborhood, neighborhood size
• MyMediaLite: UserKNN, with Cosine or Pearson similarity; parameter: neighborhood size

Matrix factorization
• LensKit: FunkSVDItemScorer; parameters: IterationsCountStoppingCondition, factors, iterations
• Mahout: SVDRecommender with FunkSVDFactorizer; parameters: factors, iterations
• MyMediaLite: SVDPlusPlus; parameters: factors, iterations
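
For reference, a minimal sketch of how the Mahout row of the item-based configuration above can be instantiated with the Taste API. The rating file name and layout, the user id, and the list length are illustrative assumptions, not the settings used in the benchmark.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class MahoutItemBasedExample {
    public static void main(String[] args) throws Exception {
        // Load a ratings file (assumed: MovieLens 100k ratings as CSV lines "userId,itemId,rating").
        DataModel model = new FileDataModel(new File("u.data.csv"));

        // Item-item similarity corresponding to the "IB Cosine" configuration above.
        ItemSimilarity similarity = new UncenteredCosineSimilarity(model);

        // Item-based recommender, as listed in the Mahout row of the table.
        GenericItemBasedRecommender recommender =
                new GenericItemBasedRecommender(model, similarity);

        // Top-10 recommendations for an arbitrary user id (assumption for illustration).
        List<RecommendedItem> topItems = recommender.recommend(1L, 10);
        for (RecommendedItem item : topItems) {
            System.out.println(item.getItemID() + "\t" + item.getValue());
        }
    }
}
```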

Page 8: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

There's more than algorithms, though
There's the data, the evaluation, and more.

Data splits (splitting sketch below)
• 80-20 cross-validation
• Random cross-validation
• User-based cross-validation
• Per-user splits
• Per-item splits
• etc.

Evaluation
• Metrics
• Relevance
• Strategies
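
To make the splitting choices above concrete, here is a small framework-independent sketch (plain Java, with a hypothetical Rating record) contrasting a global random 80-20 split with a per-user 80-20 split; real datasets and frameworks add further variations (seeds, time-based splits, per-item splits, and so on).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical minimal rating record used only for this sketch.
record Rating(long user, long item, double value) {}

public class SplitSketch {

    // Global random 80-20 split: shuffle all ratings, cut once.
    static List<List<Rating>> globalSplit(List<Rating> ratings, double trainRatio, long seed) {
        List<Rating> shuffled = new ArrayList<>(ratings);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainRatio);
        return List.of(shuffled.subList(0, cut), shuffled.subList(cut, shuffled.size()));
    }

    // Per-user 80-20 split: every user keeps ~80% of their own ratings in training.
    static List<List<Rating>> perUserSplit(List<Rating> ratings, double trainRatio, long seed) {
        Map<Long, List<Rating>> byUser = new HashMap<>();
        for (Rating r : ratings) {
            byUser.computeIfAbsent(r.user(), u -> new ArrayList<>()).add(r);
        }
        List<Rating> train = new ArrayList<>();
        List<Rating> test = new ArrayList<>();
        Random rnd = new Random(seed);
        for (List<Rating> userRatings : byUser.values()) {
            Collections.shuffle(userRatings, rnd);
            int cut = (int) Math.round(userRatings.size() * trainRatio);
            train.addAll(userRatings.subList(0, cut));
            test.addAll(userRatings.subList(cut, userRatings.size()));
        }
        return List.of(train, test);
    }
}
```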

Page 9: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

Real-world examples

MovieLens 1M [Cremonesi et al., 2010]

MovieLens 1M [Yin et al., 2012]

Page 10: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

Evaluation

Pipeline: Dataset → Training/Test split → Framework (Algorithm) → Evaluation → Results

Page 11: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

Internal Evaluation

Pipeline: Dataset → Training/Test split → Framework (Algorithm) → Evaluation → Results
(splitting and evaluation handled by each framework's internal evaluation mechanisms)

Page 12: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

Internal Evaluation Results

nDCG (framework-internal evaluation)
Algorithm    Framework     nDCG
IB Cosine    Mahout        0.00041478
IB Cosine    LensKit       0.94219205
IB Pearson   Mahout        0.00516923
IB Pearson   LensKit       0.92454613
SVD50        Mahout        0.10542729
SVD50        LensKit       0.94346409
UB Cosine    Mahout        0.16929545
UB Cosine    LensKit       0.94841356
UB Pearson   Mahout        0.16929545
UB Pearson   LensKit       0.94841356

RMSE (framework-internal evaluation)
Algorithm    Framework     RMSE
IB Cosine    LensKit       1.01390931
IB Cosine    MyMediaLite   0.92476162
IB Pearson   LensKit       1.05018614
IB Pearson   MyMediaLite   0.92933246
SVD50        LensKit       1.01209290
SVD50        MyMediaLite   0.93074012
UB Cosine    LensKit       1.02545490
UB Cosine    MyMediaLite   0.93419026
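
The enormous spread in the nDCG column (roughly 0.0004 to 0.94 for nominally the same algorithm) likely reflects differences in how each framework's internal evaluator defines the ranked list, the relevant items, and the candidate items. As a neutral reference point, here is a plain nDCG@k in textbook form, with binary relevance and a log2 rank discount; this is not the code of any of the three frameworks, and feeding it different candidate sets for the same recommender changes the score dramatically.

```java
import java.util.List;
import java.util.Set;

public class NdcgSketch {

    // nDCG@k with binary relevance and a log2 rank discount.
    static double ndcgAtK(List<Long> ranking, Set<Long> relevantItems, int k) {
        double log2 = Math.log(2);
        double dcg = 0.0;
        for (int i = 0; i < Math.min(k, ranking.size()); i++) {
            if (relevantItems.contains(ranking.get(i))) {
                dcg += log2 / Math.log(i + 2); // 1 / log2(rank + 1), rank is 1-based
            }
        }
        // Ideal DCG: all relevant items (up to k) ranked at the top.
        double idcg = 0.0;
        for (int i = 0; i < Math.min(k, relevantItems.size()); i++) {
            idcg += log2 / Math.log(i + 2);
        }
        return idcg == 0.0 ? 0.0 : dcg / idcg;
    }
}
```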

Page 13: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks


We need a fair and common evaluation protocol!

Page 14: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

Reproducible evaluation - Benchmarking
Control all parts of the process:
- Data splitting strategy
- Recommendation (black box)
- Candidate item generation (what items to test)
- Evaluation

Split: select a strategy
• By time
• Cross-validation
• Random
• Ratio

Recommend: select a framework and algorithm
• Apache Mahout
• LensKit
• MyMediaLite
• Select algorithm, tune settings, recommend

Candidate items: define a strategy
• What is the ground truth?
• Which users to evaluate?
• Which items to evaluate?

Evaluate: select metrics
• Error metrics: RMSE, MAE
• Ranking metrics: nDCG, Precision/Recall, MAP

Pipeline stages: Split → Recommend → Candidate items → Evaluate (sketched below)

http://rival.recommenders.net
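
Below is a minimal sketch of that four-stage controlled pipeline as plain Java interfaces. The names are hypothetical and are not RiVal's API; they only make the separation of stages, and therefore the points of control, explicit.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// All names below are hypothetical; they only illustrate the four controlled stages.
interface Splitter {
    // Stage 1: produce training/test folds under an explicit, documented strategy.
    List<TrainTest> split(List<Rating> ratings);
}

interface RecommenderAdapter {
    // Stage 2: the framework under test is a black box that ranks candidate items.
    List<Long> rank(long user, Set<Long> candidates, List<Rating> training);
}

interface CandidateStrategy {
    // Stage 3: decide which items each user is scored on (e.g. Relevant+N, TrainItems, UserTest).
    Set<Long> candidatesFor(long user, TrainTest fold);
}

interface RankingMetric {
    // Stage 4: compute a ranking metric (nDCG, precision/recall, MAP) against the test set;
    // error metrics such as RMSE would take predicted ratings instead of rankings.
    double evaluate(Map<Long, List<Long>> rankingsPerUser, TrainTest fold);
}

// Same shapes as in the splitting sketch above, repeated here for self-containment.
record Rating(long user, long item, double value) {}
record TrainTest(List<Rating> training, List<Rating> test) {}
```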

Page 15: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

Controlled Evaluation

Pipeline: Dataset → Training/Test split → Framework (Algorithm) → Evaluation → Results
(splitting, candidate items, and evaluation controlled outside the frameworks)

Page 16: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

AN OBJECTIVE BENCHMARK

LensKit vs. Mahout vs. MyMediaLite
MovieLens 100k (additional datasets in the paper)

Page 17: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

The Frameworks
• AM: Apache Mahout
• LK: LensKit
• MML: MyMediaLite

The Candidate Items
• RPN: Relevant + N [Koren, KDD 2008] (sketched below)
• TI: TrainItems
• UT: UserTest

Split Point
• gl: global
• pu: per-user

Split Strategy
• cv: 5-fold cross-validation
• rt: 80-20 random ratio

[Figure: nDCG@10 per algorithm]
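
Because the candidate-item strategy has such a large effect, here is a sketch of the Relevant + N candidate strategy in the spirit of [Koren, KDD 2008]: each relevant test item is ranked against N items the user has not rated, sampled at random. The helper names and the choice of N are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class RelevantPlusN {

    // Candidate set for one (user, relevant test item) pair:
    // the relevant item plus N items the user has not rated (N is an assumption, e.g. 100).
    static Set<Long> candidates(long relevantItem, Set<Long> ratedByUser,
                                List<Long> allItems, int n, Random rnd) {
        Set<Long> candidates = new HashSet<>();
        candidates.add(relevantItem);

        // Collect every item the user has not rated.
        List<Long> unrated = new ArrayList<>();
        for (Long item : allItems) {
            if (!ratedByUser.contains(item) && item != relevantItem) {
                unrated.add(item);
            }
        }

        // Sample N unrated items uniformly at random (without replacement).
        Collections.shuffle(unrated, rnd);
        for (int i = 0; i < Math.min(n, unrated.size()); i++) {
            candidates.add(unrated.get(i));
        }
        return candidates;
    }
}
```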

Page 18: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks


User Coverage

Page 19: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks


Catalog Coverage

Page 20: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks


Time

Page 21: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks


Good accuracy?

Page 22: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks


Yes, at the cost of coverage

Page 23: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks


What’s the best result?

Page 24: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks


Difficult to say … depends on what you’re evaluating!!

Page 25: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

In conclusion
• Design choices matter!
  – Some more than others
• Evaluation needs to be documented
• Cross-framework comparison is not easy
  – You need to have control!

Page 26: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks


What have we learnt?

Page 27: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

How did we do this?
RiVal – an evaluation toolkit for RecSys
• http://rival.recommenders.net
• http://github.com/recommenders/rival
• RiVal demo later today
• On Maven Central!
• RiVal was also used for this year's RecSys Challenge
  – www.recsyschallenge.com

Page 28: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

QUESTIONS?
Thanks!

Special thanks:
• Zeno Gantner
• Michael Ekstrand

Page 29: Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

Image credits:
• https://www.flickr.com/photos/13698839@N00/3001363490/in/photostream/
• http://rick--hunter.deviantart.com/art/Unfair-scale-1-149667590