Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks


DESCRIPTION

Video available at http://www.youtube.com/watch?v=1jHxGCl8RXc. Recommender systems research is often based on comparisons of predictive accuracy: the better the evaluation scores, the better the recommender. However, it is difficult to compare results from different recommender systems due to the many options in the design and implementation of an evaluation strategy. Additionally, algorithmic implementations can diverge from the standard formulation due to manual tuning and modifications that work better in some situations. In this work we compare common recommendation algorithms as implemented in three popular recommendation frameworks. To provide a fair comparison, we have complete control of the evaluation dimensions being benchmarked: dataset, data splitting, evaluation strategies, and metrics. We also include results using the internal evaluation mechanisms of these frameworks. Our analysis points to large differences in recommendation accuracy across frameworks and strategies, i.e., the same baselines may perform orders of magnitude better or worse across frameworks. Our results show the necessity of clear guidelines when reporting the evaluation of recommender systems to ensure reproducibility and comparability of results.


Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks

Alan Said (@alansaid)

TU Delft

Alejandro Bellogín (@abellogin)

Universidad Autónoma de Madrid

ACM RecSys 2014, Foster City, CA, USA

2

A RecSys paper outline
– We have a new model – it’s great
– We used %DATASET% 100k to evaluate it
– It’s 10% better than our baseline
– It’s 12% better than [Authors, 2010]

3

Benchmarking

4

LibRec

mrec

Python-recsys

5

What are the differences?
Some things just work differently:
• Data splitting
• Algorithm design (implementation)
• Algorithm optimization
• Parameter values
• Evaluation
• Relevance/ranking
• Software architecture
• etc.

Different design choices!!

How do these choices affect evaluation results?

6

Evaluate evaluation
• Comparison of frameworks
• Comparison of implementations
• Comparison of results
• Objective benchmarking

7

Algorithmic Implementation

Item-based
• LensKit: ItemItemScorer; similarity: CosineVectorSimilarity, PearsonCorrelation
• Mahout: GenericItemBasedRecommender; similarity: UncenteredCosineSimilarity, PearsonCorrelationSimilarity
• MyMediaLite: ItemKNN; similarity: Cosine, Pearson

User-based
• LensKit: UserUserItemScorer; similarity: CosineVectorSimilarity, PearsonCorrelation; parameters: SimpleNeighborhoodFinder, NeighborhoodSize
• Mahout: GenericUserBasedRecommender; similarity: UncenteredCosineSimilarity, PearsonCorrelationSimilarity; parameters: NearestNUserNeighborhood, neighborhood size
• MyMediaLite: UserKNN; similarity: Cosine, Pearson; parameters: neighborhood size

Matrix Factorization
• LensKit: FunkSVDItemScorer; parameters: IterationsCountStoppingCondition, factors, iterations
• Mahout: SVDRecommender; parameters: FunkSVDFactorizer, factors, iterations
• MyMediaLite: SVDPlusPlus; parameters: factors, iterations
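To make the differences concrete, here is a minimal sketch of wiring up the item-based configuration in Mahout, using the class names from the table above (the ratings file path and user ID are placeholders, and this assumes the classic Mahout Taste API). LensKit and MyMediaLite expose the equivalent choices through their own configuration mechanisms, which is exactly where implementations start to drift apart.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class MahoutItemBasedSketch {
    public static void main(String[] args) throws Exception {
        // Ratings in Mahout's userID,itemID,rating CSV format (placeholder path)
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Swap in UncenteredCosineSimilarity for the cosine variant from the table
        ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);

        // Note: no neighborhood-size parameter here, unlike the user-based classes
        GenericItemBasedRecommender recommender =
                new GenericItemBasedRecommender(model, similarity);

        // Top-10 recommendations for a placeholder user
        List<RecommendedItem> top10 = recommender.recommend(42L, 10);
        top10.forEach(item -> System.out.println(item.getItemID() + " " + item.getValue()));
    }
}
```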

8

There’s more than algorithms though
There’s the data, the evaluation, and more

Data splits (a sketch follows below)
• 80-20 cross-validation
• Random cross-validation
• User-based cross-validation
• Per-user splits
• Per-item splits
• Etc.

Evaluation
• Metrics
• Relevance
• Strategies
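As an illustration of how much leeway "data splits" leaves, here is a minimal sketch (the Rating record and method names are hypothetical, not taken from any of the frameworks) contrasting a global 80-20 random ratio split with a per-user ratio split; a cross-validation setup would repeat either scheme over k folds.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.stream.Collectors;

// Hypothetical rating record used only in these sketches
record Rating(long user, long item, double value) {}

class SplitSketch {
    // Global random ratio split: trainRatio of *all* ratings go to training,
    // so some users may end up only in the test set (cold users at test time).
    static Map<String, List<Rating>> globalRatioSplit(List<Rating> data, double trainRatio, Random rnd) {
        List<Rating> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, rnd);
        int cut = (int) Math.round(trainRatio * shuffled.size());
        return Map.of("train", shuffled.subList(0, cut),
                      "test", shuffled.subList(cut, shuffled.size()));
    }

    // Per-user ratio split: trainRatio of *each user's* ratings go to training,
    // so every test user is guaranteed to have some training data.
    static Map<String, List<Rating>> perUserRatioSplit(List<Rating> data, double trainRatio, Random rnd) {
        List<Rating> train = new ArrayList<>();
        List<Rating> test = new ArrayList<>();
        Map<Long, List<Rating>> byUser = data.stream().collect(Collectors.groupingBy(Rating::user));
        for (List<Rating> userRatings : byUser.values()) {
            Collections.shuffle(userRatings, rnd);
            int cut = (int) Math.round(trainRatio * userRatings.size());
            train.addAll(userRatings.subList(0, cut));
            test.addAll(userRatings.subList(cut, userRatings.size()));
        }
        return Map.of("train", train, "test", test);
    }
}
```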

9

Real world examples

MovieLens 1M [Cremonesi et al., 2010]

MovieLens 1M [Yin et al., 2012]

10

Evaluation

[Pipeline diagram: Dataset → Training/Test split → Framework (Algorithm) → Evaluation → Results]

11

Internal Evaluation

[Pipeline diagram: Dataset → Training/Test split → Framework (Algorithm) → Evaluation → Results, with the evaluation performed by each framework's internal mechanisms]

12

Internal Evaluation Results

nDCG (internal evaluators)
Algorithm   Framework    nDCG
IB Cosine   Mahout       0.00041478
IB Cosine   LensKit      0.94219205
IB Pearson  Mahout       0.00516923
IB Pearson  LensKit      0.92454613
SVD50       Mahout       0.10542729
SVD50       LensKit      0.94346409
UB Cosine   Mahout       0.16929545
UB Cosine   LensKit      0.94841356
UB Pearson  Mahout       0.16929545
UB Pearson  LensKit      0.94841356

RMSE (internal evaluators)
Algorithm   Framework    RMSE
IB Cosine   LensKit      1.01390931
IB Cosine   MyMediaLite  0.92476162
IB Pearson  LensKit      1.05018614
IB Pearson  MyMediaLite  0.92933246
SVD50       LensKit      1.01209290
SVD50       MyMediaLite  0.93074012
UB Cosine   LensKit      1.02545490
UB Cosine   MyMediaLite  0.93419026
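For reference, a minimal nDCG@k sketch with binary relevance (an assumption; the frameworks may use graded gains and different log bases). Note that the score depends heavily on which candidate items each ranking is computed over: ranking only a user's held-out test items tends to give values near 1, while ranking them against a large candidate pool gives much smaller values, which is one kind of design choice behind gaps as large as those above.

```java
import java.util.List;
import java.util.Set;

class NdcgSketch {
    // nDCG@k with binary relevance: gain 1 if the item at each rank is among the
    // user's relevant test items, 0 otherwise; log-base-2 positional discount.
    static double ndcgAtK(List<Long> ranking, Set<Long> relevantTestItems, int k) {
        double dcg = 0.0;
        int top = Math.min(k, ranking.size());
        for (int i = 0; i < top; i++) {
            if (relevantTestItems.contains(ranking.get(i))) {
                dcg += 1.0 / (Math.log(i + 2) / Math.log(2)); // position i -> rank i+1
            }
        }
        double idcg = 0.0;
        int ideal = Math.min(k, relevantTestItems.size());
        for (int i = 0; i < ideal; i++) {
            idcg += 1.0 / (Math.log(i + 2) / Math.log(2));
        }
        return idcg == 0.0 ? 0.0 : dcg / idcg;
    }
}
```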

13

We need a fair and common evaluation protocol!

14

Reproducible evaluation - Benchmarking
Control all parts of the process:
- Data splitting strategy
- Recommendation (black box)
- Candidate items generation (what items to test)
- Evaluation

Pipeline stages: Split → Recommend → Candidate items → Evaluate

Split: select strategy
• By time
• Cross-validation
• Random
• Ratio

Recommend: select framework
• Apache Mahout
• LensKit
• MyMediaLite
Select algorithm and tune its settings

Candidate items: define strategy
• What is the ground truth
• What users to evaluate
• What items to evaluate

Evaluate: select metrics
• Error metrics: RMSE, MAE
• Ranking metrics: nDCG, Precision/Recall, MAP

http://rival.recommenders.net
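A minimal sketch of how these four stages can be kept under the experimenter's control is shown below; the Splitter, FrameworkRecommender, CandidateStrategy, and Metric interfaces are illustrative names, not RiVal's actual API, and the Rating record is the one from the splitting sketch earlier. The point is that only the recommendation step wraps a framework; splitting, candidate generation, and metric computation live outside it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical interfaces for the four controlled stages (not RiVal's API)
interface Splitter {
    Map<String, List<Rating>> split(List<Rating> data);             // "train" / "test"
}
interface FrameworkRecommender {
    List<Long> recommend(long user, Set<Long> candidates, int n);   // ranked item IDs
}
interface CandidateStrategy {
    Set<Long> candidates(long user, List<Rating> train, List<Rating> test);
}
interface Metric {
    double evaluate(Map<Long, List<Long>> rankings, List<Rating> test);
}

class ControlledEvaluation {
    static double run(List<Rating> data, Splitter splitter, FrameworkRecommender recommender,
                      CandidateStrategy strategy, Metric metric, int cutoff) {
        Map<String, List<Rating>> split = splitter.split(data);
        List<Rating> train = split.get("train");
        List<Rating> test = split.get("test");

        // Rank the controlled candidate set for every user that appears in the test set
        Map<Long, List<Long>> rankings = new HashMap<>();
        Set<Long> testUsers = test.stream().map(Rating::user).collect(Collectors.toSet());
        for (long user : testUsers) {
            Set<Long> candidates = strategy.candidates(user, train, test);
            rankings.put(user, recommender.recommend(user, candidates, cutoff));
        }
        return metric.evaluate(rankings, test);
    }
}
```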

15

Controlled Evaluation

[Pipeline diagram: Dataset → Training/Test split → Framework (Algorithm) → Evaluation → Results, with splitting and evaluation controlled outside the frameworks]

16

AN OBJECTIVE BENCHMARK

LensKit vs. Mahout vs. MyMediaLite
MovieLens 100k (additional datasets in the paper)

17

The Frameworks
• AM: Apache Mahout
• LK: LensKit
• MML: MyMediaLite

The Candidate Items (see the sketch below)
• RPN: Relevant + N [Koren, KDD 2008]
• TI: TrainItems
• UT: UserTest

Split Point
• gl: Global
• pu: Per-user

Split Strategy
• cv: 5-fold cross-validation
• rt: 80-20 random ratio

[Chart: nDCG@10 per algorithm across frameworks and evaluation configurations]
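As a rough illustration of what the Relevant + N candidate strategy does (sketch only; the relevance threshold, sampling size, and tie handling here are assumptions, and the Rating record is the one from the splitting sketch): the candidates for a user are that user's relevant test items plus N randomly drawn items the user has not rated, whereas TrainItems and UserTest simply use all training items or only the user's test items as candidates.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

class RelPlusNSketch {
    // "Relevant + N" candidates for one user: the user's relevant test items
    // plus n randomly sampled items the user has not rated anywhere.
    static Set<Long> relPlusN(long user, List<Rating> train, List<Rating> test,
                              Set<Long> allItems, int n, double relevanceThreshold, Random rnd) {
        Set<Long> rated = new HashSet<>();
        Set<Long> relevant = new HashSet<>();
        for (Rating r : train) {
            if (r.user() == user) rated.add(r.item());
        }
        for (Rating r : test) {
            if (r.user() == user) {
                rated.add(r.item());
                if (r.value() >= relevanceThreshold) relevant.add(r.item());
            }
        }
        List<Long> unrated = new ArrayList<>();
        for (long item : allItems) {
            if (!rated.contains(item)) unrated.add(item);
        }
        Collections.shuffle(unrated, rnd);

        Set<Long> candidates = new HashSet<>(relevant);
        candidates.addAll(unrated.subList(0, Math.min(n, unrated.size())));
        return candidates;
    }
}
```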

18

User Coverage

19

Catalog Coverage

20

Time

21

Good accuracy?

22

Yes, at the cost of coverage

23

What’s the best result?

24

Difficult to say … depends on what you’re evaluating!!

25

In conclusion
• Design choices matter!
  – Some more than others
• Evaluation needs to be documented
• Cross-framework comparison is not easy
  – You need to have control!

26

What have we learnt?

27

How did we do this?
RiVal – an evaluation toolkit for RecSys
• http://rival.recommenders.net
• http://github.com/recommenders/rival
• RiVal demo later today
• On Maven Central!
• RiVal was also used for this year’s RecSys Challenge
  – www.recsyschallenge.com

28

QUESTIONS?
Thanks!

Special thanks:
• Zeno Gantner
• Michael Ekstrand

29

• https://www.flickr.com/photos/13698839@N00/3001363490/in/photostream/
• http://rick--hunter.deviantart.com/art/Unfair-scale-1-149667590
