October 8, 2014
Unifying Nearest Neighbors Collaborative Filtering
Koen Verstrepen, Bart Goethals
Data: Binary, Positive-only
3
Task: Find the best recommendation for
4
5
By computing scores
6
Focus on computing
7
Multiple solutions for computing s( , ):
Matrix factorization
Nearest Neighbors Methods
…
8
User-Based Nearest Neighbors:
Item-Based Nearest Neighbors:
Two solutions for computing s( , )
9
User-Based Nearest Neighbors:
Item-Based Nearest Neighbors:
Two solutions for computing s( , )
Two solutions for computing
10
User-Based Nearest Neighbors:
Item-Based Nearest Neighbors:
s( , )
We observed
Different scores, because different information is used:
Both are valuable
Each ignores the other type of information
11
≠
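To make the two classical solutions concrete, here is a minimal sketch of user-based and item-based nearest-neighbour scoring on a binary preference matrix R. This is an illustration under assumptions, not the authors' code: cosine similarity and a neighbourhood size k are the usual choices from Sarwar et al. and Deshpande et al., but the exact variants in the paper may differ.

```python
import numpy as np

def cosine_rows(M):
    """Cosine similarity between all rows of a binary matrix M."""
    norms = np.sqrt(M.sum(axis=1, keepdims=True))
    norms[norms == 0] = 1.0          # avoid division by zero for empty rows
    N = M / norms
    return N @ N.T

def user_based_scores(R, k=5):
    """s(u, i) = sum of similarities of u's k most similar users that prefer i."""
    sim = cosine_rows(R)
    np.fill_diagonal(sim, 0.0)       # a user is not its own neighbour
    for u in range(sim.shape[0]):    # keep only the k nearest neighbours of u
        sim[u, np.argsort(sim[u])[:-k]] = 0.0
    return sim @ R                   # |U| x |I| score matrix

def item_based_scores(R, k=5):
    """s(u, i) = sum of similarities between i and the nearest items u prefers."""
    sim = cosine_rows(R.T)           # item-item similarities
    np.fill_diagonal(sim, 0.0)
    for i in range(sim.shape[0]):    # keep only the k nearest neighbours of each item
        sim[i, np.argsort(sim[i])[:-k]] = 0.0
    return R @ sim
```

The user-based score only inspects which users are similar to u, while the item-based score only inspects which items are similar to i, which is exactly the asymmetry the slides point out.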
Research Question
Can we clearly characterize how both algorithms ignore information?
Can we avoid ignoring information?
12
Contributions 1 & 2
13
Contributions 1 & 2: Unifying Reformulation
14
Instead of the traditional perspective:
A novel, unifying reformulation:
Contributions 1 & 2: Unifying Reformulation
15
User-based and item-based have different w( , )
16
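As a sketch of what this reformulation means in code (an illustration of the idea from the slides, not the paper's exact notation), every nearest neighbors score can be written as a sum over all known preferences (v, j), each weighted by an algorithm-specific function w:

```python
def reformulated_score(R, u, i, weight):
    """s(u, i) = sum over all pairs (v, j) with R[v, j] == 1 of weight(u, i, v, j).

    R is the binary preference matrix (e.g. a NumPy array); `weight` is the
    algorithm-specific function w(u, i, v, j).
    """
    n_users, n_items = R.shape
    score = 0.0
    for v in range(n_users):
        for j in range(n_items):
            if R[v, j] == 1:                 # only known preferences contribute
                score += weight(u, i, v, j)
    return score
```

User-based, item-based and KUNN then differ only in the `weight` function that is plugged in.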
Contributions 1 & 2: A novel w( , )
KUNN Unified Nearest Neighbors
Details
17
18
Computation of w( , )
User-based (reformulation of existing alg.)
Item-based (reformulation of existing alg.)
KUNN (novel alg.)
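As a rough illustration of how the two existing algorithms fit this mould (my own sketch, assuming precomputed similarity matrices `user_sim`/`item_sim` and neighbourhood functions `knn_users`/`knn_items`, which are not names from the paper), they can be expressed as weight functions for `reformulated_score`; note how each one inspects only users or only items:

```python
def make_user_based_weight(user_sim, knn_users):
    """User-based weight: only looks at how similar v is to u, and requires j == i."""
    def w(u, i, v, j):
        if j != i or v not in knn_users(u):
            return 0.0
        return user_sim[u][v]
    return w

def make_item_based_weight(item_sim, knn_items):
    """Item-based weight: only looks at how similar j is to i, and requires v == u."""
    def w(u, i, v, j):
        if v != u or j not in knn_items(i):
            return 0.0
        return item_sim[i][j]
    return w
```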
19
If , we have evidence to work with
1234 : UB - IB - KUNN
20
If , we have evidence to work with
= 1
1234 : UB - IB - KUNN
only relevant to … if also …
only relevant to … if also …
21
&
= 1
1234 : UB - IB - KUNN
only relevant to … if also …
only relevant to … if also …
22
&
= 1 × 1
1234 : UB - IB - KUNN
needs to be sufficiently similar to
23
= 1 × 1
1234 : UB - IB - KUNN
needs to be sufficiently similar to
24
= 1 × 1 × 1
1234 : UB - IB - KUNN
needs to be sufficiently similar to
25
= 1 × 1
1234 : UB - IB - KUNN
needs to be sufficiently similar to
26
= 1 × 1 × 1
1234 : UB - IB - KUNN
sufficiently similar to → +1
sufficiently similar to → +1
otherwise → 0
27
= 1 × 1
1234 : UB - IB - KUNN
sufficiently similar to → +1
sufficiently similar to → +1
otherwise → 0
28
= 1 × 1 × 2
1234 : UB - IB - KUNN
‘s are less informative if and like many things
= 1 × 1 × 1
1234 : UB - IB - KUNN
29
‘s are less informative if and like many things
= 1 × 1 × 1 × 1/√(2·3)
Unifying Nearest Neighbors Collaborative Filtering
Koen Verstrepen, Bart Goethals
University of Antwerp
Antwerp, Belgium
{koen.verstrepen,bart.goethals}@uantwerp.be

ABSTRACT
We study collaborative filtering for applications in which there exists for every user a set of items about which the user has given binary, positive-only feedback (one-class collaborative filtering). Take for example an on-line store that knows all past purchases of every customer. An important class of algorithms for one-class collaborative filtering are the nearest neighbors algorithms, typically divided into user-based and item-based algorithms. We introduce a reformulation that unifies user- and item-based nearest neighbors algorithms and use this reformulation to propose a novel algorithm that incorporates the best of both worlds and outperforms state-of-the-art algorithms. Additionally, we propose a method for naturally explaining the recommendations made by our algorithm and show that this method is also applicable to existing user-based nearest neighbors methods.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information filtering

Keywords: One-class collaborative filtering, nearest neighbors, explaining recommendations, top-N recommendation, recommender systems.

RecSys'14, October 6–10, 2014, Foster City, Silicon Valley, CA, USA.
Copyright 2014 ACM 978-1-4503-2668-1/14/10. http://dx.doi.org/10.1145/2645710.2645731

1. INTRODUCTION

u is a user
i is an item
u, v ∈ U are users
i, j ∈ I are items
The score for recommending item i to user u
Sum over all user-item-pairs (v, j)

Typically, the training data for collaborative filtering is represented by a matrix in which the rows represent the users and the columns represent the items. A value in this matrix can be unknown or can reflect the preference of the respective user for the respective item. Here, we consider the specific setting of binary, positive-only preference feedback. Hence, every value in the preference matrix is 1 or 0, with 1 representing a known preference and 0 representing the unknown. Pan et al. [10] call this setting one-class collaborative filtering (OCCF). Applications that correspond to this version of the collaborative filtering problem are likes on social networking sites, tags for photos, websites visited during a surfing session, articles bought by a customer, etc. In this setting, identifying the best recommendations can be formalized as ranking all user-item-pairs (u, i) according to the probability that u prefers i.

One class of algorithms for solving OCCF are the nearest neighbors algorithms. Sarwar et al. [13] proposed the well known user-based algorithm that uses cosine similarity and Deshpande et al. [2] proposed several widely used item-based algorithms. Another class of algorithms for solving OCCF are matrix factorization algorithms. The state-of-the-art algorithms in this class are Weighted Rating Matrix Factorization [7, 10], Bayesian Personalized Ranking Matrix Factorization [12] and Collaborative Less is More Filtering [14].

This work builds upon the different item-based algorithms by Deshpande et al. [2] and the user-based algorithm by Sarwar et al. [13] that uses cosine similarity. We introduce a reformulation of these algorithms that unifies their existing formulations. From this reformulation, it becomes clear that the existing user- and item-based algorithms unnecessarily discard important parts of the available information. Therefore, we propose a novel algorithm that combines both user- …
1234 : UB - IB - KUNN
30
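For concreteness, a toy preference matrix in this setting might look as follows (an illustrative example, not data from the paper):

```python
import numpy as np

# Rows are users, columns are items; 1 is a known preference, 0 is unknown
# (not a dislike). This is the binary, positive-only matrix R described above.
R = np.array([
    [1, 1, 0, 0],   # user 0 prefers items 0 and 1
    [0, 1, 1, 0],   # user 1 prefers items 1 and 2
    [1, 0, 1, 1],   # user 2 prefers items 0, 2 and 3
])
```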
31
‘s are less informative if and are popular
= 1 × 1 × 1
1234 : UB - IB - KUNN
32
‘s are less informative if and are popular
= 1 × 1 × 1 × 1/√(1·2)
1234 : UB - IB - KUNN
‘s are less informative if and like many things
‘s are less informative if and are popular
33
= 1 × 1 × 2
1234 : UB - IB - KUNN
‘s are less informative if and like many things
‘s are less informative if and are popular
34
= 1 × 1 × 2 × 1/√(2·3·1·2)
1234 : UB - IB - KUNN
In summary
NN algorithms reformulated as a sum over user-item pairs
One such pair is ( , )
For computing the contribution of the pair ( , ):
Existing user-based methods ignore info about …
Existing item-based methods ignore info about …
KUNN combines info about … and …
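Putting the walkthrough together, the per-pair contribution sketched on the preceding slides looks roughly as follows. This is my reading of the slides, not the paper's exact KUNN weight; `knn_users` and `knn_items` are assumed neighbourhood functions and the counts c(·) are taken from the binary matrix R.

```python
import numpy as np

def kunn_score(R, u, i, knn_users, knn_items):
    """Sum, over all known preferences (v, j), of the four factors from the slides."""
    c_user = R.sum(axis=1)           # c(u): number of items user u prefers
    c_item = R.sum(axis=0)           # c(i): number of users preferring item i
    n_users, n_items = R.shape
    score = 0.0
    for v in range(n_users):
        for j in range(n_items):
            if R[v, j] != 1:                      # evidence to work with
                continue
            if R[u, j] != 1 or R[v, i] != 1:      # (v, j) must link u and i
                continue
            # +1 if v is sufficiently similar to u, +1 if j is sufficiently similar to i
            neighbours = int(v in knn_users(u)) + int(j in knn_items(i))
            if neighbours == 0:
                continue
            # down-weight users who like many things and items that are popular
            score += neighbours / np.sqrt(c_user[u] * c_user[v] * c_item[i] * c_item[j])
    return score
```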
35
One More Thing
36
Contribution 3
37
Explaining recommendations
38
Explaining recommendations
39
Explaining recommendations
40
Explaining recommendations
41
Read our paper because…
1. Our reformulation could inspire you
2. You might want to benchmark with KUNN (link to source code in paper)
3. You want to explain NN recommendations (incl. user-based)
42
Accuracy Results Relative to POP
44
… experimental setups. In the user selected setup (Sec. 5.1), the users selected which items they rated. In the random selected setup (Sec. 5.2), users were asked to rate randomly chosen items.

5.1 User Selected Setup

This experimental setup is applicable to the Movielens and the Yahoo!Music_user datasets. In both datasets, users chose themselves which items they rated. Following Deshpande et al. [2] and Rendle et al. [12], one preference of every user is randomly chosen to be the test preference for that user. If a user has only one preference, no test preference is chosen. The remaining preferences are represented as a 1 in the training matrix R. All other entries of R are zero. We define the hit set H_u of a user u as the set containing all test preferences of that user. For this setup, this is a singleton, denoted as {h_u}, or the empty set if no test preference is chosen. Furthermore, we define U_t as the set of users with a test preference, i.e. U_t = {u ∈ U | |H_u| > 0}. Table 2 summarizes some characteristics of the datasets.

For every user u ∈ U_t, every algorithm ranks the items {i ∈ I | R_ui = 0} based on R. We denote the rank of the test preference h_u in such a ranking as r(h_u). Then, every set of rankings is evaluated using three measures. Following Deshpande et al. [2] we use hit rate at 10 and average reciprocal hit rate at 10. In general, 10 can be replaced by any natural number N ≤ |I|. We follow Deshpande et al. [2] and choose N = 10.

Hit rate at 10 is given by

HR@10 = (1 / |U_t|) Σ_{u ∈ U_t} |H_u ∩ top10(u)|,

with top10(u) the 10 highest ranked items for user u. Hence, HR@10 gives the percentage of test users for which the test preference is in the top 10 recommendations.

Average reciprocal hit rate at 10 is given by

ARHR@10 = (1 / |U_t|) Σ_{u ∈ U_t} |H_u ∩ top10(u)| · 1/r(h_u).

Unlike hit rate, average reciprocal hit rate takes into account the rank of the test preference in the top 10 of a user.

Following Rendle et al. [12], we also use the AMAN version of area under the curve, which is for this experimental setup given by

AUC_AMAN = (1 / |U_t|) Σ_{u ∈ U_t} (|I| − r(h_u)) / (|I| − 1).

AMAN stands for All Missing As Negative, meaning that a missing preference is evaluated as a dislike. Like average reciprocal hit rate, the area under the curve takes into account the rank of the test preference in the recommendation list for a user. However, AUC_AMAN decreases slower than ARHR@10 when r(h_u) increases.

We repeat all experiments five times, drawing a different random sample of test preferences every time.

5.2 Random Selected Setup

The user selected experimental setup introduces two biases in the evaluation. Firstly, popular items get more ratings. Secondly, the majority of the ratings is positive. These two biases can have strong influences on the results and are thoroughly discussed by Pradel et al. [11]. The random selected test setup avoids these biases.

Following Pradel et al. [11], the training dataset is constructed in the user selected way: users chose to rate a number of items they selected themselves. The test dataset, however, is the result of a voluntary survey in which random items were presented to the users and a rating was asked. In this way, both the popularity and the positivity bias are avoided.

This experimental setup is applicable to the dataset Yahoo!Music_random. The training data, R, of this dataset is identical to the full Yahoo!Music_user dataset. Additionally, this dataset includes a test dataset in which the rated items were randomly selected. For a given user u, the hit set H_u contains all preferences of this user which are present in the test dataset. The set of dislikes M_u contains all dislikes of this user which are present in the test dataset. We define U_t = {u ∈ U | |H_u| > 0, |M_u| > 0}, i.e. all users with both preferences and dislikes in the test dataset. Table 2 summarizes some characteristics of the dataset.

For every user u ∈ U_t, every algorithm ranks the items in H_u ∪ M_u based on the training data R. The rank of an item i in such a ranking is denoted r(i). Then, following Pradel et al. [11], we evaluate every set of rankings with the AMAU version of area under the curve, which is given by

AUC_AMAU = (1 / |U_t|) Σ_{u ∈ U_t} AUC_AMAU(u),

with

AUC_AMAU(u) = Σ_{h ∈ H_u} |{m ∈ M_u | r(m) > r(h)}| / (|H_u| · |M_u|).

AMAU stands for All Missing As Unknown, meaning that a missing preference does not influence the evaluation. Hence, AUC_AMAU(u) measures, for a user u, the fraction of dislikes that is, on average, ranked behind the preferences. A big advantage of this measure is that it only relies on known preferences and known dislikes. Unlike the other measures, it does not make the bold assumption that items not preferred by u in the past are disliked by u.

Because of the natural split in a test and training dataset, repeating the experiment with different test preferences is not applicable.

5.3 Parameter Selection

Every personalization algorithm in the experimental evaluation has at least one parameter. For every experiment we try to find the best set of parameters using grid search. An experiment is defined by (1) the outer training dataset R, (2) the outer hit set ∪_{u ∈ U_t} H_u, (3) the outer dislikes ∪_{u ∈ U_t} M_u, (4) the evaluation measure, and (5) the algorithm. Applying grid search, we first choose a finite number of parameter sets. Secondly, from the training data R, we create five inner data splits, defined by R^k, ∪_{u ∈ U_t} H^k_u, and ∪_{u ∈ U_t} M_u for k ∈ {1, 2, 3, 4, 5}. Then, for every parameter set, we rerun the algorithm on all five inner training datasets R^k and evaluate them on the corresponding inner test datasets with the chosen evaluation measure. Finally, the parameter set with the best average score over the five inner data splits is chosen to be the best parameter set for the outer training dataset.
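A minimal sketch of the four measures defined above (an illustration; `rank`, `h`, `hits` and `dislikes` are assumed data structures with ranks starting at 1, not names from the paper):

```python
def hr_at_10(rank, h):
    """HR@10: fraction of test users whose test preference h[u] is ranked in the top 10."""
    return sum(rank[u][h[u]] <= 10 for u in h) / len(h)

def arhr_at_10(rank, h):
    """ARHR@10: like HR@10, but a hit at rank r counts 1/r instead of 1."""
    return sum(1.0 / rank[u][h[u]] if rank[u][h[u]] <= 10 else 0.0 for u in h) / len(h)

def auc_aman(rank, h, n_items):
    """AUC (AMAN): all items without a known preference are treated as negatives."""
    return sum((n_items - rank[u][h[u]]) / (n_items - 1) for u in h) / len(h)

def auc_amau(rank, hits, dislikes):
    """AUC (AMAU): fraction of known dislikes ranked behind the known preferences."""
    total = 0.0
    for u in hits:
        behind = sum(sum(rank[u][m] > rank[u][h] for m in dislikes[u]) for h in hits[u])
        total += behind / (len(hits[u]) * len(dislikes[u]))
    return total / len(hits)
```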
[Results chart: accuracy relative to POP on Movielens 1M and Yahoo!Music for KUNN, WRMF, BPRMF, UB+IB, UB, IB and POP.]