

Department of Science and Technology (Institutionen för teknik och naturvetenskap)
Linköping University (Linköpings universitet)
SE-601 74 Norrköping, Sweden

LiU-ITN-TEK-A--17/031--SE

Implementing a scalable recommender system for social networks

Master's thesis carried out in Computer Engineering at the Institute of Technology, Linköping University

Alexander Cederblad

Supervisor: Pierangelo Dell'Acqua
Examiner: Camilla Forsell

Norrköping, 2017-06-08



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

© Alexander Cederblad


Abstract

Large amounts of items and users with different characteristics and preferences make personalized recommendation a difficult problem. Many companies employ recommender systems to solve the problem of discovery and information overload, where it is unreasonable for a user to go through all items to find something interesting. Recommender systems have become a popular field of research during the past two decades, and for many companies recommendations are an important aspect of their products with respect to user experience and revenue. This master's thesis describes the development and evaluation of a recommender system in the context of a social network for sports fishing called Fishbrain. It describes and evaluates several different approaches to recommender systems, and reasons about user characteristics, the user interface, and the feedback data provided by the users, all of which help shape the recommendations. The work aims to improve user experience in the given context. The system has been implemented and evaluated, with mixed results, considering the many variables important to Fishbrain that must be taken into account.


Acknowledgments

I would like to thank Fishbrain for allowing me to finish my thesis under their stewardship. I express my humble gratitude to Niklas Andersson for helping me brainstorm ideas in the inception of the project, for the continuous support during the entire project, and for keeping me on track. Also from Fishbrain, I would like to thank Mattias Lundell for taking the time to review my work and give me valuable feedback, and for showing a special interest during my time working on the thesis. Thanks to Ariel Ekgren for challenging my critical thinking during the data analysis.

Finally, I would like to thank my supervisor from the university, Pierangelo Dell'Acqua, for valuable feedback and peppiness during our meetings, without which I would not have been able to finish the work.


Contents

Abstract iii

Acknowledgments iv

Contents v

List of Figures vii

List of Tables viii

1 Introduction 1

1.1 Motivation 1
1.2 Discovery in similar services 2
1.3 Aim 4
1.4 Research questions 5
1.5 Delimitations 5

2 Theory 6

2.1 Recommender systems 6
2.2 Content-based filtering 7
2.3 Collaborative filtering 7
2.4 Hybrid recommender systems 9
2.5 Non-traditional approaches 9
2.6 Cold start problem 10
2.7 Feedback data 10
2.8 Data sparsity 11
2.9 Explaining recommendations 11
2.10 Evaluation 11

3 Method 13

3.1 Feedback data exploration 13
3.2 Implementation 14
3.3 Evaluation in stages 18


4 Results 20

4.1 Dataset 20
4.2 Evaluation framework 20
4.3 Offline evaluation 20
4.4 Online evaluation 21

5 Discussion 23

5.1 Results 23
5.2 Method 24
5.3 The work in a wider context 24

6 Conclusion 25

6.1 Research questions 25
6.2 Aim 26
6.3 Future work 26

Bibliography 27


List of Figures

1.1 (a) Screenshot of the Explore view where users can explore content from specific geographical areas by browsing a world map while filtering on species, fishing methods, etc. (b) The current Discover view where users can search and follow users, fish species, and fishing methods that may interest them at their own volition. Below these options is a list of the latest catches globally. 2
1.2 Showcasing two similar services and their presentation of recommendations. 3
1.3 Showcasing two similar services and their presentation of recommendations. 4
3.1 Diagram of the recommender system infrastructure. Arrows indicate dependency, not flow of information. 17
3.2 Recommendations as presented on the two different platforms. 18


List of Tables

3.1 The number of users for each dataset is relative to the number of users in the sparse, long time frame dataset. A larger mean and median is also an indicator of the amount of data for each user, and as stated in the previous chapter, more data implies better recommendations. 14
4.1 Table presenting the offline evaluation results for the different approaches tested on two different datasets. Hyphenated cells are missing measurements for the given approach. Cells with bold numbers illustrate the argued best result. For P@10, the number is the fraction of the recommended items that are in the test set. For RMSE, the number is the numerical error of the predicted ratings. 21
4.2 Online evaluation results for users globally. The numbers in the cells represent the percentage (%) change in the measured conversion rates. Cells with bold text indicate changes with a reasonable probability level (* 80 %, ** 95 %) to determine the results as statistically significant. 21
4.3 Online evaluation results for users in the USA. 22
4.4 Online evaluation results for users in Sweden. 22


1 Introduction

1.1 Motivation

Fishbrain is a company developing a social network and utility application (also called Fishbrain) for anglers. The application is available for iOS and Android. It features utility functionality for preparing fishing trips and enables socializing with other anglers by sharing fishing experiences. The social network has over 3,500,000 users. Fishbrain provides functionality similar to that of other social networks: a social feed where users can keep up with other users (and/or other entities) that they follow, a profile page, a few facilities for finding new friends and content, etc. In addition, the application also provides utility functionality like weather (or rather fishing prospect) forecasts, and a map where anglers can explore new bodies of water to visit and fish.

Fishing is a highly seasonal activity. For the main target group, located in the United States, high season occurs around May through August. During this time users do most of their yearly fishing: time on the water, posting catches, and engaging with the utility side of the application. During this time the current geographical neighborhood of the users is especially important; users want to know when fish start to bite in the waters they usually fish. During off-season, users are more inclined to browse through content from the past season and explore fishing activity in their own vicinity but also around the world.

The most important type of content to Fishbrain is the catches that users share. They include information about the catch (location, time of day, species, weight, bait, etc.) accompanied by a photo and optionally a short text. Users may also share fishing-related status updates, images, and videos without logging a catch. Users consuming the content provided by other users may provide feedback in the form of a "like" and/or comments. This information can be used to improve the user experience by providing relevant and personalized recommendations of content to users.

Arguably the most important thing for the social network, putting aside the utility side of the application, is that users interact with each other and engage in the community. Therefore it is important to make sure that users have a good experience while using the application, and to keep them engaged so that they want to come back. The hypothesis of the thesis is that improving content discovery and exploration will improve the overall user experience. The current facilities for exploration and discovery are presented in Figure 1.1; they currently lack the ability to produce personalized recommendations for users.


Figure 1.1: (a) Screenshot of the Explore view where users can explore content from specific geographical areas by browsing a world map while filtering on species, fishing methods, etc. (b) The current Discover view where users can search and follow users, fish species, and fishing methods that may interest them at their own volition. Below these options is a list of the latest catches globally.

1.2 Discovery in similar services

There are several social networks with similar user interaction that provide personalized recommendations for content discovery to their users.

Instagram allows its users to interact with each other similarly to Fishbrain. It uses the follow/followed paradigm of connecting users and lets users "like" and comment on the posts that users make in the form of images and videos. It provides a "Discover" view (Figure 1.2a) with a grid of recommended posts. Each post is accompanied by an explanation as to why it was recommended (e.g. "based on people you like", "based on people you follow", etc.) as well as the option for users to mark it as not desired by opting in to see "fewer posts like this", which may be used as feedback to make better recommendations. The Instagram engineering team published an article in 2015 describing their approach to discovery [10].

YouTube previously had a five-star rating system for videos; it has since switched to likes and dislikes as the main feedback users can provide. Comments are also enabled, as is the ability to mark videos as favourites. Aside from these explicit forms of feedback, implicit feedback can also be tracked in the form of the number of views, how much of a video is watched, how many times a user replays a video, etc. Researchers at Google have published a few papers describing how they make recommendations on YouTube [6, 8, 33].


Figure 1.2: Showcasing two similar services and their presentation of recommendations: (a) the Instagram discover view, (b) YouTube recommendations.

Netflix has been involved in the recommender system community arguably more than any other company, ever since they sponsored the Netflix Prize, offering 1,000,000 USD for improving their then-current algorithm, CineMatch. The winning solution, "BellKor's Pragmatic Chaos", included a collection of predictors, and its development has been featured in many research papers, the final solution being described in [18]. The competition sparked an interest in recommender systems and resulted in many contributions to the field. Netflix used to have a five-star rating system, but recently decided to change it to a "thumbs up"/"thumbs down" system, their motivation being that users were not being totally honest and that Netflix would be able to make better recommendations using this paradigm. This sparked controversy on social media amongst users who had meticulously kept ratings on their accounts, worried that recommendations would get worse. Amatriain et al. authored a case study about recommendations at Netflix [2].

Pinterest is a photo sharing website where users can catalog their ideas and inspiration for food recipes, interior design, fashion, etc. Users are recommended items that are similar to the content a user is interested in, as well as related "pins", as the items are called when collected by a user. Some of the company's approaches are described in [15].

The listed companies, and others, rely heavily on personalized recommended content, which makes it an important aspect from a business perspective [30], since a significant share of revenue is generated that way. YouTube has reported that 60 % of clicks from their homepage are on recommended content [8]. Almost everything on Netflix is a recommendation, and recommendations are therefore an important part of their business model [2]; in 2015, Gomez-Uribe et al. [11] claimed that two out of three hours of watched content came from recommendations.


Figure 1.3: Showcasing two similar services and their presentation of recommendations: (a) Netflix recommendations, (b) Pinterest recommendations.

Matchmaking and dating websites, which rely heavily on recommender systems, are an enormous industry [16]. Music recommendation is used by many music service providers, e.g. Spotify [4].

Recommender systems can be used to produce targeted ads; however, there are some differences that should be discussed. Many targeted ad systems rely heavily on real-time bidding, where advertisers bid on showing their ads to users by selecting a target group based on a variable number of characteristics: demographics, interests, etc. Recommender systems rather focus on providing items that users would not have found by themselves, items that are a total surprise or an unexpected coincidence. This concept is called serendipity and will be referred to as such henceforth. Targeted ads show relevant items coming from the highest bidder, or from some other revenue-maximizing metric. For a recommender system, such bidding would be considered cheating the system and would thereby break it. An example for Netflix is described by Amatriain et al. [2], where they claim that the cost of streaming is similar for every item. This means that they can focus on recommending the best items for their users instead of having to directly account for revenue while making recommendations.

1.3 Aim

The aim of this project is to improve the user experience in the application by providing personalized recommendations to the users in the social network. These recommendations should help users engage with other users by giving feedback and expanding their own social network. The recommender system should be able to recommend items to the users that are relevant to them, based on their social network and personal preferences. The solution should also be able to scale as the user base grows without compromising user experience.


The project demands exploring the data available on user interaction in the Fishbrain application, and how users are connected, to be able to reason and hypothesize about the best way to make recommendations.

Many companies and social networks have similar ways of recommending items and presenting recommendations to their users. However, they target different groups of people: their users differ in behaviour, and their applications differ in user experience. These differences need to be taken into account when designing the recommender system.

1.4 Research questions

1. How does one evaluate different approaches to recommender systems, and more specifically different models, offline while developing and online with real users?

2. How well do traditional recommender systems based on collaborative filtering perform when having only one-class, positive ratings as opposed to other types of ratings, such as five-star ratings and implicit feedback?

3. Which user characteristics need to be taken into account when developing a recommender system? How do different users respond to recommendations?

4. Can the recommender system help increase the use of the Fishbrain application?

1.5 Delimitations

There are many approaches to producing personalized content. Therefore, to narrow the scope of the thesis, a delimitation is set on the underlying data used to produce recommendations: user feedback will be used exclusively as the basis for making recommendations. Going further, the thesis will only consider explicit feedback that users give, not implicit feedback collected from user behaviour in the application.


2 Theory

2.1 Recommender systems

Recommender systems are systems that suggest items that may be interesting to a user. For example, they can be used to help make decisions about what to buy, what to listen to, and what to read. They solve the problem of information overload, where it is unreasonable for a user to work through all items to find something of interest, or, in the case of search, to even know what to search for.

There has been a lot of work in the field of recommender systems and it is an active field of research. There is an ACM conference series (RecSys1) exclusively focused on the subject. In 2006, Netflix sponsored a contest that boosted the interest in, and research on, the subject.

Recommender systems are typically categorized as follows:

Content-based filtering Recommendations are based on the similarity between user preferences and item profiles. Item profiles consist of important characteristics of that item. Text documents, for example, could be represented in vector space by important keywords; movies could be represented by genre and actors.

Collaborative filtering Recommendations are calculated based on the similarity between different users, considering their collective interactions with items (e.g. ratings, likes, and dislikes). This approach is therefore domain agnostic with respect to the items and users.

Hybrid recommender Combining different approaches has been shown to be successful in many cases. These types of recommender systems can be a mix of collaborative and content-based filtering, together with information about explicit user preferences (also known as knowledge-based recommender systems), demographic information, etc.

Even though the subject of recommender systems has been explored for many years, it is not trivial to implement a well-performing recommender system. Data needs to be collected and groomed, and models need to be chosen and tweaked for the specific domain. Recognizing the domain is important since the type of underlying data available differs between domains.

1 https://recsys.acm.org/


It is also important to recognize that users are different, both as individuals and depending on the domain they find themselves in. Therefore it is not trivial to choose an approach without evaluating the recommender system properly to make sure it performs well. Evaluation can be conducted during development by testing the hypothesis, measuring the error and prediction precision of the model. This can help steer the development in the right direction and improve on the model. However, it is not until actually testing the system with real users that a conclusion can be drawn about the performance of the recommendations.

The recommender problem involves predicting the user rating for an item and also ranking the predictions, so that the system can provide recommendations based on these predicted ratings. Given the set of users U and the set of items I, the rating can be expressed as:

r : U \times I \to \mathbb{R} \qquad (2.1)

The goal is to make predictions for every user such that, for a user u, the best prediction from the set of items I can be found (Equation 2.2), or rather, to rank the predictions and produce an ordered top-N sequence of the best predictions.

\text{prediction} = \arg\max_{j \in I} r(u, j) \qquad (2.2)

2.2 Content-based filtering

Content-based filtering is based on item characteristics and user preferences [21]. Items recommended have similar characteristics to the items preferred by the user in the past. These characteristics can be labels describing an item, like the genre or the director of a movie, or labels derived from the collection of items (e.g. TF-IDF [28]), etc. Thus, an item can be described by a vector of labels, called the feature vector. Similarly to items, user preferences can be described by preference vectors. It is then fairly straightforward to recommend items based on their similarity to each other by some measurement. One common measurement for comparing vectors is the angle between them: selecting the items whose vectors have the smallest angles to the preference vector gives the most similar items.
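As an illustration of the angle-based comparison described above, the following is a minimal sketch (not taken from the thesis) that ranks items by cosine similarity between hypothetical feature vectors and a hypothetical user preference vector, using NumPy.

```python
import numpy as np

# Hypothetical feature vectors over the labels [bass, pike, fly fishing, trolling].
items = {
    "catch_1": np.array([1.0, 0.0, 1.0, 0.0]),
    "catch_2": np.array([0.0, 1.0, 0.0, 1.0]),
    "catch_3": np.array([1.0, 0.0, 0.0, 1.0]),
}
user_preference = np.array([0.8, 0.1, 0.9, 0.0])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between the vectors; 1.0 means the same direction.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Smallest angle (highest cosine) first.
ranked = sorted(items, key=lambda i: cosine_similarity(user_preference, items[i]), reverse=True)
print(ranked)  # ['catch_1', 'catch_3', 'catch_2']
```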

Another approach is to create decision trees for each user [28]. Decision tree learning is a predictive model used in machine learning. The decision tree is represented as a binary tree where each non-leaf node is a condition on some feature of the item. The leaf nodes then represent the decision; in the case of recommender systems, whether the user would prefer an item or not.

The main advantage of content-based filtering is that it is independent of other users: only the relationship between the user in question and the items under consideration is taken into account. This helps eliminate the cold-start problem (described in more detail in Section 2.6) for new items. There are also drawbacks to content-based filtering. Content labeling is a hard problem to solve in such a way that recommendations are unexpected (serendipitous) and interesting to the users. Consider a movie recommender system: a content-based recommender would easily be able to recommend movies within the same genres, and with the same directors, as the movies a user likes, but would fail to recommend something novel. The cold-start problem also remains for new users.

2.3 Collaborative filtering

Collaborative filtering takes advantage of users' collective ratings of items. It uses this information to predict ratings based on users that are similar with respect to their ratings. Given the set of users U and the set of items I, the ratings matrix R is a |U| × |I| matrix describing the ratings given by users to items. The columns of R represent items and its rows represent users. The rating for an item i by a user u is written r_ui.


Collaborative filtering is based on the assumption that similar users display similar rating patterns. Essentially, the assumption made for the prediction is that, based on their similar past behaviour, users will have similar preferences in the future. Collaborative filtering can be divided into two different approaches: memory based and model based.

Memory based approach

The memory based approach [5], sometimes called the neighborhood approach, uses the entire set of ratings directly in the rating calculations. It is called the memory based approach since it uses all ratings to produce recommendations. Similar users are found by a similarity measurement sim(u, v), where u and v are two users. One commonly used similarity measurement is Pearson's correlation coefficient, which takes into account the users' ratings with respect to their average ratings. This similarity measurement is then used in the prediction calculation. The prediction pred(u, i), where u is a user, i is an item, and v̄_u is the average rating of user u, is then often expressed as a weighted sum over the other users:

\mathrm{pred}(u, i) = \bar{v}_u + \frac{\sum_{v \in U} \mathrm{sim}(u, v)\,(r_{vi} - \bar{v}_v)}{\sum_{v \in U} \mathrm{sim}(u, v)} \qquad (2.3)
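As a concrete illustration of Equation 2.3, the following is a minimal sketch (not from the thesis) of a neighborhood prediction over a small, hypothetical ratings matrix. Using SciPy's Pearson correlation, taking the absolute value of the similarity in the denominator, and skipping any neighborhood truncation are simplifying assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical ratings matrix: rows are users, columns are items, 0 means unrated.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [1.0, 0.0, 5.0, 4.0],
])

def user_mean(u: int) -> float:
    rated = R[u][R[u] > 0]
    return float(rated.mean())

def sim(u: int, v: int) -> float:
    # Pearson correlation over the items rated by both users.
    both = (R[u] > 0) & (R[v] > 0)
    if both.sum() < 2:
        return 0.0
    return float(pearsonr(R[u][both], R[v][both])[0])

def predict(u: int, i: int) -> float:
    # Weighted sum over the other users, as in Equation 2.3.
    num, den = 0.0, 0.0
    for v in range(R.shape[0]):
        if v == u or R[v, i] == 0:
            continue
        s = sim(u, v)
        num += s * (R[v, i] - user_mean(v))
        den += abs(s)
    return user_mean(u) if den == 0 else user_mean(u) + num / den

print(predict(0, 2))  # predicted rating of item 2 for user 0
```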

The performance of the memory based approach decreases as the data grows and data sparsity increases, which is more often than not the case for recommendation problems (data sparsity is described in Section 2.8). Since the entire ratings matrix needs to be considered, it is not suitable to use this approach for applications with massive user bases.

There have been attempts, with mixed results, to improve the performance of memory based recommender systems by using compression techniques to overcome memory constraints [31].

Model based approach

The model based approach involves uncovering latent factors from the observed ratings. These latent factor models can be created in various ways, most commonly by matrix factorization, or singular value decomposition (SVD), which involves transforming the items and users into latent factor spaces. These latent factor spaces try to characterize the users and items as inferred from the collective ratings. Other methods for uncovering latent factors include neural networks, Bayesian networks, probabilistic LSA, and latent Dirichlet allocation.

Although memory based approaches are more precise in a theoretical sense, since the calculations use all the data available, they do, as previously mentioned, suffer from performance issues at scale. The model based approach produces good results even for large, sparse datasets.

Some of the best latent factor models are based on matrix factorization [19]. Matrix factorization approaches are popular since they combine scalability and predictive accuracy. The ratings matrix R can be factored into matrices of lower dimensions:

R \approx \hat{R} = PQ^T \qquad (2.4)

where P is a |U| × k matrix and Q is a |I| × k matrix, with k being the number of latent factors. Refer to Equation 2.5 for an example illustration of this approximation, with k = 2.

PQ^T =
\begin{bmatrix}
p_{11} & p_{12} \\
p_{21} & p_{22} \\
p_{31} & p_{32}
\end{bmatrix}
\times
\begin{bmatrix}
q_{11} & q_{12} & q_{13} & q_{14} \\
q_{21} & q_{22} & q_{23} & q_{24}
\end{bmatrix}
=
\begin{bmatrix}
\hat{r}_{11} & \hat{r}_{12} & \hat{r}_{13} & \hat{r}_{14} \\
\hat{r}_{21} & \hat{r}_{22} & \hat{r}_{23} & \hat{r}_{24} \\
\hat{r}_{31} & \hat{r}_{32} & \hat{r}_{33} & \hat{r}_{34}
\end{bmatrix}
\qquad (2.5)

In the most common and simple approach, these matrices are estimated by minimizing the error of the approximation in Equation 2.4.


For recommender systems, most elements in the ratings matrix will be unknown. Earlier systems used to rely on imputation to fill in the missing values and make the matrix dense. This decreases performance, since it increases the amount of data, and inaccurate imputation is sensitive to overfitting. More common today is to model only the observed ratings in the ratings matrix. Overfitting the model is avoided by using a regularized model, so factors are learned by minimizing the regularized squared error (Equation 2.6), where r̂ is a predicted rating, P and Q are the latent factor matrices for users and items, R is the regularization function, and λ is a constant controlling the amount of regularization. Different approaches to regularization for SVD are described in [26].

\epsilon = \min_{P,Q} \sum_{u,i \in R} \left( r_{ui} - \hat{r}_{ui} \right)^2 + \lambda R(P, Q) \qquad (2.6)

There are a few learning algorithms to consider for solving the problem of minimizing the error. Two common ones are stochastic gradient descent (SGD) and alternating least squares (ALS). The former is the simpler one, which can easily be implemented and works reasonably well. The latter can be slower, but it has the advantage of being able to parallelize the computation and of working well in systems with implicit data [14], which is further explained in Section 2.7.
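The following is a minimal sketch (not from the thesis) of learning the factor matrices P and Q with SGD on the regularized squared error in Equation 2.6. The toy ratings, the learning rate, the number of factors, and the number of epochs are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed ratings as (user, item, rating) triples.
ratings = [(0, 0, 5.0), (0, 3, 1.0), (1, 0, 4.0), (2, 3, 5.0), (3, 2, 5.0), (3, 3, 4.0)]
n_users, n_items, k = 4, 4, 2

# Latent factor matrices, initialized with small random values.
P = 0.1 * rng.standard_normal((n_users, k))
Q = 0.1 * rng.standard_normal((n_items, k))

lr, lam = 0.01, 0.05  # learning rate and regularization constant

for epoch in range(200):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]                    # error of the current prediction
        P[u] += lr * (err * Q[i] - lam * P[u])   # gradient step with L2 regularization
        Q[i] += lr * (err * P[u] - lam * Q[i])

print(P[0] @ Q[1])  # predicted rating of item 1 for user 0
```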

Matrix factorization enables flexibility and customized solutions by adding biases. These are values that describe variations in individual users and items: some users give higher ratings than others, and some items receive higher ratings than others. This can be an effect of items collectively being considered better than others, rather than of users' true preferences, which is why biases can help predict ratings better than the user interactions alone can. Biases can, for example, be used to handle temporal effects on ratings: users change their rating behaviour over time, or items change in perception and therefore in the ratings they are given.

2.4 Hybrid recommender systems

Hybrid recommender systems combine techniques from different approaches, often collaborative filtering and content-based filtering. They can also include other techniques for making recommendations, e.g. collecting demographic information from users or having users supply their preferences explicitly. The two latter techniques can be used to overcome the cold-start problem, which is covered in more detail in Section 2.6.

The key aspect is how to combine the different techniques: they can be weighted, favouring the better performing parts of the hybrid, or mixed to provide diversity, etc.

2.5 Non-traditional approaches

Since the recommendations handled in this report are in a social network context, graphs are a natural fit for solving the recommendation problem.

Graph databases could be utilized to make recommendations based on social connections. Such a system could recommend people to follow based on existing relationships by executing a simple query language statement. There are some limitations to recommending items with graph databases, the most apparent being recommending an item when no relationship exists. It is easy to recommend a friend of a friend, but to recommend a distant user with similar preferences, the graph query is prone to quickly grow in size and complexity. Graph databases could possibly be part of a hybrid recommender system.
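To illustrate the friend-of-a-friend idea without a graph database, the following is a minimal in-memory sketch (not from the thesis); the follow graph is hypothetical, and a graph database would express the same traversal as a short query instead.

```python
from collections import Counter

# Hypothetical follow graph: user -> set of users they follow.
follows = {
    "alice": {"bob", "carol"},
    "bob": {"carol", "dave"},
    "carol": {"dave", "erin"},
    "dave": {"erin"},
}

def follow_suggestions(user: str, top_n: int = 3) -> list[str]:
    # Count users followed by the people that `user` follows (friends of friends),
    # excluding the user themself and anyone already followed.
    counts = Counter()
    for friend in follows.get(user, set()):
        for candidate in follows.get(friend, set()):
            if candidate != user and candidate not in follows[user]:
                counts[candidate] += 1
    return [u for u, _ in counts.most_common(top_n)]

print(follow_suggestions("alice"))  # ['dave', 'erin']
```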


2.6 Cold start problem

The cold start problem arises when users have not yet given any feedback, and its counterpart when items have not yet been given any feedback [12]. There is then no clear way of deriving user preference, or respectively item appeal.

There are a few approaches that can be considered for making recommendations when user feedback is not available. As with content-based filtering, labels may be derived from the items through content analysis, image recognition, location, etc., or they might be, by compulsion, provided by the user. As for user preference, there may be things that can be derived from metadata, e.g. location, age, gender, etc. These features can then be used to recommend items similar to these user profiles. It is important to note that these labels and profiles do not describe explicit preference for the individual items; therefore top rated items, items in geographical proximity, etc. can be weighed in to improve the recommendations.

The interface providing the recommendations to the user may feature a cold-start view, forcing users to specify a starting point for their preferences before serving recommendations. The downside of this solution is that it adds an extra hurdle for the users before they can interact with the recommendations.

2.7 Feedback data

User feedback is the different ways users interact with the underlying system, and the feedback data is the recorded interactions that can be used by the recommender system to produce recommendations. The most common type of feedback described in the literature is ratings, most often 5-star ratings, probably because of the Netflix Prize's prevalence in the field. However, there is an uncountable number of types of feedback and of ways to present them to the user. Feedback can be given by different types of ratings: one-class "likes", binary "thumbs up"/"thumbs down", numeric scales, etc., but also comments, views, time on site, etc. The feedback must be extracted from the system while considering how users interact with the underlying system or application, and what the recommender system tries to achieve. The different types of feedback can be divided into two groups, explicit and implicit feedback:

Explicit feedback Explicit feedback is feedback that users directly provide to evaluate an item; it is most often presented as ratings. Ratings are good because they show the degree of preference. With explicit ratings, users may feel that they have control over the recommendations presented to them. However, it has been shown that explicit ratings can be biased towards positive feedback, which can result in misleading rating predictions [22].

Implicit feedback Implicit feedback is feedback that users provide indirectly, or conversely feedback that the system records about the user. There are a few important characteristics to consider about implicit feedback [14]. A user's implicit feedback for an item is considered a measure of confidence in the user's preference, as opposed to the degree of preference (as with explicit feedback). Implicit feedback is positive-only; it is therefore hard to infer what users did not like. Implicit feedback data is inherently noisy, since a user's preference can only be guessed; e.g. a user may be purchasing a gift for someone else, or a user may interact with something recommended to them that they did not like. Implicit feedback is less intrusive for the users than explicit feedback, but the feedback data can be very noisy.


2.8 Data sparsity

Considering the ratings matrix R, it is unlikely that it will be entirely filled with ratings; that would mean that all users have rated all items in the dataset. On the contrary, in the domain of collaborative filtering it is most often the case that ratings matrices are very sparse. Consider an example: in a movie recommender, it is unlikely that every user has rated every movie. The sparsity of a matrix is defined as the number of zero-valued elements divided by the total number of elements. It is the inverse of the density of the matrix, which is the number of non-zero elements w divided by the total number of elements. For the ratings matrix R, its sparsity is defined as:

1 - \frac{w}{|U| \cdot |I|} \qquad (2.7)

A less sparse (or more dense) matrix is preferred, as it contains more information, making it easier to produce good recommendations. Depending on the context, it may be possible to make the ratings matrix less sparse by removing users, and maybe even items, with few ratings. The different approaches for making recommendations can then be evaluated against datasets with varying sparsity.
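The following is a minimal sketch (not from the thesis) of computing the sparsity in Equation 2.7 for a ratings matrix stored as a SciPy sparse matrix; the example matrix is hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical ratings matrix: 4 users x 5 items, zeros are missing ratings.
R = csr_matrix(np.array([
    [5, 0, 0, 1, 0],
    [0, 3, 0, 0, 0],
    [0, 0, 0, 4, 2],
    [1, 0, 0, 0, 0],
]))

n_users, n_items = R.shape
density = R.nnz / (n_users * n_items)  # non-zero elements divided by total elements
sparsity = 1.0 - density               # Equation 2.7

print(f"sparsity: {sparsity:.2%}")  # 70.00% for this toy matrix
```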

2.9 Explaining recommendations

Users are more likely to trust a recommendation when they know the reason why it was made [13]. An explanation can range from telling the user that a presented item is a personalized recommendation to explaining in detail the reasoning behind why an individual item is recommended. Surveys show that the vast majority of users want their recommendations accompanied by an explanation. Recommender systems can be divided into two groups considering their ability to explain recommendations:

Black-box A black-box recommender system is unable, or has difficulty, explaining its recommendations. From a technical perspective, this may stem from limitations in the model. From a user perspective, it means that users are not presented with explanations as to why an item was recommended.

White-box A white-box recommender system can, in contrast to a black-box recommender system, explain why a particular item was recommended to a user. This enables users to justify their trust in the system. Explanations can empower users, making them feel in control of the content that is provided to them. If users can understand why an item was recommended to them, they may choose to interact with the system to align themselves with the system's reasoning, and effectively improve the recommender system.

2.10 Evaluation

Evaluating recommender systems can be done numerically to ensure that the model performs well with respect to a set goal or hypothesis. It can therefore be advantageous to state the problem to be solved so as to simulate user behaviour, which can be difficult. [1] comprehensively describes several approaches and considerations to be made while evaluating a recommender system. This offline evaluation can measure how well a model predicts and ranks items. There are several evaluation metrics that can be used to evaluate recommender systems. A common one is the root mean square error (RMSE), which measures the error between the predicted and the actual ratings in a held-out test set. Another evaluation strategy is measuring the precision of the recommendations, i.e. the rate at which the recommender recommends items that appear in the test set. Two commonly used precision metrics are "Precision at K" (P@K) and "R-Precision".
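The following is a minimal sketch (not from the thesis) of the two offline metrics mentioned above: RMSE over predicted ratings and precision at K over a ranked recommendation list. The example predictions, recommendations, and test set are hypothetical.

```python
import math

def rmse(predicted: list[float], actual: list[float]) -> float:
    # Root mean square error between predicted and actual ratings.
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def precision_at_k(recommended: list[str], test_items: set[str], k: int = 10) -> float:
    # Fraction of the top-k recommended items that appear in the held-out test set.
    top_k = recommended[:k]
    return sum(item in test_items for item in top_k) / k

print(rmse([4.1, 2.9, 3.5], [4.0, 3.0, 5.0]))                 # ~0.87
print(precision_at_k(["a", "b", "c", "d"], {"b", "d"}, k=4))  # 0.5
```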


Offline evaluation can be used when testing different models, to help narrow in on a good model for the recommender system. However, to make sure that users appreciate the recommendations in practice, they need to be included in the evaluation process [12]. This can be done by conducting user studies to get qualitative information from users. It can also be done through online evaluation, actually testing the recommender system on users, preferably without them knowing. Online evaluation involves measuring some metric important to the system for which recommendations are made, for example user ratings, click rates, or the number of purchases. Online evaluation can be done by conducting A/B tests while experimenting with different approaches. A/B testing involves showing certain variations of a feature to certain users while measuring some important metric, and reasoning about which variation in the test performs the best.


3 Method

This chapter is divided into three parts: (i) the explorations and considerations made for the feedback data used to produce recommendations, (ii) the implementation, concerning the different algorithmic approaches and the evaluation framework used to provide and evaluate recommendations, and (iii) the considerations made for the evaluation, both offline and online.

3.1 Feedback data exploration

Because of the strong influence of the Netflix challenge and open datasets like MovieLens, much of the published research handles ratings on a numeric scale, most commonly one to five stars. Another common type of user feedback presented in the literature is implicit feedback. Neither is a perfect match for Fishbrain. Considering "likes" as ratings limits the level of preference to one-class, meaning one-level, positive-only ratings. Considering likes as implicit feedback, the preference is only a singular degree of confidence that an item matches the user's preference. By definition, likes are not implicit, since they are explicitly expressed by users. The arguments for handling the user feedback (in the form of likes) as explicit feedback, and on the contrary as implicit feedback, are:

Explicit feedback Likes are explicitly expressed by the user. Likes only describe a singular level of preference.

Implicit feedback Likes do not explicitly express a degree of preference; they describe a singular level of confidence for preference.

There are widely adopted schemes to mix the two types of feedback, e.g. SVD++ [17]. The work presented in this thesis does not account for implicit feedback data, due to limitations in the data collection of the underlying system, and also to scope the master's thesis work to a reasonable level. The work did, however, involve testing the different implicit feedback approaches with the ratings as is, along with the approaches more suited for explicit feedback.

In comparison to services like Netflix, which can very well recommend movies from decades ago, social networking posts are more time sensitive in that users heavily favour newer posts over older ones. Picking a ratings dataset from a time frame that can be appreciated by the user is therefore a convenient way of limiting the number of items to be considered for recommendation. Picking such a time frame is sensitive to how users react and are willing to interact with the system. Studying similar services, e.g. Pinterest and Instagram, it is quite clear that they only recommend fairly new items. Instagram seems to favour posts from within 24 hours, but can also feature week-old posts. Pinterest recommends older posts than Instagram; somewhere within a couple of months seems to be common (somewhat depending on which view in the application). Considering that Fishbrain, as a service, is more similar to Instagram, and listening to qualitative feedback from users, a smaller time frame is favourable. Users do most of their fishing during weekends, so the most interesting posts are posted during, or shortly after, weekends. It therefore makes sense to limit the time frame to week cycles.

              Sparse                                  Dense
Time frame    Users   Sparsity (%)  Mean    Median    Users   Sparsity (%)  Mean    Median
4 weeks       1.00    99.95         24.69   3         0.66    99.93         37.08   6
2 weeks       0.64    99.93         18.12   3         0.40    99.89         27.42   6
1 week        0.42    99.89         12.74   2         0.26    99.83         19.44   5

Table 3.1: The number of users for each dataset is relative to the number of users in the sparse, long time frame dataset. A larger mean and median is also an indicator of the amount of data for each user, and as stated in the previous chapter, more data implies better recommendations.

For every presented feedback dataset, users have rated at least one item. Considering the cold-start problem, these are the users for which the collaborative filtering model will be able to produce recommendations.

The different approaches have been evaluated against a number of permutations of the dataset of user feedback (Table 3.1). These datasets vary in time frame, and in whether ratings are restricted as an optimization for sparsity. Optimizing for sparsity can be done by only considering the users (and potentially items) with a minimum number of ratings.

From the largest to the smallest dataset, the density more than doubles; however, the datasets are still very sparse. Offline evaluation can give an indication of the impact of that change. Other benchmarked datasets are less sparse, at around 94-98 % sparsity. The number of users for which ratings are present in the dataset decreases almost linearly in relation to the density. A larger time frame must be taken into account when making a denser dataset, to be able to produce recommendations for a reasonable number of users.

Options arise in how to handle the recency of recommendations that is hypothesized to be better for Fishbrain. Biases can be added to the model to take into account users' varying preferences as a function of time. Another option is to consider ratings from a larger time frame and then, during post-processing, filter items from the desirable time frame.

3.2 Implementation

Recommendation system model

Different collaborative filtering models were tested, among them the baseline regularized singular value decomposition (RegSVD). Not all models tested were chosen for further exploration and evaluation, due to the considerations described below. The memory based approaches were quickly ruled out due to performance issues with larger datasets.

Approaches chosen for offline evaluation:


RegSVD (SGD) This regularized SVD approach is a well established approach for identifying latent factors, and is commonly used as a baseline recommender system. The implementation, and the software package used for this approach, is straightforward and easy to use.

NMF/NNMF (ALS) Non-negative matrix factorization with alternating least squares. This type of matrix factorization is quite similar to SVD, and has also been used for recommender systems [34]. This approach is implemented in the same software package as the previous entry.

RegSVD (ALS) This approach is from a different software package than the stochastic gradient descent approach. It is built for heavy parallelization and enormous datasets. Due to its implementation complexity, this package is, in comparison to the previously described RegSVD implementation, less attractive for further evaluation.

BPR Bayesian personalized ranking is a method for implicit datasets that centers around iteratively sampling positive and negative items and comparing them, i.e. learning to rank [29]. It reportedly optimizes best for the ROC (receiver operating characteristic) curve, which imposes some issues with comparing the different results offline. The chosen software package implementing this algorithm is easy to use and easy to integrate into an existing system.

WARP Weighted Approximate-Rank Pairwise loss is another learning-to-rank algorithm similar to BPR. It was first introduced by Weston et al. [32]. The authors claim that, in comparison to BPR, it optimizes for precision as opposed to ROC, or area under the curve (AUC). It is implemented in the same software package as the BPR approach; a minimal training sketch using that package follows below.
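As an illustration only, and not the configuration used in the thesis, the following is a minimal sketch of training the BPR and WARP approaches with the lightfm package listed in the infrastructure section below; the interaction matrix, the hyperparameters, and the evaluation cutoff are hypothetical.

```python
import numpy as np
from scipy.sparse import coo_matrix
from lightfm import LightFM
from lightfm.evaluation import precision_at_k

# Hypothetical one-class "like" interactions: rows are users, columns are items.
likes = coo_matrix(np.array([
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
], dtype=np.float32))

for loss in ("bpr", "warp"):
    model = LightFM(no_components=16, loss=loss)
    model.fit(likes, epochs=20, num_threads=2)
    # Precision@3 measured against the training interactions, for illustration only.
    p_at_k = precision_at_k(model, likes, k=3).mean()
    print(f"{loss}: P@3 = {p_at_k:.2f}")
```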

Post-processing recommendations

Post-processing is done after the recommender has predicted and ranked the items. Due to the differences in the approaches, and in the software packages chosen for further evaluation, some post-processing is done to keep the presentation uniform. Decisions regarding the post-processing have mainly been motivated by qualitative user feedback and by researching similar services.

• Users' own catches should not appear in their discover view. Naturally, there is no serendipity in presenting a post that a user has posted themself. A discover view could be argued to be a kind of top list for the like-minded: if users sharing some characteristics were presented with similar recommendations, it could be considered a feat for a user to have their posts featured in such a view, and a way to get recognition for their contributions. However, presenting these items is not suitable for personalized recommendations.

• Items that users have already seen should typically not be recommended to the user. The data needed to facilitate this type of filtering is the impressions users have made on the different items. If a lot of users are exposed to a lot of items, some data structure, like a Bloom filter, might be required to handle the amount of data and still perform. Because of limitations in the underlying system, this was not considered for implementation. It is also a design choice from a business perspective; user interviews or tests may give further insight into what users appreciate in the given domain. Related to the previous statement, items should not be recommended multiple times. The recommender system could trivially implement such a filter to make sure that items are not recommended multiple times to a user. However, considering the relatively short time frame set for recommendations in this domain, it is unlikely that an item would be recommended multiple times. To summarize, the problem can be funneled into:


1. Not showing items that users have seen.

2. Not showing items that users have seen in the discover view.

3. Not showing items that users have explicitly interacted with.

These different considerations for post-processing amount to a filter on the ordered list of recommendations output from the prediction and ranking step of the recommender system (Equation 3.1).

\{\text{recommendations}\} = \{\text{ranked predictions}\} - \{\text{user owned}\} - \{\text{user interacted}\} \qquad (3.1)

Since the implementation is limited to recommendations generated offline, the recommendations will, for each user, be the same for a short time period. However, the post-processing described above is applied online, at the time of request.
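The following is a minimal sketch (not from the thesis) of the set-difference filter in Equation 3.1, applied to a ranked list at request time; the item identifiers are hypothetical.

```python
def post_process(ranked_predictions: list[str],
                 user_owned: set[str],
                 user_interacted: set[str]) -> list[str]:
    # Equation 3.1 as a filter that preserves the ranking order.
    excluded = user_owned | user_interacted
    return [item for item in ranked_predictions if item not in excluded]

ranked = ["catch_42", "catch_7", "catch_13", "catch_99"]
print(post_process(ranked, user_owned={"catch_7"}, user_interacted={"catch_13"}))
# ['catch_42', 'catch_99']
```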

The follower/followee paradigm also raises the question of whether the recommender system should present items from users that are already followed. Depending on how the application is structured, users may already have seen those items in their feed, or they may have missed them due to information overflow. Filtering out items from followed users was not implemented, since the hypothesis was that users may miss items from users they follow because of the non-chronological feed, and also because of mixed signals in the qualitative user feedback.

Infrastructure

There are many free and open-source software packages available to produce recommen-dations. Many of these are created by researchers, implementing a number of different al-gorithms and approaches, some are created by open source communities, and a few arecommercial. Amongst the reviewed software packages, many of the packages used in re-search are not suitable for production environments, due to various reasons, examples beingperformance and interoperability (especially from the company point of view). There area few seemingly robust software packages more suited for production environments, how-ever these can lack in flexibility of implementation. For the implementation of the evaluationframework covered in this report, ease of implementation and use is premiered.

A number of packages were evaluated offline but excluded due to complexity of implementation, abandoned projects, company decisions, etc. Two packages were chosen for further evaluation; together they implement three different approaches to recommendations. The evaluated software packages are listed below, with a short description (in alphabetical order):

Apache Mahout [3] A collection of machine learning algorithms that can run on various distributed computation software.

Apache Spark (MLlib) [23] An engine for large-scale data processing, also bundling machine learning algorithms, among them a couple of collaborative filtering algorithms.

LensKit [9] A toolkit authored by Michael Ekstrand during his doctoral studies, intended to help researchers and developers of recommender systems reduce the experimentation effort needed during development.

lightfm [20] A software package for producing recommendations with an approach called "learning to rank".

mlpack [7] A machine learning library in C++ bundling various machine learning algorithms, among them matrix factorization algorithms.


Figure 3.1: Diagram of the recommender system infrastructure. Arrows indicate dependency, not flow of information.

Scikit-learn [27] A Python toolkit bundling various machine learning algorithms; it can be used to create recommender systems with its matrix factorization facilities, etc.
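To give a concrete impression of how such a package can be driven, the following is a minimal sketch of training and evaluating a lightfm model with the WARP ("learning to rank") loss. The toy interaction matrices, hyperparameters, and variable names are placeholders and do not reflect the actual Fishbrain data or configuration.

# Hedged sketch: lightfm with the WARP loss on toy one-class feedback data.
import numpy as np
from scipy.sparse import coo_matrix
from lightfm import LightFM
from lightfm.evaluation import precision_at_k

# Toy one-class feedback matrices: rows are users, columns are items, 1 = liked.
train = coo_matrix(np.array([[1, 0, 1, 0],
                             [0, 1, 0, 1],
                             [1, 1, 0, 0]]))
test = coo_matrix(np.array([[0, 1, 0, 0],
                            [0, 0, 1, 0],
                            [0, 0, 0, 1]]))

model = LightFM(loss="warp")                # WARP: ranking loss for implicit feedback
model.fit(train, epochs=10, num_threads=2)  # offline training step

# Precision at k, excluding items already seen during training.
print("P@2:", precision_at_k(model, test, train_interactions=train, k=2).mean())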

To produce recommendations, an isolated and simple-to-use component was developed within the application's infrastructure, providing the existing application with facilities for experimenting with recommender systems. The interface was developed with redundancies so that it can fall back to providing items even if the service were entirely offline.

Since online testing with real users is important for evaluation, the service needs to be able to provide recommendations from different providers. Since these differ in design, programming language, and overall architecture, they must be able to run uniformly against a shared interface. For simplicity, plain text was chosen as the interface, since virtually all programming languages can read and write files and/or standard input and output streams.

Figure 3.1 illustrates the recommender system infrastructure. The variable number of providers to be tested are individually and independently wrapped by the runner. The input to each provider runner is the user feedback, and the output consists of the recommendations. This is an offline process that can be run continuously or several times per day, filling a data store. The store can then be queried by other applications in the underlying infrastructure for recommendations through a single internal network interface. The application represents the underlying system of which the recommender system is part. The store is a persistent database caching the recommendations for online querying.
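A minimal sketch of what a provider runner's plain-text contract could look like is given below. The line formats ("user item weight" triples in, "user item rank" lines out) and the function names are assumptions for illustration, not the actual interface used in the implementation.

# Hedged sketch of a provider runner speaking a plain-text interface: it reads
# "user item weight" feedback lines from standard input and writes
# "user item rank" recommendation lines to standard output.
import sys
from collections import defaultdict

def run(recommend, top_n: int = 10) -> None:
    """Wrap a provider: parse feedback, call the provider, print recommendations."""
    feedback = defaultdict(dict)
    for line in sys.stdin:
        user, item, weight = line.split()
        feedback[user][item] = float(weight)

    for user, items in feedback.items():
        for rank, item in enumerate(recommend(user, items, top_n), start=1):
            print(f"{user} {item} {rank}")

def most_weighted(user, items, top_n):
    """Trivial stand-in provider: return the user's highest-weighted items."""
    return sorted(items, key=items.get, reverse=True)[:top_n]

if __name__ == "__main__":
    run(most_weighted)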

The different providers are then A/B tested, with the clients querying for recommendations from a specified provider determined by the variation assigned to them in the test. The solution can be extended to conduct further experiments with different sets of providers.
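As an illustration of how clients could be assigned to variations, the sketch below hashes a user id (salted with an experiment name) into one of the configured providers, so that repeated requests from the same user stay in the same group. The provider names and the bucketing scheme are hypothetical, not the actual Fishbrain setup.

# Hedged sketch of deterministic A/B variation assignment; names are hypothetical.
import hashlib

PROVIDERS = ["latest_catches", "regsvd_sgd", "warp"]  # control + two variations

def assign_variation(user_id: str, experiment: str = "discover-recs-v1") -> str:
    """Hash the user id, salted with the experiment name, into a stable bucket."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return PROVIDERS[int(digest, 16) % len(PROVIDERS)]

print(assign_variation("user-42"))  # the same user always gets the same provider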


3.3 Evaluation in stages

Offline

The software packages presented, and the algorithms they provide, were tested offline with the datasets presented in previous sections. The algorithms have different evaluation methods, so the different solutions are sometimes difficult to compare. The actual results are presented in the following Results chapter. The main metrics used as a foundation for deciding which approaches to continue evaluating were RMSE and P@K.
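For reference, and assuming the standard definitions of these metrics (with the held-out test interactions taken as the relevant items), they can be written as:

\mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{T}|} \sum_{(u,i) \in \mathcal{T}} (\hat{r}_{ui} - r_{ui})^2}
\qquad
\mathrm{P@}K = \frac{1}{|U|} \sum_{u \in U} \frac{|\mathrm{top}_K(u) \cap \mathrm{test}(u)|}{K}

where \mathcal{T} is the set of test interactions, \hat{r}_{ui} and r_{ui} are the predicted and observed feedback for user u and item i, \mathrm{top}_K(u) is the set of K highest-ranked recommendations for user u, and \mathrm{test}(u) is the user's set of held-out items.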

Online

The online evaluation was conducted as an A/B test, where the control, or baseline, of the test is the current feature in the application. The current feature is, as previously described, presenting the users with the most recent catches globally. The variations in the experiment are the different algorithms (from the different providers), presenting recommendations of items.

Although the feature could benefit from changes in the user interface (more screen real estate, further explanations, etc.), the initial experiment changes little in terms of how items are presented to the user. This is to make sure that the recommendations are tested on their own merits, compared to the older, already established feature. If successful, the evaluation framework could then be used to evaluate different approaches that exclusively make recommendations.

Figure 3.2: Recommendations as presented on the two different platforms: (a) Android, (b) iOS.

As shown in figure 3.2, the experiment changes only the title describing the feature, from "Latest catches" to "Catches you may like", and the items presented. The hypothesis for the experiment is that users presented with the personalized recommendations will interact more with the items than users in the control group. The main disadvantage of this approach is


that users familiar with the application may not notice the difference from the older Discover view. If they have already dismissed the older "Latest catches" view, they may never visit the newer view and/or notice the difference. The experiment would not suffer from this limitation if all variations (including the control) were recommendations; then the comparison would be less sensitive to changes in the user interface, etc. This should be considered when evaluating the recommender system further, since explaining the feature could help build trust with users and have them accept the feature.

Also apparent from figure 3.2 is the difference in user interface between the two platforms. The Android application features metadata, in the form of the number of likes and comments, etc. Compared to the latest catches, it is likely that items that have just been caught (i.e. items featured in "Latest catches") have close to zero user feedback. This might affect the results, since people's likelihood to interact depends on how popular an item appears to be.

The online evaluation measures success based on three different business metrics that relate to the aim of the thesis:

Like Likes on posts are a low-threshold indication of user engagement and also an indication of preference. Likes indicate positive feedback, whereas comments might be negative. Increasing this metric can also, as a side effect, help improve the recommender system, since likes are the primary feedback data.

Comment Compared to likes, comments are more qualitative, can range across all types of sentiment, and indicate a deeper engagement.

Follow Users following each other is an important aspect of the work. A follow is a strong indication of preference, since a user, starting from a recommended post, commits to following another user.

To see significant results in the changes in conversion rates of the enumerated metrics, both the sample size and the magnitude of change need to be considered in order to draw a statistically significant conclusion.
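As an illustration of this trade-off, the sketch below applies a standard two-proportion z-test to the conversion rates of a control group and a variation; the conversion counts and sample sizes are hypothetical and are not taken from the experiment.

# Hedged sketch: two-sided two-proportion z-test for a change in conversion rate.
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Return the two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                 # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error under H0
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Example: a ~3 % relative lift on a 5 % conversion rate with 100 000 users per group.
p = two_proportion_z_test(conv_a=5_000, n_a=100_000, conv_b=5_150, n_b=100_000)
print(f"p-value: {p:.3f}")  # about 0.13, so this lift is not significant at the 95 % level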


4 Results

The presentation of results is divided into four sections: decisions made for the datasets, the evaluation framework, the offline evaluation with a number of approaches, and the online evaluation with a subset of those approaches.

4.1 Dataset

The dataset parameters (time frame, sparsity, etc.) ultimately chosen for further evaluation were determined by the offline evaluation results, described in the following section. Because of the slight difference in performance between the approaches for each dataset, the user experience was also taken into consideration. Users favour more recent posts over older ones in their feed; this was determined by studying similar services and from qualitative user input. The solution also respects week cycles, so that feedback and items are sampled from every day of the week.

4.2 Evaluation framework

An evaluation framework was developed as described in chapter 3. It runs continuously offline, filling a data store. These recommendations can then be requested by the application and presented to the users. The framework also enables developers to experiment with recommender systems and continuously improve recommendations for users. This is done by extending the framework with providers, which can implement virtually any approach as long as they are able to communicate through the common text interface.

4.3 Offline evaluation

The offline evaluation was conducted on multiple datasets. As previously mentioned, these results had an impact on choosing which feedback dataset to consider when making recommendations. Some evaluation metrics are not applicable to all of the approaches, due to fundamental differences in algorithms and implementation. Measuring the precision at 10 (P@10) makes sense because that is the first page served to the application when users browse to the Discover page. The RMSE gives an indication of


the numerical error of the model. The results of the offline evaluation are presented in Table 4.1.

                Sparse              Dense
                P@10     RMSE       P@10     RMSE
RegSVD (SGD)    0.0020   0.0660     0.026    0.0505
RegSVD (ALS)    0.0010   0.0304     0.012    0.0257
BPR             0.0201   -          0.0231   -
WARP            0.0225   -          0.0246   -

Table 4.1: Offline evaluation results for the different approaches tested on two different datasets. Hyphenated cells are missing measurements for the given approach. Cells with bold numbers mark the argued best result. For P@10, the number is the fraction of the recommended items that appear in the test set. For RMSE, the number is the numerical error of the predicted ratings.

The results show that the WARP approach has the best precision, which is argued to be the most important metric since it better describes user behaviour than RMSE. The results for RegSVD showed low (and sometimes noisy) precision, while producing a small RMSE in comparison to benchmarked datasets. For these reasons, both RegSVD and WARP were chosen for further evaluation.

4.4 Online evaluation

Results for the online evaluation are presented in table 4.2 as the percentage change from the control group for the important metrics. They are first presented for metrics measured for all users globally, and then for users in the USA and Sweden. The results are split up by platform (Android and iOS), because the platforms (and their users) differ in several ways. The distribution of iOS and Android users varies between markets: for the USA, the distribution is close to fifty-fifty when comparing the number of Android and iPhone users, whereas in Brazil, Android makes up the absolute majority of the market. The applications also differ in overall user interface and user experience, even though the features are close to identical.

Android (Global)
                Comment   Follow   Like
RegSVD (SGD)    2.95*     0.70     0.63
WARP            2.33*     1.13     -0.15

iOS (Global)
                Comment   Follow   Like
RegSVD (SGD)    -0.67     -0.21    -0.24
WARP            0.13      1.28*    0.02

Table 4.2: Online evaluation results for users globally. The numbers in the cells represent the percentage (%) change in the measured conversion rates. Cells with bold text indicate changes at a reasonable probability level (* 80 %, ** 95 %) for determining the results as statistically significant.

Results are determined to be statistically significant based on the change in conversion rates and the sample size. Without statistical significance, there is no way of telling whether the results are real. The metrics are numbers measured holistically for the entire application. This is good because it measures the impact of having this feature in the application. However, it may have


difficulty showing any significant results if the usage of the feature is low in comparison to the rest of the application, or if there are only granular differences between the different variations.

Android (USA)
                Comment   Follow   Like
RegSVD (SGD)    3.90**    0.89     -0.31
WARP            3.09*     1.66     0.32

iOS (USA)
                Comment   Follow   Like
RegSVD (SGD)    0.07      0.88     0.49
WARP            0.08      1.48*    0.38

Table 4.3: Online evaluation results for users in the USA.

Table 4.3 shows the results for users in the USA. Notice that the results correspond with the global results. Also keep in mind that the majority of users are from the USA.

Android (Sweden)
                Comment   Follow   Like
RegSVD (SGD)    20.59*    8.21     8.32
WARP            10.85     27.98*   25.73**

iOS (Sweden)
                Comment   Follow   Like
RegSVD (SGD)    -6.30     -4.19    -4.69
WARP            -0.27     -1.91    -9.77*

Table 4.4: Online evaluation results for users in Sweden.

Results for users in Sweden are presented in table 4.4. Notice that the conversion rates need a larger change to show statistical significance. This is due to the sample size being smaller than, for example, that of users in the USA.

All three tables presenting results for the online evaluation show similar trends of differences between the two platforms (Android and iOS).


5 Discussion

5.1 Results

The offline evaluation results were almost consistently worse, for each approach used, than benchmarks presented in papers and scientific surveys for datasets with similar user feedback and sparsity. Different datasets describe different users and how they interact with the application from which the data is collected. This makes it hard to compare datasets by benchmarking in order to get a sense of reasonable results. The datasets used for comparisons in these benchmarks were presented in various research papers; the most common ones, as described before, handle five-star ratings, and the items are movies.

The offline evaluation was conducted without a validation set, which means that the algorithm could be optimized to a point where it overfits the test dataset. Regularization should make it less likely to overfit; however, it is not possible to know for sure without a separate validation set.
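A minimal sketch of the missing step is shown below: splitting the interaction log into train, validation, and test parts, so that hyperparameters (for example the regularization strength) are tuned on the validation part and the test part is scored only once. The split ratios and the function name are assumptions, not part of the thesis implementation.

# Hedged sketch: hold out a validation set in addition to the test set.
import pandas as pd

def split_feedback(feedback: pd.DataFrame, val_frac: float = 0.1,
                   test_frac: float = 0.1, seed: int = 42):
    """Shuffle the interaction log and split it into train/validation/test parts."""
    shuffled = feedback.sample(frac=1.0, random_state=seed)
    n = len(shuffled)
    n_test, n_val = int(n * test_frac), int(n * val_frac)
    test = shuffled.iloc[:n_test]
    val = shuffled.iloc[n_test:n_test + n_val]
    train = shuffled.iloc[n_test + n_val:]
    return train, val, test

# Hyperparameters would be chosen on `val`; only the final model is scored on `test`.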

The decision to measure the online evaluation metrics holistically for the entire application was made both because of implementation limitations and because of the difference between the baseline and the variations. The experimentation was not only about comparing different algorithms for making recommendations, but also about having recommendations instead of listing the latest catches globally. Since only a small part of the measured metrics is recorded in the discover view, it is difficult to know for sure whether changes in the recorded metrics were caused by changes to the feature or by noise (or at least what the level of noise is).

It is difficult at this stage to draw conclusions about the differing results concerning country and platform, since the results are measured holistically across the application. For Sweden especially, the results are more likely to be due to chance because of the comparatively small sample size. One significant difference in the user interface between the platforms may be an interesting piece of knowledge: on Android, users are presented with metadata for the catches (likes and comments, etc.). This suggests that users are more likely to interact with an item that other users have already interacted with. For "Latest catches", it is likely that the comment and like counters are zero. The catch may therefore be considered worse from a user perspective, and thus the user may feel less inclined, and be less likely, to give feedback.


5.2 Method

It became apparent while reading the literature that, in order to measure success, the recommender system needs to be evaluated thoroughly online with real users. This evaluation takes time, as data needs to be collected about the usage. A part of the thesis work was spent developing the framework and evaluating the different recommender systems. There have been previous attempts to develop frameworks for this purpose (e.g. [9]), as described in chapter 3.2. However, these were not mature enough for use in this thesis.

No novel approach for the actual model or algorithm for making recommendations is presented. The algorithms were chosen from existing software packages. This enabled focusing on the development of a framework for experimenting with and evaluating recommender systems. It is hard to say definitively whether, or rather by how much, a better developed recommendation model would perform better. However, since the framework now exists and is easy to extend, further effort can be put into experimentation. Many software packages and approaches were tested to make sure that they fit well with Fishbrain. Researching, trying out different solutions, implementing, and evaluating recommender systems are all important and time-consuming activities.

A decision had to be made whether to roll out the recommendations as a brand new feature or to develop them as an alternative to an existing feature. Changes to the user interface need to be carefully considered to make sure that the experiments are set up in such a way that the data can be trusted and interpreted. Rolling out the feature in a careful way, keeping the old feature with the latest catches as a baseline, was decided upon since it aligns better with how the company conducts experiments (with A/B tests) and develops features in the application in general.

For the online evaluation, focus was put on evaluating the metrics while having users interact with the system, rather than conducting user interviews. Some qualitative feedback was collected from users receiving recommendations; however, that was not the focus of the thesis.

The A/B test was rolled out evenly across all users. Another approach would have been to conduct the test only with users new to the application; this way, users would have no preconceived notions about the application and the test would be fairer. However, this approach would have drastically reduced the sample sizes.

5.3 The work in a wider context

The term filter bubble was coined in 2011 by Eli Pariser in the eponymous book [25]. It is the phenomenon in which personalized searches, recommendations, and other website algorithms guess and show users what they want to see, thus effectively not exposing users to viewpoints different from their own. Many companies, in particular Facebook and Google, have been heavily criticized for enabling this phenomenon in searches and recommendations, especially in the aftermath of the political elections and discourse during 2016 and 2017.

Nguyen et al. published a paper in 2014 [24] exploring the filter bubble phenomenon, and its longitudinal impacts, in recommender systems for movies. They found evidence that recommender systems do recommend a slightly narrowing set of movies, but also that users who consume recommended movies experience lessened narrowing effects.


6 Conclusion

6.1 Research questions

1. How does one evaluate different approaches to recommender systems, and more specifically different models; offline while developing, and online with real users?
The first research question was answered by describing and following common steps described in the literature. A system for presenting and evaluating recommender systems within the existing infrastructure of the company has been delivered.

2. How well do traditional recommender systems based on collaborative filtering perform when having only one-class, positive ratings, as opposed to other types of ratings such as five-star ratings and implicit feedback?
The second research question has been answered by showing worse results than benchmarks. Approaches commonly used for both explicit and implicit feedback have been tested with the one-class ratings, and the results have been presented. They were consistently (to varying degrees) worse than benchmarks. Further work must be done to customize the actual recommendation model for one-class ratings. Approaches that potentially perform better for these types of ratings have been presented in research; however, these were not extensively explored because of time and implementation limitations.

3. Which user characteristics need to be taken into account when developing a recommender system? How do different users respond to recommendations?
The third research question was answered by benchmarking the results with various datasets, showing different results depending on country and platform, with some explanations. However, further work is needed to draw a definitive conclusion.

4. Can the recommender system help increase the use of the Fishbrain application?
The fourth research question has yet to be answered conclusively. The results show an increase in user engagement for some of the metrics presented. However, there are discrepancies in sample size and overall interpretation that need further work.


6.2 Aim

The aim of the master’s thesis has been to improve the user experience in the application by providing personalized recommendations. Many other companies rely heavily on well performing recommender systems. As the number of users and items grows, the problem of producing relevant content for users becomes vital to the business.

Fishbrain has features in the application that help provide users with interesting content. However, they are heavily based on direct input: the entities users explicitly follow, which area of a map users are currently browsing, etc. Users do not always know what they are looking for. Given a massive number of items, it becomes unreasonable for users to manually go through and find interesting content on their own. Because of this, personalized recommendations are an important feature for Fishbrain as the service grows.

6.3 Future work

As previously discussed, no novel algorithm or model for making recommendations is presented in this report. A framework for developing and evaluating recommendations now exists and is ready to be used for further improving the approach, setting aside various limitations in development time and underlying limitations in the system. There are countless ways of improving the recommendations: not only the recommendation model itself, but also changes in the user interface concerning explanations, and producing recommendations online.


Bibliography

[1] Charu C. Aggarwal. “Evaluating Recommender Systems”. In: Recommender Systems: The Textbook. Cham: Springer International Publishing, 2016, pp. 225–254. ISBN: 978-3-319-29659-3. DOI: 10.1007/978-3-319-29659-3_7. URL: http://dx.doi.org/10.1007/978-3-319-29659-3_7.

[2] Xavier Amatriain and Justin Basilico. “Recommender Systems in Industry: A Netflix Case Study”. In: Recommender Systems Handbook. Ed. by Francesco Ricci, Lior Rokach, and Bracha Shapira. Boston, MA: Springer US, 2015, pp. 385–419. ISBN: 978-1-4899-7637-6. DOI: 10.1007/978-1-4899-7637-6_11. URL: http://dx.doi.org/10.1007/978-1-4899-7637-6_11.

[3] Apache Mahout. 2017. URL: https://mahout.apache.org.

[4] Erik Bernhardsson. Music recommendations at Spotify. 2013.

[5] John S. Breese, David Heckerman, and Carl Kadie. “Empirical Analysis of Predictive Algorithms for Collaborative Filtering”. In: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence. UAI’98. Madison, Wisconsin: Morgan Kaufmann Publishers Inc., 1998, pp. 43–52. ISBN: 1-55860-555-X. URL: http://dl.acm.org/citation.cfm?id=2074094.2074100.

[6] Paul Covington, Jay Adams, and Emre Sargin. “Deep Neural Networks for YouTube Recommendations”. In: Proceedings of the 10th ACM Conference on Recommender Systems. RecSys ’16. Boston, Massachusetts, USA: ACM, 2016, pp. 191–198. ISBN: 978-1-4503-4035-9. DOI: 10.1145/2959100.2959190. URL: http://doi.acm.org.e.bibl.liu.se/10.1145/2959100.2959190.

[7] Ryan R. Curtin, James R. Cline, Neil P. Slagle, William B. March, P. Ram, Nishant A. Mehta, and Alexander G. Gray. “mlpack: A Scalable C++ Machine Learning Library”. In: Journal of Machine Learning Research 14 (2013), pp. 801–805.

[8] James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He, Mike Lambert, Blake Livingston, and Dasarathi Sampath. “The YouTube Video Recommendation System”. In: Proceedings of the Fourth ACM Conference on Recommender Systems. RecSys ’10. Barcelona, Spain: ACM, 2010, pp. 293–296. ISBN: 978-1-60558-906-0. DOI: 10.1145/1864708.1864770. URL: http://doi.acm.org/10.1145/1864708.1864770.


[9] Michael D. Ekstrand. “Towards Recommender Engineering: Tools and Experiments in Recommender Differences”. Ph.D. Thesis. Minneapolis, MN: University of Minnesota, July 2014. URL: http://md.ekstrandom.net/research/thesis/.

[10] Instagram Engineering. Trending on Instagram. 2015. URL: https://engineering.instagram.com/trending-on-instagram-b749450e6d93.

[11] Carlos A. Gomez-Uribe and Neil Hunt. “The Netflix Recommender System: Algorithms, Business Value, and Innovation”. In: ACM Trans. Manage. Inf. Syst. 6.4 (Dec. 2015), 13:1–13:19. ISSN: 2158-656X. DOI: 10.1145/2843948. URL: http://doi.acm.org/10.1145/2843948.

[12] Asela Gunawardana and Guy Shani. “Evaluating Recommender Systems”. In: Recommender Systems Handbook. Ed. by Francesco Ricci, Lior Rokach, and Bracha Shapira. Boston, MA: Springer US, 2015, pp. 265–308. ISBN: 978-1-4899-7637-6. DOI: 10.1007/978-1-4899-7637-6_8. URL: http://dx.doi.org/10.1007/978-1-4899-7637-6_8.

[13] Jonathan L. Herlocker, Joseph A. Konstan, and John Riedl. “Explaining Collaborative Filtering Recommendations”. In: Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work. CSCW ’00. Philadelphia, Pennsylvania, USA: ACM, 2000, pp. 241–250. ISBN: 1-58113-222-0. DOI: 10.1145/358916.358995. URL: http://doi.acm.org/10.1145/358916.358995.

[14] Yifan Hu, Yehuda Koren, and Chris Volinsky. “Collaborative Filtering for Implicit Feedback Datasets”. In: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining. ICDM ’08. Washington, DC, USA: IEEE Computer Society, 2008, pp. 263–272. ISBN: 978-0-7695-3502-9. DOI: 10.1109/ICDM.2008.22. URL: http://dx.doi.org/10.1109/ICDM.2008.22.

[15] Dmitry Kislyuk, Yuchen Liu, David C. Liu, Eric Tzeng, and Yushi Jing. “Human Curation and Convnets: Powering Item-to-Item Recommendations on Pinterest”. In: CoRR abs/1511.04003 (2015). URL: http://arxiv.org/abs/1511.04003.

[16] Irena Koprinska and Kalina Yacef. “People-to-People Reciprocal Recommenders”. In: Recommender Systems Handbook. Ed. by Francesco Ricci, Lior Rokach, and Bracha Shapira. Boston, MA: Springer US, 2015, pp. 545–567. ISBN: 978-1-4899-7637-6. DOI: 10.1007/978-1-4899-7637-6_16. URL: http://dx.doi.org/10.1007/978-1-4899-7637-6_16.

[17] Yehuda Koren. “Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model”. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’08. Las Vegas, Nevada, USA: ACM, 2008, pp. 426–434. ISBN: 978-1-60558-193-4. DOI: 10.1145/1401890.1401944. URL: http://doi.acm.org.e.bibl.liu.se/10.1145/1401890.1401944.

[18] Yehuda Koren. The BellKor Solution to the Netflix Grand Prize. 2009.

[19] Yehuda Koren, Robert Bell, and Chris Volinsky. “Matrix Factorization Techniques for Recommender Systems”. In: Computer 42.8 (Aug. 2009), pp. 30–37. ISSN: 0018-9162. DOI: 10.1109/MC.2009.263. URL: http://dx.doi.org/10.1109/MC.2009.263.

[20] Maciej Kula. “Metadata Embeddings for User and Item Cold-start Recommendations”. In: Proceedings of the 2nd Workshop on New Trends on Content-Based Recommender Systems co-located with 9th ACM Conference on Recommender Systems (RecSys 2015), Vienna, Austria, September 16-20, 2015. Ed. by Toine Bogers and Marijn Koolen. Vol. 1448. CEUR Workshop Proceedings. CEUR-WS.org, 2015, pp. 14–21. URL: http://ceur-ws.org/Vol-1448/paper4.pdf.


[21] Pasquale Lops, Marco de Gemmis, and Giovanni Semeraro. “Content-based Recommender Systems: State of the Art and Trends”. In: Recommender Systems Handbook. Ed. by Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor. Boston, MA: Springer US, 2011, pp. 73–105. ISBN: 978-0-387-85820-3. DOI: 10.1007/978-0-387-85820-3_3. URL: http://dx.doi.org/10.1007/978-0-387-85820-3_3.

[22] Benjamin M. Marlin and Richard S. Zemel. “Collaborative Prediction and Ranking with Non-random Missing Data”. In: Proceedings of the Third ACM Conference on Recommender Systems. RecSys ’09. New York, New York, USA: ACM, 2009, pp. 5–12. ISBN: 978-1-60558-435-5. DOI: 10.1145/1639714.1639717. URL: http://doi.acm.org/10.1145/1639714.1639717.

[23] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, and Ameet Talwalkar. “MLlib: Machine Learning in Apache Spark”. In: J. Mach. Learn. Res. 17.1 (Jan. 2016), pp. 1235–1241. ISSN: 1532-4435. URL: http://dl.acm.org/citation.cfm?id=2946645.2946679.

[24] Tien T. Nguyen, Pik-Mai Hui, F. Maxwell Harper, Loren Terveen, and Joseph A. Konstan. “Exploring the Filter Bubble: The Effect of Using Recommender Systems on Content Diversity”. In: Proceedings of the 23rd International Conference on World Wide Web. WWW ’14. Seoul, Korea: ACM, 2014, pp. 677–686. ISBN: 978-1-4503-2744-2. DOI: 10.1145/2566486.2568012. URL: http://doi.acm.org/10.1145/2566486.2568012.

[25] Eli Pariser. The Filter Bubble: What the Internet Is Hiding from You. The Penguin Group, 2011. ISBN: 9781594203008.

[26] Arkadiusz Paterek. “Improving regularized singular value decomposition for collaborative filtering”. In: Proceedings of KDD Cup and Workshop (2007).

[27] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[28] Anand Rajaraman and Jeffrey David Ullman. Mining of Massive Datasets. New York, NY, USA: Cambridge University Press, 2014. ISBN: 978-1-10707-723-2.

[29] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. “BPR: Bayesian Personalized Ranking from Implicit Feedback”. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. UAI ’09. Montreal, Quebec, Canada: AUAI Press, 2009, pp. 452–461. ISBN: 978-0-9749039-5-8. URL: http://dl.acm.org/citation.cfm?id=1795114.1795167.

[30] Francesco Ricci, Lior Rokach, and Bracha Shapira. “Recommender Systems: Introduction and Challenges”. In: Recommender Systems Handbook. Ed. by Francesco Ricci, Lior Rokach, and Bracha Shapira. Boston, MA: Springer US, 2015, pp. 1–34. ISBN: 978-1-4899-7637-6. DOI: 10.1007/978-1-4899-7637-6_1. URL: http://dx.doi.org/10.1007/978-1-4899-7637-6_1.

[31] Saúl Vargas, Craig Macdonald, and Iadh Ounis. “Analysing Compression Techniques for In-Memory Collaborative Filtering”. In: Poster Proceedings of the 9th ACM Conference on Recommender Systems, RecSys 2015, Vienna, Austria, September 16, 2015. 2015. URL: http://ceur-ws.org/Vol-1441/recsys2015_poster2.pdf.

[32] Jason Weston, Samy Bengio, and Nicolas Usunier. “Wsabie: Scaling Up To Large Vocabulary Image Annotation”. In: Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI. 2011.


[33] Jason Weston, Hector Yee, and Ron J. Weiss. “Learning to Rank Recommendations with the K-order Statistic Loss”. In: Proceedings of the 7th ACM Conference on Recommender Systems. RecSys ’13. Hong Kong, China: ACM, 2013, pp. 245–248. ISBN: 978-1-4503-2409-0. DOI: 10.1145/2507157.2507210. URL: http://doi.acm.org/10.1145/2507157.2507210.

[34] M. Wu. “Collaborative Filtering via Ensembles of Matrix Factorizations”. In: KDD Cup and Workshop 2007. Max-Planck-Gesellschaft. Aug. 2007, pp. 43–47.
