Augmenting Probabilistic Matrix Factorization Models for Rare Users
by
Cody Severinski
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Graduate Department of Statistical Sciences
University of Toronto
© Copyright 2016 by Cody Severinski
Abstract
Augmenting Probabilistic Matrix Factorization Models for Rare Users
Cody Severinski
Doctor of Philosophy
Graduate Department of Statistical Sciences
University of Toronto
2016
Recommender systems are used for user preference prediction in a variety of contexts. Most commonly
known for movie suggestion from the Netflix competition, these systems have evolved to cover generic
product recommendation, friend suggestion, and even online dating. Matrix factorization models are
commonly employed for several reasons: they scale well, are easily learned, and can be adapted to
different contexts.
Many extensions of the baseline Probabilistic Matrix Factorization model have been proposed in the
literature and, as expected, all report better test results than the baseline. We review
several of these extensions, notably: constraints based on similar rating patterns among users, allowing
for nonconstant variance / precision in the model, introducing personal information on the users as
constraints, and including user networks in the model. These models are extended to the Bayesian
framework where necessary. We illustrate how these extensions perform overall, and for sets of users
defined by different numbers of ratings at training time. In particular, we highlight the benefit of many
of these extensions for infrequent users (those with few or no ratings in the system). This is particularly
important as these users are the most common in the recommendation framework.
In the case of user networks, we additionally study the robustness of the model in the presence of
random links. This reflects the true state of user networks in applications such as Facebook, where social
ties may not convey similar taste in preferences.
In addition, we provide the first direct comparison of the performance of the models learned from
Gibbs sampling and variational inference. Limitations of the variational algorithm are outlined for
multiple models, with proposals given for alleviating overfitting.
Acknowledgements
There are many people I wish to thank, as they have all been crucial to my success:
• My supervisor, Professor Ruslan Salakhutdinov. Your support, guidance, and encouragement have
been critical to this. You have offered support in all of modeling, theory, inference, and coding /
computation issues. I am grateful for all of this;
• My committee members: Professor Nancy Reid and Professor Jeff Rosenthal. You both have been
instrumental in guiding me along the way, reviewing work, and offering guidance for the next steps;
• My external: Professor Mu Zhu from the University of Waterloo. You raised some very important
questions in your report, which allowed me to make higher level connections between various
aspects of the thesis work;
• The Chair of the department: Professor James Stafford. You encouraged and supported me with
my first teaching role at the University of Toronto (Statistics 261), assisted with the organization
and planning of campus wide Research Day events, and took a sincere interest in my work and
development during the degree;
• My family, and in particular my parents: Leanne and Gary Severinski. If there are two people in
my life that deserve anything, it is them. Thank you Mom and Dad, for always being a phone call,
train, flight, or transit ride away;
• The Graduate Administrator in my department: Andrea Carter. You have also been key to the
completion of my degree, and to several other aspects of my time at the University of Toronto. You
have ensured that paperwork was completed on time, assisted with logistical / teaching matters
when I served as a course instructor, and coordinated everything with multiple deadlines. Most
importantly, you were always there to talk, and also to listen;
• Two other administrative members in the department: Christine Bulguryemez and Angela Fleury.
During my PhD, I helped organize campus wide Research Day events. You both were there, behind
the scenes, making sure the speaker’s flights, accommodations, and event logistics were arranged
for;
• My personal friend, Professor of Mathematics at UBC, and President of the UBC Faculty Associa-
tion: Professor Mark Mac Lean. You have been supportive, both academically and personally. You
have consistently provided an unbiased estimate of my work and development during my degree.
As a Statistician, I am grateful for your consistent and unbiased estimates;
• A faculty at the University of Toronto: Professor Alison Gibbs. You have been my mentor over
the years in developing my teaching abilities. In many ways, you oversaw the development of my
teaching while my supervisor oversaw the development of my research. Your name as a teaching
reference has been crucial with having secured sessional offers from multiple other universities;
• Fellow PhD: Alex Shestopaloff. You have provided external critiques and reviews of my work over
time, and you were there to help me keep everything in perspective.
• Friends who supported me in one way or another: Eric Peng, Uyen Hoang, Sian Hoe Cheong,
Erwin Alexander Ketterer, Anton Babadjanov, Suzanne Wasmund, Patrick Halina, Theri Kay,
Stefan Attig, Juveria Ghare, and Zita Baryonyx Poon;
• NSERC, for providing funding through the NSERC CGS program.
Contents

1 Introduction
  1.1 Framework
  1.2 Hierarchy of Models
    1.2.1 Content Based Systems
    1.2.2 Collaborative Filtering
    1.2.3 Evaluation Metrics
  1.3 Connection to Thesis Work
  1.4 Data Sets Used in this Thesis

2 Matrix Factorization
  2.1 Probabilistic Matrix Factorization
  2.2 Bayesian Probabilistic Matrix Factorization
  2.3 Constrained Bayesian PMF
  2.4 Inference
    2.4.1 Gibbs Sampling
    2.4.2 Variational Inference
  2.5 Predictions
    2.5.1 Gibbs Sampling
    2.5.2 Variational Inference
  2.6 Experimental Setup
  2.7 Experimental Results
    2.7.1 Variational Inference
    2.7.2 Gibbs Sampling
    2.7.3 Side Features
  2.8 Conclusion

3 Precision Models for Matrix Factorization
  3.1 Existing Noise Models for BPMF
  3.2 Truncated Precisions
  3.3 Inference
    3.3.1 Gibbs Sampling
    3.3.2 Variational Inference
  3.4 Prediction
  3.5 Experimental Setup
  3.6 Experimental Results
    3.6.1 Variational Inference
    3.6.2 Gibbs Sampling
    3.6.3 Truncated Precisions
    3.6.4 Overfitting in the Robust Model
    3.6.5 Side Features with Precisions
  3.7 Conclusion

4 Meta-Constrained Latent User Features
  4.1 Introduction
  4.2 Exploratory Analysis
  4.3 Model
  4.4 Inference
  4.5 Experimental Setup
  4.6 User Meta Information
    4.6.1 PCA for User Meta Information
  4.7 MAP Estimate
  4.8 Experimental Results
  4.9 Conclusion

5 A Generative Model for User Network Constraints in Matrix Factorization
  5.1 Introduction
  5.2 Previous Work
    5.2.1 Pseudo-Generative Extension
    5.2.2 A Special Case
  5.3 Proposed Model
  5.4 Inference
  5.5 Experimental Setup
  5.6 Experimental Results
    5.6.1 Pathological Network Behaviour
    5.6.2 Shift Model Performance
    5.6.3 Fake Networks
  5.7 Conclusion

6 Conclusion

Appendices

A Ancillary Results and Derivations
  A.1 Squared Error Term
    A.1.1 Quadratic with Respect to User Features
    A.1.2 Quadratic with Respect to Item Features
    A.1.3 Quadratic with Respect to Side Features
  A.2 Expectation of Certain Forms
    A.2.1 Expectation of Quadratic Forms
    A.2.2 User Quadratic Form
    A.2.3 Gamma Random Variable Expectation
    A.2.4 Wishart Random Variable Expectation

B Constrained PMF
  B.0.5 Conditional Posterior for Side Feature
  B.0.6 Conditional Posterior for User Feature

C Distributional Form of the Variational Approximation
  C.1 User Feature Vectors
  C.2 User Offset
  C.3 Item Feature Vectors
  C.4 Side Feature Vectors
  C.5 Precisions
  C.6 User / Item / Side Feature Hyperparameters

D Derivation of the Variational Lower Bound
  D.1 Complete Log-Likelihood
    D.1.1 Rating
    D.1.2 User Features
    D.1.3 User Precision
    D.1.4 User Bias
    D.1.5 User Hyperparameters
  D.2 Entropy
    D.2.1 Feature Vectors
    D.2.2 Precision Terms
    D.2.3 User Bias

E Meta Constrained PMF
  E.1 Meta Features
  E.2 User Features

F User Features in Matrix Factorization with User Networks
  F.1 Sampling Distribution for U_i
    F.1.1 Prior for U_i
    F.1.2 Prior for S_k
    F.1.3 Sampling Distribution for S_i

Bibliography
Chapter 1
Introduction
1.1 Framework
When confronted with a limited selection of options, humans are typically able to quickly review and
select a preferred option. They may even be able to rank such options. The simplest such decision is
a binary one: for instance, choosing whether to watch a movie in theatre one or in theatre two.
Current technology makes it easy to categorize, store, and present to a user a wealth of possible
choices. There are over 42 million Facebook pages [34] and hundreds of millions of hours of video on
YouTube [41, 27]. This presents new challenges. The most apparent is that the number of options to present to a user
is large. It is not possible for a user to review a representative list of all options. A naive solution would
be to present a subset of the most universally liked options. The presentation of these globally popular
items removes the personalized aspect of the recommendation, while also creating a self-reinforcement
of the popularity of these items.
The goal of a Recommender System is to remove the need for manual discovery of preferred items by
predicting the individual tastes of a user for new and unknown items. A good recommender system will
yield personalized recommendations with high precision. Personalized recommendations are unique to
the user’s interest, and not necessarily globally popular items. Given the large corpus of items typically
available, a good recommender system must also scale to large sets of users and items, and must also be
computationally efficient.
1.2 Hierarchy of Models
A main division in the design of recommender systems is the use of user and item level features. Systems
that extract features from the users and items are content-based recommender systems. Systems that
do not extract features from the users and items are collaborative based recommender systems.
1.2.1 Content Based Systems
Content based recommender systems share characteristics with many supervised learning models: given
a set of features (in this case, features on users and items), extract a collection of features to predict
new preferences for users. These features are also referred to as “domain knowledge”, reflecting the fact
that the features can change from domain to domain. The features used for music recommendation, for
instance, will be different from the features used for movie recommendation.
To give a concrete example, the Music Genome Project, patented by Pandora Media [8], is a
recommendation system for music. Songs are characterized by approximately 150 features. Typical
features include the "gender of lead vocalist, level of distortion on the electric guitar, type of background
vocals, etc." With these features, similarities between songs can be computed, and recommendations
can be made to users based on listening history.
Contrast this to the case of content based recommendation of movies. Typical systems will use
features such as genre, lead actor, director, and other character-based features for recommendation.
This reflects the fact that viewers tend to follow particular genres, actors, and directors. As mentioned,
these features are different from the set of features for music. This difference in domain knowledge can
make cross-domain recommendation difficult in the context of content based recommendation. Different
features, and likely different recommendation systems, will be needed.
1.2.2 Collaborative Filtering
Collaborative filtering methods contrast with content based recommendation systems in that they do
not use domain knowledge of the users and the items. Collaborative filtering methods rely on a user-item
rating matrix. The ratings may be given explicitly by the user or obtained implicitly through indirect
means. Examples of explicitly given ratings would be:
• Movie ratings on Netflix or the Internet Movie Database;
• Video ratings on YouTube;
• Liking a Facebook page;
• Rating a business on Yelp;
While examples of implicitly given ratings are:
• Satisfaction with a streamed movie, inferred from whether the time the user viewed the video exceeds a threshold;
• Interest in an email campaign for products based on email click-through-rate (CTR);
• Repeated and related search queries for a topic;
The user-item rating matrix is a sparse matrix. The matrix is sparse since most users typically rate
only O(1) or O(10) items. For instance, the median number of ratings for a typical user in the data set
from the Netflix competition is 96 (average is 209), while the median number of likes for a Facebook
page from a prior study was 68 (average is 170) [16].
However, the user-item rating matrix is not uniformly sparse. Some users rate far more often than
others. A movie critic, for instance, can be expected to rate many more movies than the average
user. The top 10% of users in the Netflix data set rate over 500 movies. Similarly, some Facebook pages
will have many more likes than others. One illustration of this imbalance is the Hillary Clinton page,
which has approximately 1.6 million likes as of October 2015.
The user-item rating matrix is also partially observed. The feedback received from user i for item j
marks the (i, j) entry of the rating matrix as observed. These entries with explicit feedback form the
observed set of ratings:
\Omega = \{(i, j) \mid R_{i,j} \neq 0\}.
The complement of this set, the cells with zero entries, is properly modeled as missing values in
most contexts. These values are not missing at random, and this has been shown to generate a bias in the
observed ratings [23, 24]. The act of providing explicit feedback means that user has an interest in the
item, and this interest is generally more favourable than what would have been observed under random
sampling. Similarly, the exclusion of a rating does not mean the user has no interest in the item. In
many applications, a zero (i, j) entry simply means that user i is not aware of item j. The closely
related problem of recommending a set of items that is both of interest to the user and diverse in
coverage has also been studied [42].
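As a sketch, the observed set Ω can be extracted directly from a rating matrix stored with the zero-for-missing convention used above (the toy matrix below is purely illustrative):

```python
import numpy as np

# Toy 3-user x 4-item rating matrix; zeros denote unobserved entries.
R = np.array([
    [5, 0, 3, 0],
    [0, 4, 0, 0],
    [1, 0, 0, 2],
])

# Omega = {(i, j) : R[i, j] != 0}, the set of observed (user, item) pairs.
users, items = np.nonzero(R)
omega = list(zip(users.tolist(), items.tolist()))
print(omega)  # [(0, 0), (0, 2), (1, 1), (2, 0), (2, 3)]
```

In practice the rating matrix is far too sparse to store densely, and the same set of pairs would come from the coordinate lists of a sparse matrix format.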
Memory Based CF Systems
Neighbourhood methods are a very common memory based CF system, and are also some of the most
intuitive and well-known [18, 35]. A distance metric is defined for two users (alternatively, two items),
corresponding to two rows (alternatively, two columns) of the user-item rating matrix. When the distance
is defined between two users, they are known as user-user similarity systems. When the distance is defined
between two items, they are known as item-item similarity systems.
To make recommendations in a user-user system, the distance from the target user to all other users
is computed. Other users can be ranked based on distance, and the items these users rated are aggregated.
The items already considered by the target user are removed from the list, and the remaining items can
be ranked for recommendation purposes.
To make recommendations in an item-item system, the distance from items rated by the target user
is computed for other items. The unobserved items can be ranked based on distance to the observed
items, and presented to the user.
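The user-user procedure just described can be sketched as follows. This is a minimal illustration, not a production method: cosine similarity stands in for the generic distance metric, and the toy matrix and function name are hypothetical.

```python
import numpy as np

def recommend_user_user(R, target, top_k=2):
    """Rank unrated items for `target` by aggregating ratings of similar users.

    R: dense (users x items) rating matrix, 0 = unrated.
    Cosine similarity plays the role of the distance metric.
    """
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    X = R / np.maximum(norms, 1e-12)      # row-normalize each user's ratings
    sims = X @ X[target]                  # cosine similarity of every user to target
    sims[target] = 0.0                    # exclude the target itself
    scores = sims @ R                     # similarity-weighted aggregate of ratings
    scores[R[target] > 0] = -np.inf       # remove items the target already rated
    return np.argsort(-scores)[:top_k]    # highest-scoring unrated items first

R = np.array([
    [5, 4, 0, 0],
    [5, 4, 1, 0],
    [0, 0, 4, 5],
], dtype=float)

print(recommend_user_user(R, target=0))  # user 1 is most similar to user 0
```

User 0's recommendations are dominated by user 1, who shares both of user 0's ratings, rather than by user 2, who shares none.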
Any two given users typically have few ratings in common, since each user rates only a fraction of a
percent of the items on average. This makes it difficult to compute a reliable distance between two users.
Item-item based methods are also prone to this sparsity problem.
Memory-based systems include graph based recommendation algorithms. Common instances are
bipartite graphs of users and items, with directed edges from users to items if a user expressed a rating
for an item. Industry research has been published outlining the use of these methods for YouTube [2].
Memory-based CF systems are fast to train [35]. In particular, user-user and item-item similarity
methods can easily be distributed, allowing the model to expand and maintain fast computation time
by adding more processors. In addition, these memory-based systems are nonparametric, avoiding the
iterative inference typically required to fit a parametric model. However, the simplicity of these systems
tends to lead to larger errors than the more complex approaches discussed below.
Model Based CF Systems
One approach to CF systems is to assume a parametric model that can be learned using historical
data and used to predict future data. Model based systems include regression methods, belief nets,
Boltzmann-machines, LDA-style models, and factor analysis / matrix factorization models. These are
probabilistic models, often placed in a Bayesian framework. In this section, these models are briefly
reviewed. Being the focus of this work, a more detailed discussion of matrix factorization models is in
Chapter 2.
To make recommendations, model based CF systems are first trained to learn model parameters.
Once trained, the learned set of parameters can be used to make predictions of new ratings. Training
for these models is often relatively slow compared to memory based CF systems. In many cases, the
objective cannot be directly solved and iterative approximate inference methods must be used. Two
common methods in the literature are variational methods [22, 19] and Markov Chain Monte Carlo
[31, 10].
Gaussian Matrix Factorization
A common approach is to assume a low-rank approximation to the user-item matrix [7, 4, 1, 30]. The
N × M rating matrix R is modeled as the product of a K × N matrix of user features U and a K × M
matrix of item features V, such that R = U^T V. This approach has an interpretation in terms of latent
user and latent item features. Each user u_i is represented by a latent K-vector U_i, the ith column of
the matrix U. Each item v_j is represented by a latent K-vector V_j, and the rating given by user u_i to
item v_j is the inner product U_i^T V_j. The features U_i, V_j are modeled as samples from i.i.d. Gaussian
distributions:

U_i \sim \mathcal{N}(U_i \mid 0, \lambda_U I), \qquad V_j \sim \mathcal{N}(V_j \mid 0, \lambda_V I).
The unbounded, real-valued nature of the features implies that the predicted rating is also unbounded
and real-valued. This does not match the bounded, discrete ratings observed in practice, but it is often
a good approximation and provides sufficiently accurate results for industry applications.
The dimensionality of the feature vectors is limited by computing resources. In addition, the
performance of these PMF models trained using gradient descent is prone to overfitting as K increases,
which demands more tuning of the regularization parameters λ_U, λ_V. The Bayesian extension, discussed
further in Chapter 2, is not susceptible to overfitting as K increases [31].
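As a minimal sketch of how such a factorization can be fit by gradient descent on a regularized squared-error (MAP-style) objective — the regularization weight, step size, and synthetic data below are illustrative assumptions, not settings used in this thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 30, 40, 5
lam, lr = 0.05, 0.02                  # illustrative regularization weight and step size

# Synthetic sparsely observed ratings: R = U^T V with ~30% of entries kept.
U_true = rng.normal(size=(K, N))
V_true = rng.normal(size=(K, M))
mask = rng.random((N, M)) < 0.3
R = (U_true.T @ V_true) * mask

def rmse_observed(U, V):
    E = (U.T @ V - R) * mask
    return np.sqrt((E ** 2).sum() / mask.sum())

U = 0.1 * rng.normal(size=(K, N))
V = 0.1 * rng.normal(size=(K, M))
before = rmse_observed(U, V)
for _ in range(500):
    E = (U.T @ V - R) * mask          # residuals on observed entries only
    U -= lr * (V @ E.T + lam * U)     # gradient of squared error + Gaussian prior
    V -= lr * (U @ E + lam * V)
after = rmse_observed(U, V)
print(before > after)                 # training RMSE decreases
```

The mask restricts the loss to the observed set Ω; the λ terms are the log-prior contributions of the zero-mean Gaussian features.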
Hierarchical Poisson Factorization
Recent work has proposed a matrix factorization model where the entries of the rating matrix are
modeled by a Poisson distribution [10]. Similar to Gaussian matrix factorization, the mean of r_{i,j} in
Poisson factorization is modeled by the inner product of a latent user feature θ_i and a latent item feature
β_j. Independent Gamma priors are placed on each user component θ_{i,k} and each item component β_{j,k},
with hyper-priors placed on the rate parameters for each. This model is conjugate, and fast variational
inference methods have been derived.
This model has several properties that are desirable in comparison to the Gaussian matrix
factorization model. In brief, the Poisson model better captures sparsity and skewness. In real data
sets, some users / items are best represented by lower-dimensional features than others (corresponding
to fewer latent features for those users / items), and the Gamma priors encourage this sparsity. The
skewness in the number of ratings per user and per item is also modeled more readily by Poisson
matrix factorization, as verified by Bayesian posterior predictive checks [10].
Hierarchical Poisson Factorization has been extended to a nonparametric framework by allowing the
dimensionality K of the features θ_i and β_j to have arbitrary support [11]. This nonparametric model was
found to have comparable or superior recall and precision across multiple data sets including MovieLens
and Netflix.
(R)BM Variants
With the rise of interest in deep learning methods, Boltzmann-machine style models have been
increasingly used. One proposed model for the Netflix competition data set was based on Restricted
Boltzmann Machines (RBMs) [33]. RBMs are an undirected graphical model with a single layer of visible units and
a single layer of hidden units. There are connections between the hidden and visible layers, but no
connections between hidden units and no connections between visible units. In this approach, each user
in the system was modeled by an RBM, with a set of visible units for the movies the user rated. The
parameters for the connectivity weights and the biases were shared between the users.
Other work has looked at the use of general Boltzmann machines for binary rating prediction [12].
A Boltzmann machine (BM) consists of a single layer of fully connected hidden units. Similar to the
RBM models, each user was represented by a single BM, with the layer of hidden units representing the
set of movies to be recommended. For training purposes, ordinal five star ratings were dichotomized to
correspond to intuitive like / dislike. The weights of the BM were tied across users, and parametrized
by item content. It was shown that the parametrization by item content improved prediction accuracy,
again analogous to the low-rank factorization of the weight matrix in the RBM model referenced above
[33].
Ordinal Models
More complicated models have been developed that take into account the ordinal nature of the ratings.
These models generate a latent real-valued variable, which is then discretized into an ordinal ranking
based on a partition of the real line [29]. In particular, for R ranked values this model defines a partition
of the real line
-\infty = b_1 < b_2 < \cdots < b_{R+1} = +\infty,

with a real-valued prediction f_{i,j} for user i on item j corresponding to rank r_{i,j} = r if b_r \le f_{i,j} < b_{r+1}.
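This thresholding step can be sketched with NumPy's `digitize`; the cut points and predictions below are illustrative values for a 5-rank scale:

```python
import numpy as np

# Interior cut points b_2 .. b_R for R = 5 ranks; b_1 = -inf, b_{R+1} = +inf.
cuts = np.array([1.5, 2.5, 3.5, 4.5])

f = np.array([-0.3, 1.7, 2.5, 3.49, 9.0])   # real-valued predictions f_ij
ranks = np.digitize(f, cuts) + 1            # rank r such that b_r <= f < b_{r+1}
print(ranks)  # [1 2 3 3 5]
```

`digitize` counts how many cut points each prediction has passed, which is exactly the rank assignment once shifted to start at 1.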
Other work has looked at more general constraint-based models [38]. These models also rely on the
concept of having ordinal (or other) ratings generated from latent continuous random variables. These
models are more general than the ordinal matrix factorization models in that they allow for general
constraints on the observations, of which the ordinal inequalities are one. Other possible constraints
include censored and binary observations.
1.2.3 Evaluation Metrics
Earlier recommendation system work focused on using root mean square error (RMSE),
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{(i,j) \in I} \left(r_{i,j} - \hat{r}_{i,j}\right)^2} = \sqrt{\frac{1}{N} \sum_{(i,j) \in I} \left(r_{i,j} - U_i^\top V_j\right)^2},
where I is the set of (i, j) pairs on which the rating predictions are evaluated. The predominance
of this metric was motivated by its adoption as the evaluation criterion in the Netflix competition.
This metric is convenient as it is convex in U_i and in V_j, but often impractical. The drawbacks to this
metric include its ignorance of the ordinal nature of the ratings. Assume for the following that ratings
are given on a 5-star ordinal scale. In practice, it is more important to have high accuracy in predicting
the ratings for items rated as 5 than items rated as 1. In addition, it can be argued from a marketing
viewpoint that it is more costly to predict a true 1 as a 5 than a true 5 as a 1. The latter error hides
one of many desired items from a user, while the former presents an unwanted item to the user. Since
the number of items recommended to the user is small (typically five or fewer), the presentation of an
undesired item can taint the user experience.
Other work has evaluated using mean absolute error (MAE),
\mathrm{MAE} = \frac{1}{N} \sum_{(i,j) \in I} \left|r_{i,j} - \hat{r}_{i,j}\right|.
This metric is closely related to RMSE, except that it does not penalize extreme deviations in
prediction error as heavily.
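Both metrics can be computed directly from a vector of held-out ratings and the corresponding predictions; the values below are illustrative:

```python
import numpy as np

r_true = np.array([5.0, 3.0, 1.0, 4.0])   # observed ratings r_ij for (i, j) in I
r_pred = np.array([4.0, 3.0, 2.0, 2.0])   # model predictions

err = r_true - r_pred
rmse = np.sqrt(np.mean(err ** 2))          # quadratic penalty on deviations
mae = np.mean(np.abs(err))                 # linear penalty on deviations
print(rmse, mae)  # 1.224744871391589 1.0
```

The single 2-point error dominates RMSE but contributes only linearly to MAE, illustrating the difference in how the two metrics weight extreme deviations.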
1.3 Connection to Thesis Work
The contributions of this thesis are extensions to Gaussian Matrix Factorization. Specifically,
extensions of a vanilla matrix factorization model are reviewed, modifications are proposed to address
existing issues, and the resulting models are evaluated for test performance.
At a high level, these extensions incorporate additional information into the prior by modifying the
mean or variance / covariance of the latent (user) features. Specifically:
• Chapter 2 defines and extends Constrained Probabilistic Matrix Factorization to the Bayesian
framework. This extension introduces a new set of latent features, Wk, k = 1, . . . ,M , for the
items. In contrast to the explicitly provided ratings, these features model implicit information
from the user (ex: viewed item, downloaded app, streamed video for more than x minutes, etc.).
These features are then used to shift the expected rating, E[ri,j ];
• Chapter 3 reviews existing work on heteroskedastic, or nonconstant, precision models. Overfitting
issues under variational inference are outlined, and a truncated precision model is proposed to
alleviate this. The performance of the truncated model is outlined, and the model is theoretically
connected to existing work through limiting cases;
• Chapter 4 outlines a model that includes user demographics, or meta features, into the PMF
framework. Similar to Chapter 2, this extension shifts the prior mean of a user feature by an
average of new latent features, each one tied to a demographic. An additional extension using
principal components is outlined and demonstrated to improve prediction;
• Chapter 5 introduces the concept of networks among users. These networks are observed in many
different cases (ex: friendship networks, trust networks), and have been shown to improve the
prediction. Limitations with existing approaches are outlined, and a new model is proposed that
overcomes these issues. The proposed model is demonstrated in the case of fully observed and
partially observed networks. Future work is outlined with respect to the performance of this
model under completely random networks.
The experiments in this thesis were frequently attempted with multiple choices of the rank d. The reported results use two specific choices of d: for Chapters 2-3, d = 20 dimensional features were used, and for Chapters 4-5, d = 10 dimensional features were used. Both of these values are commonly used in the literature [31, 32, 21, 40].
By leaving the extensions (mostly) independent of the specific distributional assumptions placed on
the latent features, the extensions can be adapted to other models. For instance, the user features in a
hierarchical Poisson Factorization model can be shifted in the same manner that user features for the
Gaussian models considered in this thesis are shifted.
1.4 Data Sets Used in this Thesis
This thesis makes use of multiple data sets for experimental results for various models.
The MovieLens 1M data set consists of 1,000,209 (user, item, rating) triplets for 6,040 users on 3,900 movies provided on the MovieLens service in 2000 [14]. The user set is heavily censored, as each user has at least 20 ratings in the data. Timestamps are present in the original data set, but are not used in this thesis. Auxiliary information on both the users and items is available. In particular, for each user, we are given: gender, age (discretized into six bins), occupation (coded into 21 categories), and zip code.
The Epinions data set was gathered over a five week crawl of the Epinions site, a product review site, in 2003. There are 49,290 users and 139,738 items in the data set. In addition, trust statements are given for 487,181 pairs of users [26, 25]. On this site, users can indicate other users whose reviews they trust. These form the trust statements, a directed network.
Flixster is a social networking site for movie fans. The Flixster data set [15] consists of 6,160,927 ratings for 109,218 users on 42,173 items. Users are able to invite other users onto the site and form "friendships". The data set also contains an undirected social network among users consisting of 1,347,222 links among pairs of users, as well as limited demographic information for a user (gender, location, and age).
Some of the demographic information is missing or nonsensical.
Ciao (ciao.co.uk) is another product review site where users can form similar trust statements. The
Ciao data set [36, 37] consists of 284,086 ratings from 7,375 users on 106,797 items. For each (user, item,
rating) triplet, there is additionally a category for the product, and a helpfulness score for the rating.
The ratings data is accompanied by a directed network among users with 111,781 edges between users.
FilmTrust is a movie rating / review service. Users in this service can add "friends" and must indicate trust values for their "friends". The FilmTrust data set [13, 9] was generated from a site crawl in June 2011. There are 35,497 ratings from 1,508 users for 2,071 items. This is the smallest of the data sets we consider. There is additionally a directed network among users with a total of 1,853 edges.
Chapter 2
Matrix Factorization
Factor based models have been used extensively in the collaborative filtering domain for preference
matching between two sets of objects. In the user recommendation framework, users are one set of
objects, and the other set is some generic collection of items. Common industry applications are videos
(ex: YouTube, Netflix), products (ex: Amazon), mobile apps (ex: Google Play), or other users (ex:
Facebook, LinkedIn, OkCupid). Factor based models are content-less: they require no content extraction
or feature generation from the users and items. The content-less nature of factor based models allows
these models to easily adapt from one item domain to another (videos to music), or even to domains
with multiple item contexts (ex: generic product recommendation). Instead, these models assume there
are a small number of unobserved latent features associated with each user and item that determine
preferences.
Matrix factorization models are a common factor based model. Given N users, M items, and a (user,
item) matrix of preferences R ∈ RN×M , these models approximate R as the product of two low rank
matrices such that R ≈ U>V , where U ∈ Rd×N , V ∈ Rd×M . Each column in U is the d-dimensional
latent feature of a user and each column in V the d-dimensional latent feature of an item. The entry r_{i,j} can be reconstructed as the inner product U_i^\top V_j, where U_i is the i-th column of U and V_j is the j-th column of V. The problem of estimating U and V can be approached as an incomplete SVD problem:
find the best approximation to the partially observed matrix R given some loss function.
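The low-rank reconstruction described above can be sketched numerically. The following is an illustrative NumPy snippet; the matrix sizes and the random features are assumptions for demonstration only:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, M = 3, 4, 5            # illustrative sizes; real systems use d around 10-20

U = rng.normal(size=(d, N))  # column i is the latent feature of user i
V = rng.normal(size=(d, M))  # column j is the latent feature of item j

R_hat = U.T @ V              # full N x M matrix of reconstructed ratings

# A single entry is the inner product of the corresponding columns.
i, j = 1, 2
assert np.isclose(R_hat[i, j], U[:, i] @ V[:, j])

# The reconstruction has rank at most d.
assert np.linalg.matrix_rank(R_hat) <= d
```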
A common probabilistic framework for matrix factorization models is to assume that each user feature
Ui and each item feature Vj are independent samples from some probability distribution, and that the
rating ri,j has a distribution given Ui, Vj , and possibly other parameters. The goal is then to make
inference of the user features Ui and item features Vj in order to make predictions of the ratings /
preferences ri,j . In the Bayesian framework, this inferential problem equates to modeling the posterior
of the user and item features given the observed ratings.
There are two common inferential approaches in the machine learning literature for this problem.
The first is grounded in Monte Carlo methods, and the second is grounded in variational inference.
Monte Carlo methods approximate the true posterior. They are often criticized for slow convergence
and computational complexity. Conversely, variational methods provide exact results to an approximate
problem. They often are favoured for computational simplicity, though the approximations employed
typically rely on the strong assumption of posterior independence among the user and item features. To
our knowledge, work in the literature commonly reports experimental results using only one of Gibbs
sampling or variational inference. Further, these results are often reported only for the overall training and test sets. There is rarely a discussion of the performance with respect to users with different numbers of ratings in the system. Such a distinction is important, as inference, and hence prediction, for a user depends on the number of ratings in the system for that user.
This chapter contains the first direct comparison of Gibbs sampling and variational inference in the
matrix factorization context. We report our comparative results using overall test error metrics, and
also broken down for subsets of users defined by different frequency in the training set. Our work has
the following contributions:
1. We provide a direct comparison of Gibbs sampling and variational inference for multiple matrix
factorization problems in the recommendation framework;
2. We extend the previous work on constrained probabilistic matrix factorization to the Gibbs framework, highlighting the gains made by Gibbs sampling over the MAP estimate.
2.1 Probabilistic Matrix Factorization
Probabilistic Matrix Factorization (PMF) is a probabilistic linear model that models r_{i,j} as a Gaussian with mean U_i^\top V_j. In the vanilla PMF model, the baseline for much of the work in this thesis, the precision (inverse variance) of this Gaussian is a constant \tau. Each column of U and V, corresponding to each user and item, has an independent Gaussian prior placed on it. The conditional distribution over the observed ratings R and the prior distributions over U and V are given by
(ri,j |Ui, Vj) ∼[N (ri,j | U>i Vj , τ)
]Ii,j(Ui|λU ) ∼N (Ui | 0, λUI)
(Vj |λV ) ∼N (Vj | 0, λV I),
(2.1)
where \mathcal{N}(x \mid \mu, \tau) denotes the univariate Gaussian distribution for x with mean \mu and precision \tau, and I_{i,j} \in \{0, 1\} is the indicator that user i provided a rating for item j. Further, \mathcal{N}(x \mid \mu, \Lambda) denotes the multivariate Gaussian distribution with mean vector \mu and precision matrix \Lambda.
In practice, it is important to model the bias of each user and each item. Let γi denote the bias for
user i, and ηj the bias for item j. The mean of the predicted rating in Equation (2.1) should be properly
modeled by
E[r_{i,j} \mid \gamma_i, \eta_j, U_i, V_j] = \gamma_i + \eta_j + U_i^\top V_j.    (2.2)
In the probabilistic framework, both γi and ηj can be modeled as univariate Gaussians
(\gamma_i \mid \lambda_\gamma) \sim \mathcal{N}(\gamma_i \mid 0, \lambda_\gamma)
(\eta_j \mid \lambda_\eta) \sim \mathcal{N}(\eta_j \mid 0, \lambda_\eta).    (2.3)
Inference for this model is performed by maximizing the log-posterior over the latent features and
biases with fixed hyperparameters,
\log p(U, V, \gamma, \eta \mid R, \tau, \lambda_U, \lambda_V, \lambda_\gamma, \lambda_\eta)
= \sum_{i=1}^{N} \sum_{j=1}^{M} I_{i,j} \log p(r_{i,j} \mid U_i, V_j, \tau) + \sum_{i=1}^{N} \log p(U_i \mid \lambda_U) + \sum_{j=1}^{M} \log p(V_j \mid \lambda_V)
+ \sum_{i=1}^{N} \log p(\gamma_i \mid \lambda_\gamma) + \sum_{j=1}^{M} \log p(\eta_j \mid \lambda_\eta).    (2.4)
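The objective in Equation (2.4) is straightforward to evaluate. Below is a hedged NumPy sketch of the unnormalized log-posterior; the function name, the dense-array representation of R and the indicator matrix, and the dropping of normalizing constants are assumptions for illustration (the \lambda's are treated as precisions, consistent with the precision parametrization of Equation (2.1)):

```python
import numpy as np

def log_posterior(R, I_obs, U, V, gamma, eta, tau, lam_U, lam_V, lam_g, lam_e):
    """Unnormalized log-posterior of Equation (2.4), constants dropped.

    R, I_obs are N x M arrays; U is d x N, V is d x M; gamma, eta are bias vectors.
    Each Gaussian term contributes -0.5 * precision * squared-error.
    """
    resid = R - (gamma[:, None] + eta[None, :] + U.T @ V)
    loglik = -0.5 * tau * np.sum(I_obs * resid ** 2)
    logprior = (-0.5 * lam_U * np.sum(U ** 2)
                - 0.5 * lam_V * np.sum(V ** 2)
                - 0.5 * lam_g * np.sum(gamma ** 2)
                - 0.5 * lam_e * np.sum(eta ** 2))
    return loglik + logprior

# Tiny sanity check: a perfect fit with zero-valued parameters gives 0.
val = log_posterior(np.zeros((2, 2)), np.ones((2, 2)),
                    np.zeros((1, 2)), np.zeros((1, 2)),
                    np.zeros(2), np.zeros(2),
                    tau=2.0, lam_U=1.0, lam_V=1.0, lam_g=1.0, lam_e=1.0)
```

MAP inference would maximize this quantity over U, V, \gamma, and \eta, for example by gradient ascent or alternating least squares.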
2.2 Bayesian Probabilistic Matrix Factorization
Bayesian PMF extends the PMF model of Section 2.1 to the Bayesian framework. The likelihood of the observed rating data is the same as in the PMF model, outlined in Equation 2.1. The Bayesian extension places independent Gaussian priors on the latent features U and V, each with unknown means \mu_U, \mu_V and precisions \Lambda_U, \Lambda_V,

(U \mid \mu_U, \Lambda_U) \sim \prod_{i=1}^{N} \mathcal{N}(U_i \mid \mu_U, \Lambda_U)
(V \mid \mu_V, \Lambda_V) \sim \prod_{j=1}^{M} \mathcal{N}(V_j \mid \mu_V, \Lambda_V).    (2.5)
This is in contrast to the PMF model, where the features are mean zero, with spherical precisions.
This eliminates the need for the tuning parameters λU , λV , which have been shown to need careful tuning
to avoid overfitting [32].
If the biases are to be modeled, independent Gaussian priors for the γi and ηj can be included,
(\gamma \mid \mu_\gamma, \lambda_\gamma) \sim \prod_{i=1}^{N} \mathcal{N}(\gamma_i \mid \mu_\gamma, \lambda_\gamma)
(\eta \mid \mu_\eta, \lambda_\eta) \sim \prod_{j=1}^{M} \mathcal{N}(\eta_j \mid \mu_\eta, \lambda_\eta).    (2.6)
This Bayesian extension further places Gaussian-Wishart priors on the latent feature hyperparameters \{\mu_U, \Lambda_U\} and \{\mu_V, \Lambda_V\},

(\mu_U, \Lambda_U) \sim \mathcal{N}(\mu_U \mid \mu_0, \beta_0 \Lambda_U) \cdot \mathcal{W}(\Lambda_U \mid W_0, \nu_0)
(\mu_V, \Lambda_V) \sim \mathcal{N}(\mu_V \mid \mu_0, \beta_0 \Lambda_V) \cdot \mathcal{W}(\Lambda_V \mid W_0, \nu_0),    (2.7)
where Λ ∼ W(Λ|W0, ν0) denotes a random variable Λ drawn from a Wishart distribution with ν0
degrees of freedom and scale matrix W0.
The graphical model for Bayesian PMF is in Figure 2.1.
2.3 Constrained Bayesian PMF
Learning the model of Section 2.2 improves the recommendation process over the baseline of predicting at random. However, the improvement is not uniform over users. The learned model has the most benefit over predicting at random for users with many ratings in the system. Users with few ratings
Figure 2.1: Bayesian Constrained Probabilistic Matrix Factorization with Gaussian-Wishart priors over the latent user, item, and side feature vectors.
in the system have little contribution to the features and biases from the likelihood, and so learning will
be dominated by the prior.
This notion can be formalized mathematically by considering the posterior expectation of a rating r_{i,j}, denoted E_{\text{posterior}}[r_{i,j}]. If there is little influence from the data, then the expected value of the features and biases under the posterior, E_{\text{posterior}}, will be close to the expected value under the prior, E_{\text{prior}}. Propagating this to the expected rating, we have
E_{\text{posterior}}[r_{i,j}] = E_{\text{posterior}}[\gamma_i + \eta_j + U_i^\top V_j]
= E_{\text{posterior}}[\gamma_i] + E_{\text{posterior}}[\eta_j] + E_{\text{posterior}}[U_i^\top V_j]
\approx E_{\text{prior}}[\gamma_i] + E_{\text{posterior}}[\eta_j] + E_{\text{prior}}[U_i]^\top E_{\text{posterior}}[V_j]
= \mu_\gamma + E_{\text{posterior}}[\eta_j] + \mu_U^\top E_{\text{posterior}}[V_j].    (2.8)
The first equality is by definition, the second follows from linearity, and the subsequent approximation reflects the weak effect of the likelihood. The final equality follows from the definition of the Bayesian model, and highlights a problem for rare and cold start users. The first term is pure regularization from the prior, and the third term is multiplicative in this prior regularization. Only the second term contains signal from the likelihood, and it is driven purely by the item. This means that predictions for cold start and rare users are not personalized: these recommendations are driven by global item information.
One way of constraining the predictions for these users is to model implicit feedback from the users. In contrast to explicitly provided feedback (i.e., users rating items on a numeric scale), implicit feedback can take many forms. For instance:
• Repeatedly viewing a product page
• Downloading an app
• Streaming a video for more than a given amount of time
In each of these cases, the action taken by the user is implicit feedback of interest [32]. To model
this feedback, let W ∈ Rd×M be a set of such constraint features, or side features. Denote each column
of the matrix as Wk ∈ Rd. Further, introduce the indicator matrix I
I = \begin{pmatrix} I_{1,1} & I_{1,2} & \cdots & I_{1,M} \\ I_{2,1} & I_{2,2} & \cdots & I_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ I_{N,1} & I_{N,2} & \cdots & I_{N,M} \end{pmatrix},    (2.9)

where I_{i,j} = 1 if user i has provided implicit feedback on item j, and 0 otherwise.
With this in place, define the shifted user feature as
S_i = \delta_U U_i + \delta_W \frac{\sum_{k=1}^{M} I_{i,k} W_k}{\sum_{\ell=1}^{M} I_{i,\ell}}.    (2.10)
We explicitly include indicator functions \delta_U, \delta_W \in \{0, 1\} to ease the interpretation of the derivations that follow. When \delta_W = 0, the model includes only the user specific latent features U_i and reduces to the Bayesian PMF of Section 2.2. The case \delta_U = 0 is an interesting case where there are no user features U_i, only the shared side features W_k.
The normalization of the second term in Equation (2.10) by \sum_{\ell=1}^{M} I_{i,\ell} ensures that the magnitude of the shift is independent of the number of items with feedback. For notational convenience, we let n_i = \sum_{\ell=1}^{M} I_{i,\ell} denote the number of items with feedback from user i.
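Equation (2.10) amounts to shifting the user feature by the average of the side features of the items with feedback. A minimal NumPy sketch follows; the function name and array layout are illustrative assumptions:

```python
import numpy as np

def shifted_user_feature(U_i, W, feedback_i, delta_U=1.0, delta_W=1.0):
    """Equation (2.10): shift user feature U_i by the normalized sum of side
    features W_k over items with implicit feedback.

    `W` is d x M; `feedback_i` is the length-M 0/1 indicator row for user i.
    """
    n_i = feedback_i.sum()
    if n_i == 0:                       # no implicit feedback: no shift
        return delta_U * U_i
    return delta_U * U_i + delta_W * (W @ feedback_i) / n_i

d, M = 2, 4
U_i = np.array([1.0, 0.0])
W = np.ones((d, M))
feedback = np.array([1.0, 1.0, 0.0, 0.0])  # feedback on two items

# The shift is the average of the two active (all-ones) side features,
# so S_i = U_i + [1, 1] = [2, 1].
S_i = shifted_user_feature(U_i, W, feedback)
```

Note that dividing by n_i keeps the shift bounded regardless of how many items receive feedback, matching the normalization argument above.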
The intuition behind Equation (2.10) is related to the concept of taste similarity [15]. Intuitively,
Wk captures the effect on the prior mean from a user having rated (or simply viewed) a particular
item. Therefore, users with similar rating (or viewing) habits will have similar prior distributions for
the feature vectors. This is formalized in the derivation of the posterior mean for Ui in Appendix B.
In the probabilistic framework, the Wk are regularized by the same spherical zero mean Gaussian
prior as the other latent features
(W_k \mid \lambda_W) \sim \mathcal{N}(W_k \mid 0, \lambda_W I).    (2.11)
In the Bayesian framework, we place a full Gaussian prior over each Wk in a manner analogous to
the other latent features
(W \mid \mu_W, \Lambda_W) \sim \prod_{k=1}^{M} \mathcal{N}(W_k \mid \mu_W, \Lambda_W).    (2.12)
The Bayesian extension hierarchically places a Gaussian-Wishart prior on the latent feature hyperparameters \{\mu_W, \Lambda_W\} in a manner analogous to the other latent features

(\mu_W, \Lambda_W) \sim \mathcal{N}(\mu_W \mid \mu_0, \beta_0 \Lambda_W) \cdot \mathcal{W}(\Lambda_W \mid W_0, \nu_0).    (2.13)
As previously presented in the literature, constrained PMF shifts the user features. The construction can be made abstract. The W_k introduce a new set of features W_k, k = 1, \ldots, M, for each column of
the rating matrix R. The prediction for the (i, j) entry of the rating matrix is
E[r_{i,j} \mid U_i, V_j, W_{1:M}] = E\left[ \gamma_i + \eta_j + \left( U_i + \frac{1}{n_i} \sum_{k=1}^{M} I_{i,k} W_k \right)^\top V_j \right]
= E[\gamma_i] + E[\eta_j] + E[U_i^\top V_j] + \frac{1}{n_i} \sum_{k=1}^{M} I_{i,k} E[W_k^\top V_j]
= E_{\text{vanilla}}[r_{i,j}] + \frac{1}{n_i} \sum_{k=1}^{M} I_{i,k} E[W_k^\top V_j],    (2.14)
where E_{\text{vanilla}}[r_{i,j}] is the expectation of r_{i,j} under the "vanilla" matrix factorization model with only user and item (abstractly, row and column) latent features. Notice that this shifts the prediction of the baseline model for a given user i by the average of an inner product between the column feature V_j and the set of side features associated with other columns also rated (viewed) by that user.
This abstraction not only highlights the symmetric nature of these latent side features, it also highlights how the expectations are shifted relative to the baseline. The most reliable inference on predicted ratings is obtained when the additional side features W_k are associated with the dimension of the matrix with less sparsity (row-wise or column-wise).
Therefore, a similar constraint can be placed on the items. This can prove advantageous if the
recommendation system tends to have more sparsity across columns than across rows. By symmetry
of the model, this is as simple as transposing the user / item matrix. The model without side features
is invariant to this transposition. For the model with side features, the transposition modifies the
expectation of the rating to be
E[r_{i,j} \mid U_i, V_j, W_{1:N}] = E\left[ \gamma_i + \eta_j + U_i^\top \left( \frac{1}{m_j} \sum_{\ell=1}^{N} I_{\ell,j} W_\ell + V_j \right) \right]
= E[\gamma_i] + E[\eta_j] + \frac{1}{m_j} \sum_{\ell=1}^{N} I_{\ell,j} E[U_i^\top W_\ell] + E[U_i^\top V_j]
= E_{\text{vanilla}}[r_{i,j}] + \frac{1}{m_j} \sum_{\ell=1}^{N} I_{\ell,j} E[U_i^\top W_\ell],    (2.15)

where we have defined m_j = \sum_{\ell=1}^{N} I_{\ell,j} as the number of users who provided feedback for item j.
When the side features Wk offset the user features Ui, the contribution from each user to the inner
product is affected. The practical impact of this is an improvement in prediction for users with few ratings
in the system. Similarly, when the side features Wk offset the item features Vj as in Equation (2.15),
the contribution from each item to the inner product is affected, improving prediction for items that
are rare in the system. Therefore, overall performance will be improved when the side features Wk are
associated with the dimensionality of the matrix that is typically more sparse.
2.4 Inference
The predictive distribution for the ratings ri,j is obtained by integrating out the features, the hyperpa-
rameters, and any other variables included in the model. This integral is computationally intractable,
requiring the use of approximate methods. There are two general approaches in the literature, corresponding to two different approximations.
The first is to use Monte Carlo inference to obtain an approximation to the true posterior distribution.
This approach is common in the statistical communities. This can require the choice of a proposal
distribution for sampling, and the computation of acceptance / rejection rates. Such an acceptance /
rejection scheme can be problematic, requiring tuning of the proposal distribution in order to obtain a
reasonable acceptance rate. However, it is often the case that the prior distributions can be chosen such
that the posterior sampling distributions are available in closed form, allowing for Gibbs sampling. This
is the case with the models discussed in this chapter, the vanilla PMF model [30, 31] and the constrained
PMF extension.
The second is to use variational inference to obtain exact inference on an approximation to the true
posterior distribution [39]. This requires the choice of a set of independence assumptions between the
variables. Once the independence assumptions are made, the exact distributional form of the variational
approximation is determined by minimizing the Kullback-Leibler (KL) divergence between the true
posterior and the variational approximation. These have been used for other model variations [17].
The two methods are similar in that they both make inference on a distribution. They are different
in that MCMC methods attempt to approximate the true posterior, while variational methods provide
exact results on the “best approximation” (measured by KL divergence) to the true posterior given
certain independence assumptions. In the models we consider, the set of independence assumptions
common in the literature lead to the same distributional form as the Gibbs updates, and so the only
practical difference is the prediction method. In MCMC (and specifically, in Gibbs), the prediction over
samples of the features is averaged and used for inference. In variational methods, the mean of the
distribution after each update is used for prediction. This is explained further in Section 2.5.
For our experimental results, we analyze the performance of both inference methods for the models
considered.
2.4.1 Gibbs Sampling
The prior distributions for the latent features are conjugate to the likelihood, yielding tractable Gibbs sampling distributions that are easy to sample from. In particular, the conditional posteriors for all latent features discussed are multivariate Gaussian distributions.
The posterior for the user feature U_i is a multivariate Gaussian,

(U_i \mid R, V, W, \mu_U, \Lambda_U) \sim \mathcal{N}(U_i \mid \mu_{U_i}, \Lambda_{U_i}),
where \mu_{U_i} = \Lambda_{U_i}^{-1} \left[ \Lambda_U \mu_U + \delta_U \tau \sum_{j=1}^{M} I_{i,j} V_j \left( r_{i,j} - \frac{\delta_W}{n_i} V_j^\top \left( \sum_{k=1}^{M} I_{i,k} W_k \right) \right) \right]
\Lambda_{U_i} = \Lambda_U + \delta_U \tau \sum_{j=1}^{M} I_{i,j} V_j V_j^\top.    (2.16)
The equivalent form for the model without side features is obtained by excluding the term with the
side features Wk, and has been published previously in the literature [31].
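One full conditional draw of Equation (2.16) can be sketched as follows. This is an illustrative NumPy implementation (the function signature, the dictionary of a user's observed ratings, and the Cholesky-based sampling step are assumptions; \delta_U = \delta_W = 1, and biases are omitted, as in the equation above):

```python
import numpy as np

def sample_user_feature(rng, ratings_i, V, mu_U, Lam_U, tau, shift_i=None):
    """Draw U_i from its conditional posterior, Equation (2.16).

    `ratings_i` maps item index j -> r_{i,j}; `shift_i` is the optional
    side-feature shift (1/n_i) * sum_k I_{i,k} W_k.
    """
    d = V.shape[0]
    Lam = Lam_U.copy()                 # posterior precision accumulator
    b = Lam_U @ mu_U                   # unscaled posterior mean accumulator
    for j, r in ratings_i.items():
        v = V[:, j]
        resid = r if shift_i is None else r - v @ shift_i
        Lam += tau * np.outer(v, v)
        b += tau * resid * v
    mu = np.linalg.solve(Lam, b)
    # Sample from N(mu, Lam^{-1}) via the Cholesky factor of the precision:
    # if Lam = L L^T, then mu + L^{-T} z has covariance Lam^{-1}.
    L = np.linalg.cholesky(Lam)
    return mu + np.linalg.solve(L.T, rng.standard_normal(d))

rng = np.random.default_rng(0)
d = 2
V = rng.standard_normal((d, 5))
sample = sample_user_feature(rng, {0: 4.0, 3: 2.0}, V, np.zeros(d), np.eye(d), tau=2.0)
```

With no observed ratings, the draw collapses to a sample from the prior N(\mu_U, \Lambda_U^{-1}), which is the behaviour Equation (2.8) describes for cold start users.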
The posterior for the item feature Vj is also a multivariate Gaussian. It is obtained by setting Ui in
the posterior of the item features in the vanilla BPMF model equal to Si in Equation 2.10 [31]
(V_j \mid R, U, W, \mu_V, \Lambda_V) \sim \mathcal{N}(V_j \mid \mu_{V_j}, \Lambda_{V_j}),
where \mu_{V_j} = \Lambda_{V_j}^{-1} \left[ \Lambda_V \mu_V + \tau \sum_{i=1}^{N} I_{i,j} S_i r_{i,j} \right]
\Lambda_{V_j} = \Lambda_V + \tau \sum_{i=1}^{N} I_{i,j} S_i S_i^\top.    (2.17)
To our knowledge, the posterior for the side features Wk in the Bayesian extension has not previously
been presented in the literature. It is also a multivariate Gaussian. For compactness, first define the
prediction made without Wm as
\hat{r}_{i,j,-W_m} = \left( \delta_U U_i + \frac{\delta_W}{n_i} \sum_{k \neq m} I_{i,k} W_k \right)^\top V_j.    (2.18)
With this definition, the posterior sampling distribution is a multivariate Gaussian, parametrized by

W_m \sim \mathcal{N}(W_m \mid \mu_{W_m}, \Lambda_{W_m}),
where \Lambda_{W_m} = \Lambda_W + \delta_W^2 \tau \sum_{i=1}^{N} \sum_{j=1}^{M} \frac{I_{i,j} I_{i,m}}{n_i^2} V_j V_j^\top
\mu_{W_m} = \Lambda_{W_m}^{-1} \left[ \Lambda_W \mu_W + \delta_W \tau \sum_{(i,j):\, I_{i,j} I_{i,m} = 1} \frac{1}{n_i} V_j \left( r_{i,j} - \hat{r}_{i,j,-W_m} \right) \right].    (2.19)
Similar expressions are obtained if the side features Wk are associated with rows of the rating matrix
by exchanging the notation Ui and Vj in the above expressions.
2.4.2 Variational Inference
As discussed in Section 2.4, variational algorithms start with a set of independence assumptions on the
parameters of interest. This set of independence assumptions defines a variational approximation Q that
we perform inference on. Optimizing Kullback-Leibler divergence between the true posterior and the
variational approximation defines the form of the distribution Q.
Given data D and a set of parameters θ, Kullback-Leibler (KL) divergence between the variational
approximation Q and the true distribution p is defined by,
KL(Q \,\|\, p) = \int Q(\theta) \log \frac{Q(\theta)}{p(\theta \mid D)} \, d\theta.    (2.20)
This can be re-expressed as
KL(Q \,\|\, p) = \int Q(\theta) \log \frac{Q(\theta)}{p(\theta \mid D)} \, d\theta
= \int Q(\theta) \log \frac{Q(\theta)}{p(\theta, D)} \, d\theta + \log p(D).    (2.21)
The second term on the right-hand side, \log p(D), is independent of Q and of \theta. Rearranging, it can be expressed as
\log p(D) = KL(Q \,\|\, p) - \int Q(\theta) \log \frac{Q(\theta)}{p(\theta, D)} \, d\theta
= KL(Q \,\|\, p) + \int Q(\theta) \log \frac{p(\theta, D)}{Q(\theta)} \, d\theta
= KL(Q \,\|\, p) + \int Q(\theta) \log p(\theta, D) \, d\theta - \int Q(\theta) \log Q(\theta) \, d\theta
= KL(Q \,\|\, p) + E_Q[\log p(\theta, D)] + H(Q),    (2.22)
where E_Q denotes the expectation under the variational approximation Q, and H(Q) = -\int Q(\theta) \log Q(\theta) \, d\theta denotes the entropy of the distribution.
Equation (2.22) expresses the fixed quantity log p(D) as the sum of the KL divergence, the expected
complete log-likelihood, and the entropy. Minimizing KL is therefore equivalent to jointly maximizing
the second and third terms. The sum of these two terms is known as the variational lower bound.
We use a structured mean field approximation for the distribution Q,
Q(\gamma_{1:N}, U_{1:N}, \eta_{1:M}, V_{1:M}, W_{1:M}, \tau, \mu_U, \Lambda_U, \mu_V, \Lambda_V, \mu_W, \Lambda_W \mid D)
= Q(\tau) \prod_{i=1}^{N} Q(\gamma_i) Q(U_i) \prod_{j=1}^{M} Q(\eta_j) Q(V_j) \prod_{k=1}^{M} Q(W_k)
\times Q(\mu_U, \Lambda_U) Q(\mu_V, \Lambda_V) Q(\mu_W, \Lambda_W).    (2.23)
This approximation assumes pairwise independence between all the latent features Ui, Vj ,Wk, while
allowing for structure in the latent feature hyperparameters. Such mean field approximations have been
used in the literature in comparable research [17].
Under this mean field approximation, the entropy term of the variational lower bound factorizes,

E_Q[\log p(\theta, D)] + H(Q) = E_Q[\log p(\theta, D)] + \sum_{\ell} H(Q_\ell),    (2.24)

where \ell indexes each factor in the product of Equation (2.23).
Optimizing with respect to Q yields the distributional form of the variational approximation Q. We
refer to Appendix C for the full derivation, and summarize the results here.
The distributional form for Q(\tau) is Gamma,

Q(\tau) = \mathcal{G}(\tau \mid \hat{a}_\tau, \hat{b}_\tau),
where \hat{a}_\tau = a_\tau + \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} I_{i,j}
\hat{b}_\tau = b_\tau + \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} I_{i,j} \left( r_{i,j} - \hat{r}_{i,j} \right)^2.    (2.25)
The distributional form for Q(Ui) is multivariate Gaussian,
Q(U_i) = \mathcal{N}(U_i \mid \mu_{U_i}, \Lambda_{U_i}),
where \mu_{U_i} = \Lambda_{U_i}^{-1} \left[ \Lambda_U \mu_U + \delta_U \tau \sum_{j=1}^{M} I_{i,j} V_j \left( r_{i,j} - \frac{\delta_W}{n_i} V_j^\top \left( \sum_{k=1}^{M} I_{i,k} W_k \right) \right) \right]
\Lambda_{U_i} = \Lambda_U + \delta_U \tau \sum_{j=1}^{M} I_{i,j} V_j V_j^\top.    (2.26)
The distributional form for Q(Vj) is multivariate Gaussian,
Q(Vj) =N (Vj | µVj ,ΛVj )
where µVj =Λ−1Vj
[ΛV µV + τ
N∑i=1
Ii,j(ri,j − (δUUi +δWni
M∑k=1
Ii,kWk)>Vj)
]
ΛVj =ΛV + τ
N∑i=1
Ii,j
(δUUi +
δWni
M∑k=1
Ii,kWk
)(δUUi +
δWni
M∑k=1
Ii,kWk
)>.
(2.27)
The distributional form for Q(Wm) is multivariate Gaussian,
Q(W_m) = \mathcal{N}(W_m \mid \mu_{W_m}, \Lambda_{W_m}),
where \mu_{W_m} = \Lambda_{W_m}^{-1} \left[ \Lambda_W \mu_W + \delta_W \tau \sum_{(i,j):\, I_{i,j} I_{i,m} = 1} \frac{1}{n_i} V_j \left( r_{i,j} - \hat{r}_{i,j,-W_m} \right) \right]
\Lambda_{W_m} = \Lambda_W + \delta_W \tau \sum_{(i,j):\, I_{i,j} I_{i,m} = 1} \frac{1}{n_i^2} V_j V_j^\top,    (2.28)

where \hat{r}_{i,j,-W_m} was defined in Equation (2.18).
Optimization is symmetric with respect to the three sets of feature hyperparameters. Given this, we
only give the explicit derivation for the user hyperparameters. The distributional form for Q(µU ,ΛU ) is
a Gaussian-Wishart
Q(\mu_U, \Lambda_U) = \mathcal{N}(\mu_U \mid \hat{\mu}_U, \beta_N \Lambda_U) \cdot \mathcal{W}(\Lambda_U \mid \hat{\nu}_U, \hat{W}_U),
where \bar{U} = \frac{1}{N} \sum_{i=1}^{N} U_i
\hat{\mu}_U = \frac{N \bar{U} + \beta_0 \mu_0}{N + \beta_0}
\beta_N = \beta_0 + N
\hat{\nu}_U = N + \nu_0
\hat{W}_U^{-1} = W_0^{-1} + \frac{N \beta_0}{N + \beta_0} (\bar{U} - \mu_0)(\bar{U} - \mu_0)^\top + \sum_{i=1}^{N} (U_i - \bar{U})(U_i - \bar{U})^\top.    (2.29)
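The updates in Equation (2.29) are simple moment computations on the current features. A hedged NumPy sketch follows; the function name and the dictionary return type are illustrative assumptions:

```python
import numpy as np

def gaussian_wishart_update(U, mu_0, beta_0, nu_0, W_0):
    """Posterior Gaussian-Wishart parameters for (mu_U, Lambda_U), Equation (2.29).

    `U` is the d x N matrix of user features; the remaining arguments are the
    prior hyperparameters (mu_0, beta_0, nu_0, W_0).
    """
    d, N = U.shape
    U_bar = U.mean(axis=1)
    centered = U - U_bar[:, None]
    S = centered @ centered.T                       # scatter about the mean
    diff = U_bar - mu_0
    W_inv = np.linalg.inv(W_0) + S + (N * beta_0 / (N + beta_0)) * np.outer(diff, diff)
    return {
        "mu": (N * U_bar + beta_0 * mu_0) / (N + beta_0),
        "beta": beta_0 + N,
        "nu": nu_0 + N,
        "W": np.linalg.inv(W_inv),
    }

# One-dimensional example with two user features, 1 and 3.
stats = gaussian_wishart_update(np.array([[1.0, 3.0]]), np.array([0.0]),
                                beta_0=1.0, nu_0=2.0, W_0=np.eye(1))
```

The same routine applies unchanged to the item and side features, reflecting the symmetry noted above.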
2.5 Predictions
2.5.1 Gibbs Sampling
Prediction for Monte Carlo inference is performed by generating, over sampling runs t, samples of the features U_i^{(t)}, V_j^{(t)}, W_k^{(t)} and biases \gamma_i^{(t)}, \eta_j^{(t)} for i = 1:N, j = 1:M, k = 1:M from a Markov chain whose stationary distribution is the posterior distribution over the model parameters and the hyperparameters. The Monte Carlo prediction is taken as the mean of the distribution of the rating, conditional on the samples. In other words, the average over the samples,

\hat{E}[r_{i,j}] = \frac{1}{T} \sum_{t=1}^{T} \left[ \gamma_i^{(t)} + \eta_j^{(t)} + U_i^{(t)\top} V_j^{(t)} \right].    (2.30)
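The averaging in Equation (2.30) can be sketched as follows; the list-of-dicts representation of the Gibbs samples and the function name are illustrative assumptions:

```python
import numpy as np

def gibbs_predict(samples, i, j):
    """Average the rating prediction over posterior samples, Equation (2.30).

    `samples` is a list of dicts, each holding one Gibbs draw of the parameters
    (bias vectors "gamma" and "eta", feature matrices "U" and "V").
    """
    preds = [s["gamma"][i] + s["eta"][j] + s["U"][:, i] @ s["V"][:, j]
             for s in samples]
    return np.mean(preds)

# Two toy draws for a single user and item: predictions 3.0 and 4.0.
samples = [
    {"gamma": np.array([1.0]), "eta": np.array([1.0]),
     "U": np.array([[1.0]]), "V": np.array([[1.0]])},
    {"gamma": np.array([0.0]), "eta": np.array([0.0]),
     "U": np.array([[2.0]]), "V": np.array([[2.0]])},
]
pred = gibbs_predict(samples, 0, 0)
```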
2.5.2 Variational Inference
The prediction for the variational algorithm is the expected rating under the variational approximation with the currently inferred distribution, E_Q[r_{i,j}],

E_Q[r_{i,j}] = E_Q[\gamma_i + \eta_j + U_i^\top V_j]
= E_Q[\gamma_i] + E_Q[\eta_j] + E_Q[U_i]^\top E_Q[V_j].    (2.31)
The final equality follows from the linearity of the expectation operator and from the independence
assumptions made by the mean field approximation.
2.6 Experimental Setup
We experiment on two sets of data: the MovieLens 1M data set and the Epinions data set. Both of these data sets have been used frequently in the literature in the collaborative filtering context [15, 17]. Descriptive statistics for the two data sets are given in Table 2.1.
MovieLens 1M consists of 1,000,209 ordinal ratings on the scale \{1, 2, 3, 4, 5\} by N = 6,040 users on M = 3,952 items. To make a direct comparison to previously reported variational results [17], we removed any movies rated fewer than three times, and ensured that each user and movie appeared in the training set at least once. The data was split into a 70% training / 30% testing set for evaluation. We report root-mean-square error (RMSE) on the test set.
Epinions consists of 664,824 ordinal ratings on the scale \{1, 2, 3, 4, 5\} by N = 49,290 users who rated M = 139,738 items. We ensured that each user and each item appeared at least once in the training set. No other conditions were imposed on the train / test split. The data was split into a 70% training / 30% testing set for evaluation. We report root-mean-square error (RMSE) on the test set for the models considered.
Section 2.3 abstracted the notion of constrained PMF as an additional set of latent features associated
with either rows or columns. When associated with columns, these features shifted the prediction for
each row, and vice versa when associated with rows. The optimal choice is to associate the additional
set of latent features with the dimension of the matrix with less sparsity. For MovieLens, we associate
the side features with columns, while for Epinions, we associated the side features with rows. Given
Table 2.1: Summary data on the MovieLens 1M and the Epinions data sets.

                        MovieLens    Epinions
Number of Ratings       1,000,209    664,824
Number of Users         6,040        49,290
Number of Items         3,952        139,738
Ratings per Item
    Min                 0            0
    25th                23           1
    50th                104          1
    Mean                166          3
    75th                323          2
    Max                 3,428        1,408
Ratings per User
    Min                 20           0
    25th                44           1
    50th                96           3
    Mean                253          9
    75th                208          9
    Max                 2,314        724
Sparsity                4.19%        0.01%
the number of users and items in each system, this choice introduces the smallest number of additional
feature vectors.
To be clear, we consider the following three sets of models. The first includes only the user and item biases. The second is the vanilla PMF model with user and item features and biases. The third is the constrained PMF model that adds the side features to the existing user and item features. A MAP estimate was generated for each model on each data set, and Gibbs sampling was initialized from these MAP estimates.
For Gibbs sampling, experimentation with different numbers of samples was used to determine a point at which convergence occurred. Burn-in was ignored: an exploratory analysis of traceplots of the feature vectors suggested quick mixing, and the initial decline in the overall test error was rapid. Combined, these suggest that allowing for burn-in would yield minimal improvement. Convergence of the variational algorithm was assessed using the variational lower bound.
Unless otherwise noted, all simulations that follow used the following choices for the tuning parameters. For the Gaussian-Wishart priors on the feature vectors, (\mu_0, W_0, \beta_0, \nu_0) = (0_{d \times 1}, I_d, 1, d + 1). The mean value was chosen to reflect that the features are mean zero after accounting for the biases, while the values for the scale matrix and degrees of freedom were selected to give a vague prior that is still proper.
2.7 Experimental Results
Table 2.2 summarizes the test RMSE values obtained on the two data sets under the different models
and inference algorithms considered. The subsections that follow describe these results in detail. To
summarize our results, we find:
• The variational algorithm tends to overfit;
• The degree to which the variational algorithm overfits depends on the choice of hyperparameters,
Table 2.2: Overall test error rates on the MovieLens 1M and Epinions data sets under the models and inference algorithms considered. MAP estimate values are reported in the final column.

Data        Parameters    Inference    RMSE       MAP
MovieLens   Biases        Either       0.9101     0.9210
MovieLens   BPMF          Gibbs        0.8452     0.8880
MovieLens   BPMF          VI           0.8546†    0.8880
MovieLens   BCPMF         Gibbs        0.8407     0.8805
Epinions    Biases        Either       1.0460     1.1298
Epinions    BPMF          Gibbs        1.0455     1.1211
Epinions    BPMF          VI           1.0550†    1.1211
Epinions    BCPMF         Gibbs        1.0457*    1.1134

† The variational results are reported with an alternate choice of hyperparameters, as discussed in the analysis below.
* The sparsity of the Epinions data set limits the incremental benefit of side features for this data set.
specifically the Wishart scale matrix W0;
• The most significant gain in performance results from including side features to model correlational
influence.
2.7.1 Variational Inference
We first consider the performance of the variational algorithm on the model with no side features and
with the default choice of hyperparameters. Under this setup, the variational lower bound monotonically
increases and converges within the first 10 full updates of the parameter set, as illustrated in Figure 2.2
(a). However, the model clearly overfits on the test set and is unable to improve upon the MAP
estimate; see Figure 2.2 (b).
Further investigation shows that the MAP estimate obtained is probabilistically unlikely under the
prior selected. Specifically, the hyperparameters of the Gaussian-Wishart priors do not favour the set of
features obtained under MAP estimation. The MAP values do not suggest a Wishart scale matrix set
to the identity. We re-run variational inference using a modified set of hyperparameters. Letting U_i^{(0)} and V_j^{(0)} denote the features obtained in the MAP estimate, we set

W_0^{−1} = (1/2) diag( Σ_{i=1}^{N} U_i^{(0)} U_i^{(0)⊤} ) + (1/2) diag( Σ_{j=1}^{M} V_j^{(0)} V_j^{(0)⊤} ).    (2.32)
For our MAP estimate, this creates a scale matrix W0 with diagonal elements ranging in numerical
value from 38 to 82. This is an order of magnitude larger than the identity scale matrix that we see
will provide good performance for Gibbs sampling. We refer to this modified choice of prior as the
“MAP-driven” prior.
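Equation (2.32) is cheap to compute from the MAP features. A sketch (ours; it assumes `U0` and `V0` are N × d and M × d arrays holding the MAP feature vectors as rows):

```python
import numpy as np

def map_driven_W0(U0, V0):
    """Wishart scale matrix from MAP features, Eq. (2.32):
    W0^{-1} = (1/2) diag(sum_i U_i U_i^T) + (1/2) diag(sum_j V_j V_j^T)."""
    # The diagonal of sum_i U_i U_i^T is just the column-wise sum of squares.
    W0_inv = 0.5 * np.diag((U0 ** 2).sum(axis=0)) + 0.5 * np.diag((V0 ** 2).sum(axis=0))
    return np.linalg.inv(W0_inv)
```

Because the right-hand side of (2.32) is diagonal, W0 is simply the elementwise reciprocal of those diagonal entries.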
This choice of hyperparameters allows the variational algorithm to increase the lower bound for more
updates, tending to converge closer to 50 updates, see Figure 2.2 (c). In addition, the RMSE values
obtained on the test set are more favourable, see Figure 2.2 (d). However, the algorithm still overfits
slightly on the test set. The training lower bound starts to converge after approximately 50 full updates
of the variables, while the test error reaches a minimum of 0.8538 after 26 updates. By the time the
Figure 2.2: (left) Variational lower bound and (right) RMSE for the training and test sets for the PMF model with (top) default hyperparameters and (bottom) an alternative choice of hyperparameters. [Panels (a)-(d): lower bound and RMSE plotted against update number, with separate curves for the training and test sets; plot data omitted.]
lower bound has converged, this has marginally increased to 0.8546. The marginal increase suggests
stability in the test error, but the increase in test error is not desirable.
2.7.2 Gibbs Sampling
Holding the set of latent features fixed, inference via Gibbs sampling outperforms the variational mean
field approximation. This comparison is trivial when we consider the default choice for the Wishart scale
matrix of W0 = Id×d, since the variational methods overfit quickly.
For a more interesting comparison, we consider the performance of the Gibbs sampler with the
default prior and the variational algorithm with the “MAP-driven” prior. For simplicity, we focus the
discussion on the baseline model without side features. We select the Gibbs iteration and variational
update for which the overall test errors are near equal. From Figure 2.3 (a), this is the 29th update of
the parameters under variational inference, and the 29th iteration of the Gibbs sampler.
With W0 driven by the MAP estimate, the variational algorithm outperforms the Gibbs sampler in
overall test error for approximately the first 30 iterations. This performance gain is motivated by drops
Table 2.3: Test RMSE broken down by user frequency under Gibbs sampling for the models with and without side features.

Number of Ratings   BPMF     BCPMF    Relative Change (%)
≤ 25                0.9120   0.9035   +0.92
26−71               0.8674   0.8608   +0.76
72−146              0.8561   0.8494   +0.78
147−171             0.8508   0.8459   +0.58
172−301             0.8193   0.8155   +0.46
302−484             0.8275   0.8243   +0.38
485−829             0.8109   0.8107   +0.03
830−2,313           0.7475   0.7474   +0.01
in the first five iterations. After the first five iterations, the variational algorithm experiences diminishing
returns. However, the Gibbs sampler continues to drop at a similar rate beyond this point.
The two algorithms have approximately the same overall error rate on the test set, to within 0.0001,
after the 29th iteration / update. Figure 2.3 (b) illustrates the error of the two inference methods at
this point with respect to user frequency. The difference in the performance of the two algorithms is
on the order of 0.001 or less, except for the most frequent bin. This bin corresponds to the 10% most
frequent users, and the Gibbs sampler outperforms the variational algorithm.
The performance gap between the Gibbs sampler and the variational approximation for the most
frequent users suggests that the sampling distribution of the user feature vectors has noticeable variabil-
ity. Figure 2.3 (c) illustrates this by plotting the maximum variance of the d-dimensional user features
against the number of ratings the user has in the training set. These are representative values for both
inference algorithms after convergence. The vertical gap between the points for the Gibbs sampler and
the variational approximation indicates that the variational approximation tends to produce smaller
estimates of the variance than the Gibbs sampler. The vertical line represents users with 829 ratings in
the training set, which is the value above which users are included in the last bin in Figure 2.3 (b). The
persistent significant difference between these two beyond this point means that there is still variability
in the distribution of the user features that the Gibbs sampler is exploiting.
2.7.3 Side Features
We consider the predictive gain when including the side features in the model. Table 2.3 tabulates the
converged prediction error for the sampler in the model with only user and item features (i.e., BPMF)
and the model with user, item, and side features (i.e., BCPMF) with respect to user frequency. For
extremely common users (over 800 ratings), there is essentially no predictive gain. As expected, there
is a predictive gain for the least common users (with number of ratings on the order of 20− 30).
It is interesting to note that there is still a noticeable gain in test performance for moderately frequent
users, those with several hundred ratings. This gain highlights the importance of correlational influence
[15], which is being captured by the side features.
Figure 2.3: (a) Overall test error for the Gibbs sampler with default hyperparameters and variational algorithm with the MAP-driven hyperparameters for the PMF model without side features. (b) The test error with respect to user frequency. The two are nearly identical for users of all frequencies, with the exception of the most frequent users. In this bin, the Gibbs sampler outperforms the variational approximation. (c) Maximum variance for the user features plotted against the user frequency under both Gibbs sampling and the variational algorithm after convergence. [Panels: (a) test RMSE vs. iteration/update for Gibbs and VI; (b) test RMSE vs. number of ratings (upper bin edge); (c) maximum diagonal variance vs. number of ratings on a log-log scale; plot data omitted.]
Figure 2.4: Test RMSE for the Gibbs sampler with and without side features.
[Panel (a): test RMSE vs. sample number, with curves for the models with and without side features; plot data omitted.]
2.8 Conclusion
In this chapter, we reviewed the baseline matrix factorization model for collaborative filtering. We
reviewed one constrained PMF model that was introduced to improve recommendation for cold start
users, and extended this to the Bayesian framework. We provided a comparison between Gibbs sampling
and variational inference, noting that variational inference requires precise tuning of the Gaussian-
Wishart priors for optimal performance and is also prone to overfitting. Based on this, we advocate the
further use of Monte Carlo methods for prediction in these models.
An analysis of the performance of the Gibbs sampler with respect to user frequency demonstrated
that the inclusion of side features offers predictive gains for even moderately common users, those with
several hundred ratings. This highlights the importance of modeling correlational influence in the rating
patterns.
It is worth noting, however, that the computation time required for sampling the side features is
substantial. Sampling a single side feature Wk requires considering the subset of the rating matrix
consisting of all users who rated a given item. That is, one must consider all users ui for which Iui,k = 1,
and all items each of these users rated. For globally popular items, this can be a substantial proportion
of the original data set. Future work can look at probabilistic ways to select items for which these side
features will be included. Subsequent chapters will look at alternative models for constraining features
that have less computational complexity.
Chapter 3

Precision Models for Matrix Factorization
The previous chapter dealt with constrained PMF models that shift the expected rating based on taste
similarity between users or rating similarity between items. In the probabilistic framework, these meth-
ods are shifting the first moment of the expectation for each entry in the rating matrix. Other work
has looked at shifting the precision for each entry in the rating matrix [17]. These extensions allow for
heteroskedastic extensions of Bayesian PMF. In general, these extensions modify the likelihood for the
rating to
(r_{i,j} | U_i, V_j) ∼ N(r_{i,j} | U_i^⊤ V_j, τ_{i,j}),    (3.1)

where τ_{i,j} : N × N → (0, ∞) is now a function of the users and items.
Previous work has explored different approaches for introducing this heteroskedastic variance. In
particular, one extension grounded in multiplicative user / item precision factors was proposed and
found to produce lower test RMSE than the vanilla constant variance model. In this chapter, we review
the previously proposed approaches, revisit the results, and show the gains are disproportionately coming
from the most common users in the recommendation system. That is, the performance gain for including
these precision factors is improving performance for only a few very frequent users.
The predicted ratings for the most frequent users are the most accurate. In addition, the most
frequent users are actively demonstrating they are engaged in the system. In light of these, they are
practically not the users that we need to be focusing on improving recommendations for. In addition,
we show that the least frequent users do not gain better test error performance by the introduction
of the heteroskedastic variance model. We identify an overfitting issue with the estimated precisions
that causes this increase in test RMSE, and highlight that this issue is present when using variational
methods for inference. This observation mirrors the overfitting problem for the variational algorithm in
Chapter 2.
To alleviate the overfitting issue, we propose a truncated precision model. Specifically, we

• outline the probabilistic framework for truncated precisions;
• derive the variational distribution under a mean field approximation, and the proper estimates of the truncated precisions;
• develop a Gibbs sampler for the truncated precisions;
• provide experimental results using data sets common in the literature;
• connect the limiting behaviour of the truncated precision model to the constant variance and non-constant variance models.

Figure 3.1: Bayesian Heteroskedastic Probabilistic Matrix Factorization with Gaussian-Wishart priors over the latent user, item, and side feature vectors. The user precision factors αi and item precision factors βj allow for non-constant variance in the observed preference ri,j. [Plate diagram: a user plate i = 1:N containing γi, Ui, αi with hyperparameters (µγ, λγ), (µU, ΛU), (aU, bU); an item plate j = 1:M containing ηj, Vj, βj with hyperparameters (µη, λη), (µV, ΛV), (aV, bV); a side-feature plate k = 1:N containing Wk with hyperparameters (µW, ΛW); shared priors (µ0, β0), (W0, ν0); and the global precision τ with prior (aτ, bτ); all feeding the observed rating ri,j.]
3.1 Existing Noise Models for BPMF
Work in the literature has looked at exploring non-Gaussianity through combinations of two model
variations. The first is by replacing the Gaussian priors for the features Ui and Vj with Student-t priors.
Equivalently, this approach places joint distributions on the latent features and a precision factor

(U_i, α_i) ∼ N(U_i | µ_U, α_i Λ_U) G(α_i | a_U/2, b_U/2)
(V_j, β_j) ∼ N(V_j | µ_V, β_j Λ_V) G(β_j | a_V/2, b_V/2).    (3.2)

Here, X ∼ G(x | a, b) denotes a random variable with the Gamma distribution parametrized with shape a and rate b. This corresponds to the density

p(x | a, b) ∝ x^{a−1} e^{−bx}.    (3.3)
Analytically integrating out the αi and βj produces Student-t distributions for Ui and Vj:

(U_i | µ_U, Λ_U) ∼ ∫ N(U_i | µ_U, α_i Λ_U) G(α_i | a_U/2, b_U/2) dα_i
(V_j | µ_V, Λ_V) ∼ ∫ N(V_j | µ_V, β_j Λ_V) G(β_j | a_V/2, b_V/2) dβ_j.    (3.4)
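As a quick one-dimensional sanity check (our sketch, not part of the thesis experiments), mixing a zero-mean Gaussian over a Gamma-distributed precision reproduces the Student-t marginal of Equation (3.4): with shape a/2 and rate b/2, the marginal has ν = a degrees of freedom and variance b/(λ(a − 2)) for a > 2.

```python
import numpy as np

# x | alpha ~ N(0, (alpha * lam)^{-1}),  alpha ~ Gamma(a/2, rate b/2)
# => marginally x is Student-t with nu = a degrees of freedom and
#    variance b / (lam * (a - 2)) when a > 2.
rng = np.random.default_rng(1)
a, b, lam = 6.0, 6.0, 1.0
alpha = rng.gamma(shape=a / 2, scale=2.0 / b, size=200_000)  # numpy takes scale = 1/rate
x = rng.normal(0.0, (alpha * lam) ** -0.5)
# The empirical variance of x should be close to b / (lam * (a - 2)) = 1.5.
```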
Alternatively, multiplicative models for the precision on the likelihood term have been considered,

(r_{i,j} | U_i, V_j, α_i, β_j, τ) ∼ N(r_{i,j} | U_i^⊤ V_j, α_i β_j τ)
(α_i | a_U, b_U) ∼ G(α_i | a_U/2, b_U/2)
(β_j | a_V, b_V) ∼ G(β_j | a_V/2, b_V/2).    (3.5)
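Generating a rating under the multiplicative model of Equation (3.5) is a single heteroskedastic Gaussian draw. A sketch (ours; the names are illustrative):

```python
import numpy as np

def sample_rating(U_i, V_j, alpha_i, beta_j, tau, rng):
    """Draw r_ij ~ N(U_i^T V_j, (alpha_i * beta_j * tau)^{-1}), Eq. (3.5).
    The precision is multiplicative in the user factor, item factor, and global tau."""
    mean = float(U_i @ V_j)
    sd = (alpha_i * beta_j * tau) ** -0.5  # standard deviation = precision^{-1/2}
    return rng.normal(mean, sd)
```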
The literature previously tested the suitability of these extensions with a variational mean field
approximation. One variational approximation considered the independence assumptions

q(U_{1:N}, α_{1:N}) = ∏_{i=1}^{N} q(U_i, α_i),    q(V_{1:M}, β_{1:M}) = ∏_{j=1}^{M} q(V_j, β_j).    (3.6)
This choice of independence assumptions leads to Student-t distributions for the features Ui, Vj. Similarly, the form of the variational approximation for the multiplicative model was chosen as

q(U_{1:N}, V_{1:M}, α_{1:N}, β_{1:M}) = ∏_{i=1}^{N} q(U_i) q(α_i) ∏_{j=1}^{M} q(V_j) q(β_j).    (3.7)
This choice of independence assumptions leads to Gamma distributions for αi, βj. Experimental
results reported with these mean field approximations suggested that the user / item multiplicative
precision model improved overall performance, but that Student-t priors for the latent features did
not. Our work reviews the multiplicative model, outlines an overfitting issue, and proposes a truncated
precision model as an alternative. We demonstrate how the truncated precision model does not suffer
from the same overfitting issues.
For brevity, in the work that follows, we adopt the following shorthands:
1. The “constant”, or homoskedastic, model, meaning the vanilla PMF model where E[r_{i,j}] = U_i^⊤ V_j and Var[r_{i,j}] = τ^{−1};

2. The “robust”, or heteroskedastic, model, referring to the proposed multiplicative model referenced above and in Equation 3.5. Here, E[r_{i,j}] = U_i^⊤ V_j and Var[r_{i,j}] = (α_i β_j τ)^{−1};
3. The “truncated” model, to be outlined in this chapter.
3.2 Truncated Precisions
The choice of the Gamma distribution for the precision factors is computationally convenient. This
choice leads to relatively simple distributional forms for a variational mean field approximation, and is
conjugate in the case of Gibbs sampling. In practice, this choice is inappropriate. It is limited in that
the Gamma distribution is unbounded towards zero and infinity. In the context of the recommendation
system, a user (equiv. item) precision that is close to zero results in complete vagueness for the ratings
for that user (equiv. item), allowing the posterior distribution for the rating to be arbitrarily broad.
Similarly, a user (equiv. item) precision that is arbitrarily large results in delta functions for the posterior
distribution of the rating.
In the experimental results that follow in this chapter, we noticed this Gamma model for the precisions
posed an inference problem when direct minimization of the variational objective functions was required.
For instance, the variational lower bound under the mean field approximation contains a scaled sum of
squared errors from the likelihood
−(τ/2) Σ_{i=1}^{N} Σ_{j=1}^{M} I_{i,j} α_i β_j (r_{i,j} − E[r_{i,j}])^2.    (3.8)
The deterministic nature of variational inference may optimize this by arbitrarily shrinking some
user and item precisions to zero, while driving others arbitrarily large. The result is a decrease in the
overall error by optimizing for a subset of the user-item matrix.
A solution to this pathological behaviour is to bound the precisions to values suggested by the actual
data. Such truncation is commonly applied to different distributions, such as the Gaussian distribution.
Truncation of random variables appears often enough in practice that standard software packages contain
support for it [28].
In general, an unbounded distribution with density g_X(x) is truncated to (ℓ, u) by defining f_X(x) ∝ g_X(x) 1(ℓ < x < u). For the case of the Gamma precisions in Equation (3.5), the density becomes

f_X(x) ∝ x^{α−1} e^{−βx} 1(ℓ < x < u).    (3.9)
The introduction of the truncated Gamma distribution introduces two additional tuning parameters:
` and u. We show that these tuning parameters for the lower and upper bound can be chosen sensibly
from the data.
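For reference, the unnormalized log-density of Equation (3.9) with an explicit indicator can be sketched as follows (ours; `lo` and `hi` play the roles of ℓ and u, and x > 0 is assumed):

```python
import numpy as np

def trunc_gamma_logpdf_unnorm(x, alpha, beta, lo, hi):
    """Unnormalized log-density of Eq. (3.9):
    x^{alpha - 1} e^{-beta x} on (lo, hi), and -inf (zero density) outside.
    Assumes x > 0."""
    x = np.asarray(x, dtype=float)
    core = (alpha - 1.0) * np.log(x) - beta * x
    return np.where((x > lo) & (x < hi), core, -np.inf)
```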
3.3 Inference
The multiplicative models for Gamma noise and truncated Gamma noise are independent of the side
feature extension in Chapter 2. Consequently, the multiplicative models can be considered extensions
to the side feature model, or the vanilla PMF model. This leads to four possible models by selectively
including side features and selectively including precisions.
With respect to inference, all four models readily permit inference using either Monte Carlo methods
or variational inference. All four models are still conjugate in the Gibbs framework, allowing for efficient
Gibbs sampling. It can be shown that under an appropriately chosen mean field approximation, the
form of the variational distribution is still tractable. We outline in this section the necessary extensions
to inference from Chapter 2.
3.3.1 Gibbs Sampling
The choice of Gamma distribution as a prior for the multiplicative user and item precisions is conjugate
to the likelihood. In particular, the posterior for the user precisions is a Gamma,
(α_i | D) ∼ G(α_i | a_{U_i}, b_{U_i}), where

a_{U_i} = a_U + (1/2) Σ_{j=1}^{M} I_{i,j}
b_{U_i} = b_U + (τ/2) Σ_{j=1}^{M} I_{i,j} β_j (r_{i,j} − r̂_{i,j})^2,    (3.10)

with r̂_{i,j} denoting the model's predicted rating.
A similar expression for the item precisions holds by symmetry:

(β_j | D) ∼ G(β_j | a_{V_j}, b_{V_j}), where

a_{V_j} = a_V + (1/2) Σ_{i=1}^{N} I_{i,j}
b_{V_j} = b_V + (τ/2) Σ_{i=1}^{N} I_{i,j} α_i (r_{i,j} − r̂_{i,j})^2.    (3.11)
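The conjugate updates in Equations (3.10)-(3.11) are sums over the observed entries of one row (or column). A sketch for the user side (ours; the array names, and `R_hat` for the current predicted ratings, are our conventions):

```python
import numpy as np

def user_precision_posterior(i, a_U, b_U, tau, I, beta_item, R, R_hat):
    """Posterior Gamma parameters (a_Ui, b_Ui) for user precision alpha_i, Eq. (3.10).
    I is the N x M observation indicator; beta_item holds the item precisions."""
    obs = I[i] == 1
    a_post = a_U + 0.5 * obs.sum()
    b_post = b_U + 0.5 * tau * np.sum(beta_item[obs] * (R[i, obs] - R_hat[i, obs]) ** 2)
    return a_post, b_post
```

Sampling `alpha_i` is then a single Gamma draw, e.g. `rng.gamma(a_post, 1.0 / b_post)` (NumPy parametrizes by scale, the reciprocal of the rate); the item update is symmetric.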
This parametrization continues to hold if the priors for the precisions are truncated Gamma distri-
butions. Let X be an arbitrary precision random variable with truncated Gamma density. Then X has
density
f_X(x) ∝ x^{α−1} e^{−βx} 1(x ∈ (a, b)).    (3.12)
However, the sampling algorithm requires the introduction of an auxiliary variable [3]. Introduce the auxiliary variable Y ∼ Unif(0, e^{−βx}). With this auxiliary variable, a conditional density for the precision can be defined. Continuing to let X denote an arbitrary precision, we have

f_{X|Y}(x | y) ∝ x^{α−1} 1(x ∈ (a, min{b, −log(y)/β})).    (3.13)
In the context of Monte Carlo inference, only −log(y)/β is needed for sampling. Its distribution satisfies, for u ≥ x,

P(−(1/β) log y > u) = P(log y < −βu) = P(y < e^{−βu}) = e^{−βu} / e^{−βx} = e^{−β(u−x)}.    (3.14)

This expression determines the CDF of the auxiliary variable −log(y)/β, so the inverse CDF method can be used for sampling. Explicitly, let p ∈ (0, 1); then

1 − p = 1 − F(u) = e^{−β(u−x)}  ⟹  u = x − (1/β) log(1 − p).    (3.15)
With a sample of −log(y)/β, the inverse CDF method can subsequently be used to generate a sample of the precision random variable X. Define M = min{b, −log(y)/β} as the minimum of the upper bound b and the sampled auxiliary variable. The required normalizing constant is defined by

Z^{−1} ≡ (β^α / Γ(α)) ∫_a^M x^{α−1} dx = (β^α / Γ(α)) (M^α − a^α)/α.    (3.16)
Therefore, the inverse CDF technique will yield the needed precision sample y by satisfying the equation

p = ∫_a^y Z (β^α / Γ(α)) x^{α−1} dx = (1/(M^α − a^α)) [x^α]_{x=a}^{y} = (y^α − a^α)/(M^α − a^α).    (3.17)

The solution to this equation gives our sample, y, as

y = (p(M^α − a^α) + a^α)^{1/α} = (pM^α + (1 − p)a^α)^{1/α}.    (3.18)
This solution poses a numerical issue. From Equations (3.10)-(3.11), the shape parameter α equals a_{U_i} or a_{V_j}, which grows with the number of observed values in the row or column of the rating matrix. These can easily be large, leading to numerical overflow. The solution is to compute Equation (3.18) on the log scale and rewrite:

log y = (1/α) log(pM^α + (1 − p)a^α)
      = (1/α) log( e^{log p + α log M} + e^{log(1−p) + α log a} ).    (3.19)
The log-sum-exp trick can now be used to sample log y, and in turn y.
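Putting Equations (3.13)-(3.19) together, one auxiliary-variable update can be sketched as follows (our illustration, assuming a lower truncation bound a > 0; the names are not from any released code):

```python
import numpy as np

def trunc_gamma_slice_step(x_prev, alpha, beta, a, b, rng):
    """One auxiliary-variable update for X ~ Gamma(alpha, beta) truncated to (a, b).

    Step 1: y ~ Unif(0, exp(-beta * x_prev)); equivalently
            -log(y)/beta = x_prev - log(u)/beta with u ~ Unif(0, 1).
    Step 2: x | y has density proportional to x^{alpha - 1} on
            (a, min(b, -log(y)/beta)), sampled by the inverse CDF
            computed on the log scale (Eqs. 3.18-3.19)."""
    t = x_prev - np.log(rng.uniform()) / beta
    M = min(b, t)
    p = rng.uniform()
    # log x = (1/alpha) * log(p * M^alpha + (1 - p) * a^alpha), via log-sum-exp
    terms = np.array([np.log(p) + alpha * np.log(M), np.log1p(-p) + alpha * np.log(a)])
    m = terms.max()
    log_x = (m + np.log(np.exp(terms - m).sum())) / alpha
    return float(np.exp(log_x))
```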
The heteroskedastic precision model re-weights the terms in the previous model, leading to different
posterior sampling distributions. These are outlined in Appendix C and Appendix B.
3.3.2 Variational Inference
We extend the variational mean field approximation used in Section 2.4.2. Specifically, we modify
Equation 2.23 to include the αi and βj , retaining pairwise independence,
Q(γ_{1:N}, U_{1:N}, α_{1:N}, η_{1:M}, V_{1:M}, β_{1:M}, W_{1:M}, τ, µ_U, Λ_U, µ_V, Λ_V, µ_W, Λ_W | D)
= Q(τ) ∏_{i=1}^{N} Q(γ_i) Q(U_i) Q(α_i) ∏_{j=1}^{M} Q(η_j) Q(V_j) Q(β_j) ∏_{k=1}^{M} Q(W_k)
  × Q(µ_U, Λ_U) Q(µ_V, Λ_V) Q(µ_W, Λ_W).    (3.20)
Under this mean field approximation, it can be shown that Q(αi) and Q(βj) are also Gamma distributions if αi and βj are Gamma distributions according to p(θ, D). Equivalently, Q(αi) and Q(βj) are truncated Gamma distributions if αi and βj are truncated Gamma distributions according to p(θ, D).
MAP estimation for the truncated Gamma distribution is still available in closed form. Consider the case where a precision random variable X has a Gamma distribution with shape α and rate β, truncated to the interval (a, b). The MAP estimate is the intuitive estimate

max( a, min( (α − 1)/β, b ) ).    (3.21)
This corresponds to the usual MAP estimate, truncated to the boundaries of the truncated Gamma
distribution.
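In code, Equation (3.21) is a one-line clip (a sketch; here alpha, beta are the Gamma shape and rate, and (a, b) the truncation interval):

```python
def truncated_gamma_map(alpha, beta, a, b):
    """MAP estimate of a Gamma(alpha, beta) truncated to [a, b], Eq. (3.21):
    the unconstrained mode (alpha - 1)/beta, clipped to the interval."""
    return max(a, min((alpha - 1.0) / beta, b))
```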
The proof is as follows. The derivative of the log-density, (α − 1)/x − β, has its unique zero at the critical point x∗ = (α − 1)/β. If x∗ ∈ [a, b], we are done. If not, then we need to check the boundaries.

Consider the case x∗ > b (the case x∗ < a is similar). We need to show that f(b) > f(x) for every x ∈ [a, b) for b to be the MAP estimate. We have

f(b)/f(x) = (b^{α−1} e^{−βb}) / (x^{α−1} e^{−βx}) = (b/x)^{α−1} e^{−β(b−x)}.    (3.22)

Taking logarithms and applying the mean value theorem to log(b/x) = log b − log x gives, for some ξ ∈ (x, b),

log[ f(b)/f(x) ] = (α − 1)(b − x)/ξ − β(b − x) = (b − x) [ (α − 1)/ξ − β ].    (3.23)

Since ξ < b < x∗ = (α − 1)/β, we have (α − 1)/ξ > (α − 1)/x∗ = β, so the right-hand side is strictly positive. Hence f(b) > f(x) for all x ∈ [a, b); that is, f is increasing on [a, b] and the MAP estimate is the boundary value b.
3.4 Prediction
Prediction for these models with truncated precisions is performed identically to the latent feature models
in Chapter 2. For brevity, the reader is referred to Section 2.5 for the details.
3.5 Experimental Setup
We extend our experimental results from Chapter 2, and so we experiment on the same data sets: the
MovieLens 1M data set and the Epinions data set. We make use of the same tuning parameters and the
same MAP estimates discussed in Section 2.6. All precisions were initialized to 1, corresponding to the
vanilla PMF model.
3.6 Experimental Results
Table 3.1 summarizes the test RMSE values obtained on the two data sets under the different models
and inference algorithms considered. The subsections that follow describe these results in detail. To
summarize our results, we find:
• The variational algorithm tends to overfit;
• Modeling precisions can improve performance, though it may be necessary to bound the precisions
for the deterministic approximations given by variational inference to be sensible;
• The most significant gain in performance results from including side features to model correlational
influence. When these are included, there is no clear benefit of including precisions.
3.6.1 Variational inference
Relative to the homoskedastic PMF models of Chapter 2, assessing the convergence of the robust pre-
cision model is more difficult. The inclusion of user and item multiplicative precision factors allows the
variational algorithm to arbitrarily weight the contribution to the complete log-likelihood from different
rows and columns in accordance with the predictive accuracy of the model. To be concrete, rows and
columns with low accuracy can be down-weighted with small precisions. This is discussed further in
Section 3.6.2 when we compare the predictive performance of the variational algorithm to the predictive
performance of the Gibbs sampler.
To outline, the variational lower bound computed on the training set increases almost linearly, with the training error smoothly dropping. The test error reaches a minimum of 0.8570 after nine updates. The variational lower bound computed on the test set, however, continues to increase for 13 updates before it begins to decrease; Figure 3.2 illustrates this behaviour. The lower bound on the test set is able to keep increasing between the ninth and 13th updates even while the test error increases, because the squared error on the test set is weighted in the lower bound by the user / item precisions.
Table 3.1: Overall test error rates on the (a) MovieLens 1M and (b) Epinions data sets under the precision and inference models considered. MAP estimate values are given in the MAP column.

(a) MovieLens 1M

Model        Inference   Constant  Robust  Truncated (n = 2)  MAP
No Features  Gibbs / VI  0.9101                               0.9210
BPMF         Gibbs       0.8452    0.8448  0.8475             0.8888
BPMF         VI†         0.8546    0.8570  0.8521
BCPMF        Gibbs       0.8407    0.8407                     0.8805

(b) Epinions

Model  Inference   Constant  MAP
BPMF   Gibbs / VI  1.0460    1.1298
BCPMF  Gibbs       1.0455    1.1211
BCPMF  VI          1.0550
Side   Gibbs∗      1.0457    1.1134

† These results are reported with an alternate choice of hyperparameters, as discussed in the analysis below.
∗ The sparsity of the Epinions data set limits the incremental benefit of side features for this data set.
Figure 3.2: (left) Variational lower bound and (right) RMSE for the training and test sets for the robust precision model with an alternative choice of hyperparameters. [Panels: (a) lower bound vs. update; (b) RMSE vs. update; curves for the training and test sets; plot data omitted.]
3.6.2 Gibbs Sampling
The Gibbs sampler avoids the pathological precision issue by virtue of sampling values for the precisions
from the posterior distribution for the precisions. This permits us to assess if modeling precisions
improves performance of users for a given frequency (cold start, rare, frequent, etc). To determine
this, we examine the final converged error rate on a user frequency basis for a model with and without
precisions, holding all else equal. For this comparison, we select the model with side features, and look
at the error under Gibbs sampling. The numerical results are in Table 3.2.
There are only minor departures from equality in the test error for moderately frequent users (the
third and fourth bin), and these departures from equality are in the fourth decimal place of the error.
This represents less than a 1% relative change in predictive performance. The largest difference occurs
for the most frequent users. This is a relative gain of 0.42%. However, these users have predictions
that are already well calibrated relative to the rest of the user set. In addition, the change relative
to the model without precisions is not related to the frequency of the user. While one bin of users
shows an improvement over the vanilla PMF model, the next may in fact show a decrease in predictive
performance. Based on these results, we conclude that modeling precisions does not tend to significantly
favour users of any given frequency in the test set.
The near equality of the constant and robust precision models could be a result of a near-constant
posterior for the precisions. To check this, we examine traceplots for the user and item precisions, as
well as the histogram of the precisions for a sample at convergence. The distribution of both the user
and item precisions was clearly non-constant and right skewed. Figure 3.3 (a) displays histograms of the
user and item precisions from a Gibbs sampler after convergence, indicating this skewness. Figure 3.3
(b) displays traceplots for a sample of the user and item precisions, showing the sampler mixed well over
a range of values. Both of these indicate that the sampler was exploiting the robust precision model.
Therefore, the near equality in predictive performance is not a result of model degeneracy.
It has been noted that the introduction of precisions prompts the variational algorithm to drive some
precisions arbitrarily small and arbitrarily large. The histograms in Figure 3.3 (a) and traceplots in
Figure 3.3 (b) indicate that the Gibbs sampler does not suffer from this limitation. For comparison,
empirical CDF curves are plotted in Figure 3.3 (c) for the user and item precisions under the variational
algorithm (left panel) and the Gibbs sampler (right panel) after convergence. The two panels are similar,
though the CDFs for the variational algorithm have been plotted on a log scale. In other words, the
converged values of the precisions under the variational algorithm are exponentially larger.
3.6.3 Truncated Precisions
It was noted that the introduction of the precisions leads to pathological results with variational inference.
We explored if bounding the precisions would alleviate this issue. Using the truncated approach discussed
in Section 3.2, we ran experiments bounding the precisions to values sensible for a scale constrained to
the interval [1, 5].
We found that bounds of (1/2, 2) produced results for MovieLens that outperformed both the constant
and robust precision model. These values also delayed the overfitting in the variational approximation
for several updates. Overfitting for this model starts after 20 full updates of the parameters.
Figure 3.4 (a) shows the overall test error of the variational algorithm under the constant, robust,
and truncated precision model with the bounds of (1/2, 2). These three models have similar behaviour
Figure 3.3: (a) Histograms of the user and item precisions from a Gibbs sample after convergence. (b) Traceplots of the user and item precisions. (c) CDF curves of the converged user / item precisions for the model under (left) Variational and (right) Gibbs. Note that the curves have similar shape, but that the variational CDF is plotted on a log scale for comparison. [Panels: (a) precision histograms (precision vs. frequency) for users and items; (b) traceplots of user and item precision values over 500 samples; (c) empirical CDF curves under the variational algorithm (log scale) and Gibbs; plot data omitted.]
in the initial set of parameter updates. Differences appear after the 8th update. At this point, the test error under the robust precision model begins to rise as the variational algorithm overfits, while under the truncated precision model it continues to drop for 3−4 additional iterations. The rate of increase for the two is approximately the same
until nearly the 40th parameter update, at which point the test RMSE under the truncated precision
model tends to increase at a faster rate.
Figure 3.4 (b) shows the test error by user frequency for the variational algorithm under the two
models after 50 full parameter updates. This is the point that the algorithm has begun to overfit in
the truncated and the robust model, and has stabilized for the constant precision model. This graph
shows the most significant difference in test error is in the most frequent users. The constant precision
model outperforms either heteroskedastic model (robust, truncated) by a difference of at least 0.1 in test
RMSE, a relative improvement in RMSE of 12%. The error rates are approximately the same in other
user bins, with the constant precision model performing slightly worse for moderately frequent users.
This demonstrates that the overfitting issue for the heteroskedastic models stems from some of the most
frequent users in the system.
A sequence of bounds can be formed as (ℓ, u) = (1/n, n) for integer n. With respect to this sequence, the truncated model has the constant model as a limiting case as n → 1 and the robust model as n → ∞. This raises the question of how inference for the truncated precision model changes with n.
Figure 3.4 (c) plots the test error of the variational algorithm for several values of n along with the
constant and robust precision model. As expected, larger values of n are similar to the robust precision
curve, while smaller values are similar to the constant precision curve.
We draw attention to the curve for n = 2, corresponding to the precision bounds (1/2, 2). For these
bounds, the variational algorithm obtains a significantly lower error rate than the other choices of n, as
well as for the constant precision model. The consistent tendency for the truncated model to overfit early
in learning for larger values of n suggests that the truncation value has little influence on performance
after a certain point. However, the improvement for the n = 2 case over the constant precision model
does suggest there is value in modeling heteroskedastic precision among different users and different
items.
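As a rough sketch (not code from the thesis), the truncation can be viewed as clamping a precision value into the interval (1/n, n); `truncated_update` below is a hypothetical helper illustrating the two limiting cases.

```python
import numpy as np

def truncated_update(alpha, n):
    """Clamp a precision value into the truncation bounds (1/n, n).

    As n -> 1, every precision collapses to 1 (the constant model);
    as n -> infinity, alpha passes through unchanged (the robust model).
    """
    return float(np.clip(alpha, 1.0 / n, n))
```

For example, a runaway precision of 51 under the bounds (1/2, 2) would be clamped to 2.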
3.6.4 Overfitting in the Robust Model
It was noted that the robust precision model overfits on the test set. It is important to ask why
this overfitting occurs. To investigate this, recall that the updates for the user (respectively, item)
precision parameters are inversely proportional to the error for that user (respectively item), scaled by
the precisions. In particular, the update rule for the user precision is
\[
\alpha_i^{-1} \propto \sum_{j=1}^{M} I_{i,j}\, \tau \beta_j (r_{i,j} - \hat{r}_{i,j})^2. \tag{3.27}
\]
Equation 3.27 suggests that the inverse precision (i.e., the variance) should have a 1/x^2 relationship with the (scaled) user error. A scatterplot of user precisions versus user error will show if this relationship holds in both training and testing, and a similar plot will do the same for items. If the model is generalizing well, the 1/x^2 relationship should be clear in both training and testing.
Figure 3.5 (a) plots the user training error and test error against the user precision. The desired 1/x^2 relationship is visually clear in the training set, with points tightly scattered along a 1/x^2 curve.
Figure 3.4: (a) Overall test error and (b) test error binned by user frequency for the variational algorithmunder the robust and truncated precision model (with precision bounds (1/2, 2)). (c) Overall test errorfor the variational algorithm under the robust, constant, and truncated precision models for differentchoices of bounds.
This is not visually clear in the test set. That is, the relationship dictated by the model is clear in the
training set, but is weak to non-existent in the test set.
This overfitting can be quantified through Pearson correlation. A log-transformation of Equa-
tion (3.27) yields:
\[
-\log(\alpha_i) \propto \log\left( \sum_{j=1}^{M} I_{i,j}\, \tau \beta_j (r_{i,j} - \hat{r}_{i,j})^2 \right). \tag{3.28}
\]
Equation (3.28) suggests there should be a linear correlation between the set of (negative) log preci-
sions and log error rates on a user / item level. Proper inference will have this linear correlation strong
in the training set, and generalization will have this linear correlation strong in the test set. Conversely,
overfitting will be indicated by high correlations in the training set, with much lower correlations in the test set.
We compute these linear correlations for the set of users and items in both the training and test sets,
and plot these correlations over iterations in Figure 3.5 (b). The correlations computed on the training
set are typically large and stable over iterations. The user correlation is consistently above 0.94, while
the item correlation is consistently above 0.70. They are not exactly 1 since the updates are sequential,
while the correlations are computed after a full parameter update. These values indicate that training
is proceeding as expected under the model.
When the same values are computed for the user and item errors in the test set, significantly smaller
values are obtained initially – less than 0.7 for the users and less than 0.4 for the items. This reflects an
initial drop of over 20% between the test and training sets. In addition, these values are not consistent
over updates, unlike the values in the training set. Indeed, the correlations under the test set decrease
monotonically over parameter updates. When the model overfits in the 9th parameter update, the
correlation in the test set for the items has dropped from 0.3440 to 0.2422, while the correlation in the
test set for the users has dropped from 0.6391 to 0.5965. The large difference between the training and
test set, both in initial values and in the magnitude of the drop over iterations, is further evidence that
the robust model is overfitting and not generalizing to the test set.
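As an illustration, this diagnostic can be computed with a few lines of NumPy; this is a hypothetical sketch, with `alphas` and `scaled_sq_errs` standing in for the fitted precisions and scaled squared errors of the users (or items).

```python
import numpy as np

def precision_error_correlation(alphas, scaled_sq_errs):
    """Pearson correlation between -log(precision) and log(scaled squared
    error), following Equation (3.28). Values near 1 indicate the expected
    relationship holds; a large train/test gap in this value signals
    overfitting."""
    return float(np.corrcoef(-np.log(alphas), np.log(scaled_sq_errs))[0, 1])
```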
Again, this overfitting with the fully robust model was observed only with variational inference. We
ran similar truncated precision experiments with the Gibbs sampler. We did not find that inference with the Gibbs sampler was consistently improved, or indeed consistently impacted at all, by using truncated precisions relative to the robust precision model. This is not unexpected
given that the histograms and traceplots of the precisions in Figure 3.3 (a)-(b) indicate the precisions
remain at O(1) values under the Gibbs sampler.
The distributional forms of the precisions are identical for the Gibbs sampler and the variational
algorithm. The only difference is in the update. Under Gibbs sampling, the scaled training error is the
conditional mean, from which a sample is drawn. Under the variational algorithm, the scaled training
error is the conditional mean, and is imputed as the update. With respect to the overfitting problem,
the extra noise introduced by the Gibbs samples appears to be preventing overfitting with the precisions.
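The contrast can be sketched as follows; this is illustrative Python rather than the thesis implementation, and the Gamma shape/rate hyperparameters `a0`, `b0` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def precision_update(scaled_sq_err, n_obs, method, a0=1.0, b0=1.0):
    """Both algorithms share the same Gamma conditional for a precision.

    Gibbs draws a sample from the conditional, injecting noise; the
    variational update deterministically imputes the conditional mean.
    """
    shape = a0 + 0.5 * n_obs
    rate = b0 + 0.5 * scaled_sq_err
    if method == "gibbs":
        return rng.gamma(shape, 1.0 / rate)  # stochastic draw
    return shape / rate                      # conditional mean
```

The extra noise from the Gibbs draws is what keeps the precisions from locking onto extreme values.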
Figure 3.5: (a) Scatterplot of user/item level errors and precisions after the third full parameter update in the variational algorithm. (b) Correlation between (log transformed) user/item level errors and precisions over parameter updates.
3.6.5 Side Features with Precisions
When side features are included in the model, the incremental gain from using a robust precision model
is lost; see Table 3.1. The Gibbs sampler converges to the same RMSE value for both the constant and
robust precision model.
A closer examination of the test error over iterations shows subtle differences in how the common
converged value is obtained. In Figure 3.6, we observe the constant model sees a more substantial drop
in the first 50 iterations, after which the incremental gain is minor. The robust model takes longer to
converge, outperforming the constant model after approximately 80 iterations.
We wish to compare the result of including side features in the matrix factorization model, Table 2.3
in Chapter 2, to the result of including precisions in the model, Table 3.2. We see that the largest
relative improvement of 0.42% by the precisions for the top 10% of users is comparable to the gains
made by the inclusion of side features for the first five bins, corresponding to half of the MovieLens test
set. This highlights the importance of a model that makes accurate predictions for rare users. Significant
gains overall may be the result of gains for a small selection of users, as is the case for the inclusion of
precisions in the model. Alternatively, significant gains overall may be the result of gains for a larger
selection of users, as is the case for the inclusion of side features in the model.
Further, the relative changes in Table 3.2 are not uniformly an improvement for the model with
precisions over the model without. For some sets of users, such as the most frequent, the model with
precisions improves over the baseline. For other sets of users, such as those with approximately 200
ratings, the model with precisions performs worse than the baseline. This highlights that the inclusion
of precisions is not necessarily benefiting frequent or infrequent users. Indeed, it is not clear how
precisions are influencing predictive power in any systematic way.
Table 3.2: Test RMSE broken down by user frequency under Gibbs sampling for the models with andwithout user / item precisions when side features are included.
Number of Ratings   Constant   Robust   Relative Change (%)
≤ 25                0.9035     0.9031   +0.05
26 − 71             0.8608     0.8615   −0.08
72 − 146            0.8494     0.8478   +0.19
147 − 171           0.8459     0.8449   +0.11
172 − 301           0.8155     0.8165   −0.12
302 − 484           0.8243     0.8250   −0.08
485 − 829           0.8107     0.8098   +0.10
830 − 2,313         0.7474     0.7443   +0.42
Figure 3.6: Test RMSE for the Gibbs sampler for the models with side features under the constant androbust precision model.
3.7 Conclusion
The variational algorithm exhibited pathological behaviour with respect to user and item precisions. In
optimizing the variational lower bound, the algorithm drove a subset of precisions to arbitrarily small
values, and another subset to arbitrarily large values. Based on this, we investigated if bounding the
precisions had any influence on predictive performance. We replaced the Gamma priors by truncated
Gamma priors, and compared the performance of the variational algorithm for different bounds. In
changing the precision bounds monotonically, a non-monotonic change in the performance of the trun-
cated models was observed over the constant model. It was noted that some bounds do outperform both
the constant and the robust precision models. Further work could investigate automated ways to select
the precision bounds.
We noted that the gain from modeling user and item level precision is not significant when we move
to a model class that includes side features. The same predictive performance is obtained overall. In
addition, there is near equality in predictive performance within sets of users of different frequency.
Chapter 4
Meta-Constrained Latent User Features
4.1 Introduction
The low rank matrix assumption for collaborative filtering starts with a (user, item) matrix of preferences R ∈ R^{N×M} and factorizes it as the product of two low rank matrices R = U^⊤V, where U ∈ R^{d×N} and V ∈ R^{d×M}. Each column of U is the latent feature of a user, each column of V the latent feature of an item, and the r_{i,j} entry can be reconstructed as the inner product U_i^⊤ V_j, where U_i is the ith column of U and V_j is the jth column of V. The problem of estimating U and V can be approached as an incomplete SVD problem.
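The factorization can be sketched in a few lines of NumPy; the sizes here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, M = 2, 4, 5                 # latent dimension, users, items
U = rng.normal(size=(d, N))       # user features as columns
V = rng.normal(size=(d, M))       # item features as columns
R_hat = U.T @ V                   # full reconstructed rating matrix

# Every entry is the inner product of the matching user and item columns,
# and the reconstruction has rank at most d.
assert np.isclose(R_hat[1, 3], U[:, 1] @ V[:, 3])
assert np.linalg.matrix_rank(R_hat) <= d
```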
In the probabilistic framework, ri,j is modeled as a Gaussian with mean U>i Vj , and each column of
U, V has an independent Gaussian prior placed on it. The mean is therefore a linear combination of
the inner product of user and item feature components. The strength of this method therefore depends
on the ability of the latent feature model to capture concrete features of the items. Prior work has
shown that explicitly incorporating auxiliary information about individuals can be very predictive of
personality traits [16]. As personal taste in movies, items, and social networks are aspects of personality,
this auxiliary information may be predictive in the collaborative filtering context.
In this chapter, we consider using auxiliary information on users to introduce constraints between
the user features. The constraint differs from the vanilla constrained probabilistic matrix factorization
model reviewed and extended in Chapter 2 [32]. The previous extension was based on additional latent
features, one for each item, while our method is based on incorporating select auxiliary information
on users. The difference is illustrated by considering the relationship of the side features to the user
features and to the rating. In the vanilla constrained PMF model, the user features are independent
of the side features in the prior, and the rating is dependent on the side features in the prior. In our
proposed model, the user features are dependent on the side / meta features in the prior, and the rating
is conditionally independent of the side features given the user features. A graphical model comparing
the proposed model to the vanilla constrained PMF is in Figure 4.1. Solid lines and nodes are present in
both models, dashed lines / nodes are present in the vanilla constrained PMF model, and dotted lines
/ nodes are included in the meta constrained model.
Relative to the vanilla constrained PMF model, our proposed model has lower time and space complexity.
Figure 4.1: Graphical model comparing the proposed meta-constrained model to the vanilla constrained PMF model.
For a low rank matrix factorization model in d dimensions with N users, M items, and dm
auxiliary features, the vanilla constrained PMF model has (N + 2M)d parameters to sample, while the proposed model has (N + M + dm)d parameters, where dm is the number of user meta features included in modeling. We would expect that M ≫ dm, since there are typically many more items in the system
than auxiliary user features that would be relevant for modeling purposes. In addition, estimation of the
auxiliary user features in our proposed model admits a simpler form than the side features in the vanilla
constrained PMF model. In particular, the posterior sampling distribution for Gibbs does not require
pooling ratings over both users and items, and matrix calculus admits a simple gradient to update all
user meta features in parallel.
4.2 Exploratory Analysis
Prior to developing a probabilistic recommendation model to incorporate personal user attributes, it is
prudent to determine if such personal attributes may actually be informative in the recommendation
framework. To answer this, we looked at the MovieLens 1M data set. This data set has frequently been
used as an experimental data set for recommendation systems, including matrix factorization models.
This data set includes several personal attributes on users, including age, gender, occupation, and zip
code. In the provided data set, both age and occupation are categorized. Age is binned into intervals,
and occupation is binned into groups. See Section 4.5 for more details on the auxiliary user information.
Much work in the literature has ignored these labels on users, focusing on purely latent-based matrix factorization models.
To investigate if user attributes may be beneficial to incorporate in the probabilistic recommendation
framework, we learned a baseline matrix factorization model r̂_{i,j} = U_i^⊤ V_j consisting of user and item
features only. Learning was achieved through batch gradient descent on γi, ηj , Ui, Vj on a sum-of-squared
error term with quadratic regularizers on the features and offsets:
\[
\ell = \sum_{i=1}^{N} \sum_{j=1}^{M} I_{i,j}(r_{i,j} - \hat{r}_{i,j})^2
+ \lambda_U \sum_{i=1}^{N} U_i^\top U_i
+ \lambda_\gamma \sum_{i=1}^{N} \gamma_i^2
+ \lambda_V \sum_{j=1}^{M} V_j^\top V_j
+ \lambda_\eta \sum_{j=1}^{M} \eta_j^2. \tag{4.1}
\]
Two dimensional latent features Ui, Vj were used so that the resulting MAP estimate could be easily
visualized.
Given the two-dimensional MAP estimate for the latent features, we extracted meta information on
the users corresponding to age, gender, and occupation. This yielded for each user i a binary label vector fi ∈ {0, 1}^{dm}, where dm is the amount of meta information extracted. Under the hypothesis that these
labels were informative, we would expect the user features Ui to have different distributions for different
labels fi. In the probabilistic framework, this can be formalized as follows. Let Zi ∈ N be a cluster label
for user i. Each assignment Zi = zi uniquely corresponds to a configuration for a binary label vector fi.
The distribution of Ui is then multivariate Gaussian, conditional on Zi,
(Ui | Zi) ∼ N (Ui | µU (Zi),ΛU (Zi)). (4.2)
Equivalently, since Zi is uniquely determined by fi
(Ui | Zi) ∼ N (Ui | µU (fi),ΛU (fi)). (4.3)
To qualitatively explore this, we group the latent user features from the MAP estimate based on
different configurations of the label vectors fi. For each group, we obtained maximum likelihood estimates of the mean and precision matrix. In the MovieLens data set, there are approximately 250 different configurations of fi among the 6,040 users. This yields estimates of the distribution we define
in Equation (4.3). We plot two typical subsets of these estimates in Figure 4.2. From these plots, we
can see there is qualitative evidence to suggest that both the mean and the precision for the user’s latent
feature can vary depending on the configuration fi. In the context of recommendation systems, different
users can have clusters of preferences, and hence should have clusters of latent features, based on user
meta information.
For d-dimensional latent feature vectors, the mean requires O(d) parameters and the precision matrix
requires O(d2) parameters. Many configurations fi can be rare, as they uniquely determine a specific
combination of user traits. For instance, “males aged 20-29 who are currently students” is represented
by a different fi than “males aged 20-29 who are currently baristas”. Given this, we will consider a
simplified probabilistic model where only the mean is influenced by the user meta data fi.
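The grouping step of this exploratory analysis can be sketched as follows, with `U` a hypothetical (N, d) array of MAP user features and `labels` the per-user configurations of f_i.

```python
import numpy as np

def groupwise_gaussian_mle(U, labels):
    """Fit a Gaussian to the latent features of each meta-label group.

    U is (N, d); labels is a length-N sequence of hashable configurations
    (e.g. tuples of the binary f_i). Returns {label: (mean, precision)},
    using the maximum likelihood (biased) covariance estimate.
    """
    fits = {}
    for lab in set(labels):
        G = U[[i for i, l in enumerate(labels) if l == lab]]
        cov = np.cov(G, rowvar=False, bias=True)
        fits[lab] = (G.mean(axis=0), np.linalg.inv(cov))
    return fits
```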
4.3 Model
We discuss the model in the context of a recommendation system for items to users, as that is a
common application and also the context of the data sets used in this thesis. However, the model can
be generalized to the setting of preference matching between two arbitrary sets.
Figure 4.2: MovieLens data set: Two prototypical examples of the distribution of latent user features obtained from a two-dimensional MAP estimate, when grouped by certain user meta information (age, gender, occupation). In both subsets, we can qualitatively see differences in both the mean and precision for the latent user features in each group.
Figure 4.3: Bayesian Meta Constrained Probabilistic Matrix Factorization with Gaussian-Wishart Priorsover the latent user, item, and side feature vectors. The user precision factors αi and item precisionfactors βj allows for non-constant variance in the observed preference ri,j . The extension to scaled
Suppose we have a system with N users and M items, where each user provides a rating on a subset
of items. If user i provides a rating for item j, we denote the rating by ri,j . These ratings form a sparse
rating matrix R ∈ R^{N×M}. Users index rows and items index columns in this rating matrix. The matrix is sparse since each user i typically rates a subset Ni ⊂ {1, . . . , M} of items, where |Ni| ≪ M.
Standard matrix factorization methods find a low rank approximation to the rating matrix. Each
row / user i is assigned a latent feature vector Ui, each item j is assigned a latent feature vector Vj ,
and the rating by user i of item j is reconstructed by the inner product U>i Vj . In the probabilistic
framework, the rating is modeled as Gaussian conditional on the features, and the features are also
modeled as Gaussian,
\[
\begin{aligned}
(r_{i,j} \mid U_i, V_j) &\sim \mathcal{N}(r_{i,j} \mid U_i^\top V_j, \tau) \\
(U_i \mid \mu_U, \Lambda_U) &\sim \mathcal{N}(U_i \mid \mu_U, \Lambda_U) \\
(V_j \mid \mu_V, \Lambda_V) &\sim \mathcal{N}(V_j \mid \mu_V, \Lambda_V).
\end{aligned} \tag{4.4}
\]
The Bayesian extension places Gaussian-Wishart priors on the feature means and precision matrices.
\[
\begin{aligned}
(\mu_U, \Lambda_U) &\sim \mathcal{N}(\mu_U \mid \mu_0, \beta_0\Lambda_U) \cdot \mathcal{W}(\Lambda_U \mid \nu_0, \Lambda_0) \\
(\mu_V, \Lambda_V) &\sim \mathcal{N}(\mu_V \mid \mu_0, \beta_0\Lambda_V) \cdot \mathcal{W}(\Lambda_V \mid \nu_0, \Lambda_0).
\end{aligned} \tag{4.5}
\]
Suppose we have auxiliary information for the users. We assume that this information can be
categorized and encoded in a binary vector fi ∈ {0, 1}^{dm×1} for each user i. While this may appear
restrictive, it is the case for several popular examples (ex: gender, Facebook Likes, post-secondary
education, Twitter followers, etc). For continuous information (ex: age), one can discretize it sensibly
and obtain a binary vector. We will use this auxiliary information to constrain the prior means for
the user feature vectors. For convenience, we can stack the fi in a matrix as columns to obtain an
dm × N matrix f. This matrix notation will be used in deriving the gradient descent updates. In f,
each column provides all the information for a single user, and each row provides the information for a
single attribute for all users. From this point, we will refer to the auxiliary features, or user attributes, as meta features.
For each meta feature k ∈ 1, . . . , dm, let Wk ∈ Rd be an associated latent feature for the presence
/ absence of the meta feature. We place the same prior distribution on Wk as on the user and item
features:
\[
\begin{aligned}
(W_k \mid \mu_W, \Lambda_W) &\sim \mathcal{N}(W_k \mid \mu_W, \Lambda_W) \\
(\mu_W, \Lambda_W) &\sim \mathcal{N}(\mu_W \mid \mu_0, \beta_0\Lambda_W) \cdot \mathcal{W}(\Lambda_W \mid \nu_0, \Lambda_0).
\end{aligned} \tag{4.6}
\]
We place the same Gaussian-Wishart prior on these side features as placed on the user and item features in Equation (4.5).
With respect to the PMF model, the Wk are used to shift the prior mean for the user features Ui.
The net effect is that users with the same meta information a priori should have features with the same
mean. In the probabilistic framework, we encode this with the distribution for Ui as
\[
(U_i \mid \mu_U, \Lambda_U, W_{1:d_m}, f_{1:N}) \sim \mathcal{N}\left( U_i \;\Big|\; \mu_U + \|f_i\|^{-1} \sum_{k=1}^{d_m} W_k f_{k,i},\; \Lambda_U \right). \tag{4.7}
\]
Here fk,i is the (k, i) element of f . Note that if a user has no meta information, then the sum is
empty, we define the prior mean to be µU , and the model reverts to the vanilla PMF model.
With this, the model is defined as
\[
\begin{aligned}
(r_{i,j} \mid \cdots) &\sim \mathcal{N}(r_{i,j} \mid U_i^\top V_j, \tau) \\
(U_i \mid \cdots) &\sim \mathcal{N}\Big(U_i \;\Big|\; \mu_U + \|f_i\|^{-1} \sum_k W_k f_{k,i},\; \Lambda_U\Big) \\
(V_j \mid \cdots) &\sim \mathcal{N}(V_j \mid \mu_V, \Lambda_V) \\
(W_k \mid \cdots) &\sim \mathcal{N}(W_k \mid \mu_W, \Lambda_W) \\
(\mu_U, \Lambda_U) &\sim \mathcal{N}(\mu_U \mid \mu_0, \beta_0\Lambda_U) \cdot \mathcal{W}(\Lambda_U \mid \nu_0, \Lambda_0) \\
(\mu_V, \Lambda_V) &\sim \mathcal{N}(\mu_V \mid \mu_0, \beta_0\Lambda_V) \cdot \mathcal{W}(\Lambda_V \mid \nu_0, \Lambda_0) \\
(\mu_W, \Lambda_W) &\sim \mathcal{N}(\mu_W \mid \mu_0, \beta_0\Lambda_W) \cdot \mathcal{W}(\Lambda_W \mid \nu_0, \Lambda_0).
\end{aligned} \tag{4.8}
\]
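The shifted prior mean for a user can be sketched as below; we assume here that ‖f_i‖ counts the user's active attributes (an averaging normalization; the exact norm used may differ), and all array shapes are illustrative.

```python
import numpy as np

def constrained_prior_mean(mu_U, W, f_i):
    """Prior mean mu_U + ||f_i||^{-1} sum_k W_k f_{k,i} for a user's features.

    W is (d, d_m) with the meta-feature vectors W_k as columns and f_i a
    binary attribute vector. A user with no meta information falls back to
    mu_U, recovering the vanilla PMF prior.
    """
    n_active = f_i.sum()
    if n_active == 0:
        return mu_U.copy()
    return mu_U + (W @ f_i) / n_active
```

Users with identical meta information thus share a prior mean for their latent features.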
In Chapter 2, the importance of user and item offsets was discussed. These user and item offsets
were used in the modeling there and in Chapter 3. For simplicity, user and item offsets are not used in
the model for these experiments. However, we briefly outline a way in which the user offset can be tied
through this additional user information.
Let ω ∈ Rdm×1 be a vector with distribution
\[
(\omega \mid \mu_{\omega_0}, \Lambda_{\omega_0}) \sim \mathcal{N}(\omega \mid \mu_{\omega_0}, \Lambda_{\omega_0}). \tag{4.9}
\]
Using this vector, we define the user offset as
\[
(\gamma_i \mid \mu_\gamma, \lambda_\gamma, f_i, \omega) \sim \mathcal{N}(\gamma_i \mid \mu_\gamma + \omega^\top f_i, \lambda_\gamma). \tag{4.10}
\]
With these offsets included, the rating ri,j would be modeled as
\[
\mathbb{E}[r_{i,j}] = \gamma_i + \eta_j + U_i^\top V_j. \tag{4.11}
\]
4.4 Inference
This extension of the vanilla PMF model is conjugate, which permits the use of Gibbs sampling for
inference. Many Gibbs sampling distributions are identical to the vanilla PMF model, by conditional
independence. The derivations that follow are provided in detail in Appendix E.
The user features have a new mean, to reflect the shift from the side features:
\[
\begin{aligned}
(U_i \mid r_{i,j}, \cdots) &\sim \mathcal{N}(U_i \mid \tilde{\mu}_{U_i}, \tilde{\Lambda}_{U_i}) \\
\tilde{\Lambda}_{U_i} &= \Lambda_U + \tau \sum_{j=1}^{M} I_{i,j} V_j V_j^\top \\
\tilde{\mu}_{U_i} &= \tilde{\Lambda}_{U_i}^{-1} \left[ \tau \sum_{j=1}^{M} I_{i,j} r_{i,j} V_j + \Lambda_U \Big( \mu_U + \|f_i\|^{-1} \sum_{k=1}^{d_m} W_k f_{k,i} \Big) \right]. \tag{4.12}
\end{aligned}
\]
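A sketch of one such draw, using standard conjugate Gaussian algebra (array shapes and the helper name are ours, not the thesis's):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_user_feature(V_rated, r_rated, tau, Lambda_U, prior_mean):
    """One Gibbs draw of a user feature vector U_i.

    V_rated is (n_i, d), the features of the items this user rated, and
    prior_mean is the meta-shifted prior mean of U_i. The posterior
    precision and mean follow the usual conjugate Gaussian update.
    """
    post_prec = Lambda_U + tau * V_rated.T @ V_rated
    rhs = tau * V_rated.T @ r_rated + Lambda_U @ prior_mean
    post_mean = np.linalg.solve(post_prec, rhs)
    # Sample N(post_mean, post_prec^{-1}) via a Cholesky factor of the precision.
    L = np.linalg.cholesky(post_prec)
    return post_mean + np.linalg.solve(L.T, rng.standard_normal(len(post_mean)))
```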
The sampling distribution for the meta features Wk in this model is also Gaussian,
Chapter 4. Meta-Constrained Latent User Features 47
Table 4.1: Number of ratings, users, and items for the MovieLens and Flixster data sets used for modeltesting.
                    MovieLens   Flixster
Number of Ratings   1,000,209   8,196,077
Number of Users     6,040       147,612
Number of Items     3,952       48,794
\[
\begin{aligned}
(W_k \mid r_{i,j}, \cdots) &\sim \mathcal{N}(W_k \mid \tilde{\mu}_{W_k}, \tilde{\Lambda}_{W_k}) \\
\tilde{\Lambda}_{W_k} &= \Lambda_W + \Lambda_U \sum_{i=1}^{N} \|f_i\|^{-2} f_{k,i}^2 \\
\tilde{\mu}_{W_k} &= \tilde{\Lambda}_{W_k}^{-1} \left[ \Lambda_W \mu_W + \Lambda_U \sum_{i=1}^{N} \|f_i\|^{-1} f_{k,i} \Big( U_i - \mu_U - \|f_i\|^{-1} \sum_{j \neq k} W_j f_{j,i} \Big) \right]. \tag{4.13}
\end{aligned}
\]
The sampling distribution for (µ_U, Λ_U) is identical to the baseline Bayesian PMF model, with U_i replaced by the “shifted” equivalent U_i − ‖f_i‖^{-1} Σ_{k=1}^{dm} W_k f_{k,i}.
If considering the model with user and item offsets, the sampling distribution for γi is Gaussian,
parametrized as
\[
\begin{aligned}
(\gamma_i \mid \cdots) &\sim \mathcal{N}(\gamma_i \mid \tilde{\mu}_{\gamma_i}, \tilde{\lambda}_{\gamma_i}) \\
\text{where } \tilde{\lambda}_{\gamma_i} &= \lambda_\gamma + \tau \sum_{j=1}^{M} I_{i,j} \\
\text{and } \tilde{\mu}_{\gamma_i} &= \tilde{\lambda}_{\gamma_i}^{-1} \left[ \tau \sum_{j=1}^{M} I_{i,j}(r_{i,j} - \eta_j - U_i^\top V_j) + \lambda_\gamma(\mu_\gamma + \omega^\top f_i) \right]. \tag{4.14}
\end{aligned}
\]
The sampling distribution for the offset vector ω is
\[
\begin{aligned}
(\omega \mid \cdots) &\sim \mathcal{N}(\omega \mid \tilde{\mu}_\omega, \tilde{\Lambda}_\omega) \\
\text{where } \tilde{\Lambda}_\omega &= \Lambda_{\omega_0} + \sum_{i=1}^{N} \lambda_{\gamma_i} f_i f_i^\top \\
\text{and } \tilde{\mu}_\omega &= \tilde{\Lambda}_\omega^{-1} \left[ \Lambda_{\omega_0} \mu_{\omega_0} + \sum_{i=1}^{N} f_i (\gamma_i - \gamma_0) \lambda_{\gamma_i} \right]. \tag{4.15}
\end{aligned}
\]
Again, the experimental results reported here do not include user or item offsets. These derivations
are included for completeness.
4.5 Experimental Setup
We make use of the MovieLens 1M data set1 and the Flixster data set2 for experimentation. Statistics
on these data sets are given in Table 4.1.
1 http://grouplens.org/datasets/movielens/
2 http://www.cs.ubc.ca/~jamalim/datasets/
Additional sparsity was created by using a small portion of data for training. Training used 10%
of the MovieLens data and 15% of the Flixster data set. Gibbs sampling was initiated using MAP
estimates obtained as described in Section 4.7. An appropriate set of tuning parameters was used to
obtain suitable MAP estimates, but no exhaustive grid search was performed.
The small amount of training data highlights the impact of the more complicated prior in the absence
of rating behaviour of the users. Experiments were additionally performed with larger amounts of
training data. The results indicated that the proposed model and the baseline model produced nearly
identical test RMSE values with larger amounts of training data.
We used the following choices of hyper-parameters. For the Gaussian-Wishart priors on the user feature vectors, (µ0, Λ0, β0, ν0) = (0_{d×1}, I_d, 1, d + 1). The mean value was chosen to reflect that the
features are mean zero after accounting for the biases, while the values for the scale matrix and degrees
of freedom were selected to give a vague prior that was still proper. This was chosen for consistency
with previous experimental results. Identical choices were made for the item and meta features. User
offsets and item offsets were not included in these experiments.
4.6 User Meta Information
The MovieLens 1M data set contains gender, age, occupation, and geographical information (zip code)
on the users. Age was ordinal in seven mutually exclusive ranges. Occupation was categorical in 21
categories. For each user i, we defined fi as a binary bit vector for gender, age, and occupation. We
excluded zip code for these results. This produces a binary (user, demographic) matrix of auxiliary meta
information for the 6,040 users across 30 demographics.
The Flixster data set provides gender, location, the date the user joined, the last login date, the
number of profile views, and age. Some of this data is missing for some users (ex: age is not always
available), and some is meaningless. For instance, the data format suggests that location was a free-form
response, as some values include “my house”, “on the earth”, “Texas”, “Independence Avenue”, and “its
[sic] a secret :)”. For our experiments, we first removed any user with missing age, and then used gender
and age for auxiliary information. We binned age into intervals of size 10 starting from the youngest to
oldest age in the data set. This yielded categorical bins for age of 10− 19, 20− 29, . . . , 110− 120. This
produces a binary (user, demographic) matrix of auxiliary meta information for the 147,612 users across
13 demographics.
4.6.1 PCA for User Meta Information
The categories of meta information are highly dependent as several pairs are mutually exclusive. For
instance, a user cannot have multiple declared ages. This means there is structure in the meta demo-
graphic matrix to consider. Following work in the literature for similar information [16], an experiment
was performed for each data set making use of the output of principal components. Principal compo-
nent analysis was used as an additional optional feature engineering step on the demographic matrix for
each data set. The original user demographics fi were projected onto the first k principal components,
where k was selected based on a screeplot of the eigenvalues of the demographic correlation matrix. The
screeplots for these two data sets are in Figure 4.4 for reference. The projected fi were then used as
input into the model.
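The projection step can be sketched with plain NumPy (the function name is ours, and we assume no demographic column is constant):

```python
import numpy as np

def project_demographics(F, k):
    """Project the binary (user, demographic) matrix F of shape (N, d_m)
    onto the first k principal components of the demographic correlation
    matrix, returning the (N, k) features used in place of the raw f_i."""
    Z = (F - F.mean(axis=0)) / F.std(axis=0)   # standardize each column
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    top = np.argsort(eigvals)[::-1][:k]        # indices of the k largest
    return Z @ eigvecs[:, top]
```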
Figure 4.4: Screeplots of the MovieLens 1M and Flixster meta information
As in other applications, the use of PCA can significantly reduce the number of explanatory variables
needed by the model. In our particular case, we report results using the first six principal components
for each data set. This is a reduction of 80% from the original 30 variables in MovieLens, and a reduction
of 54% from the original 13 variables in Flixster.
4.7 MAP Estimate
We start from a MAP estimate for the offset and the latent features. This MAP estimate is obtained by
batch gradient descent on the following objective function
\[
\ell(\gamma_{1:N}, \eta_{1:M}, \omega, U_{1:N}, V_{1:M}, W_{1:d_m}) \equiv \ell
= \sum_{i=1}^{N} \sum_{j=1}^{M} I_{i,j}(r_{i,j} - U_i^\top V_j)^2
+ \lambda_U \sum_{i=1}^{N} (U_i - \|f_i\|^{-1} W f_i)^\top (U_i - \|f_i\|^{-1} W f_i)
+ \lambda_V \sum_{j=1}^{M} V_j^\top V_j
+ \lambda_W \sum_{k=1}^{d_m} W_k^\top W_k. \tag{4.16}
\]
By letting W ∈ R^{d×dm}, where the kth column of W is Wk, Equation (4.16) can be equivalently expressed as
\[
\ell = \sum_{i=1}^{N} \sum_{j=1}^{M} I_{i,j}(r_{i,j} - U_i^\top V_j)^2
+ \lambda_U \sum_{i=1}^{N} (U_i - \|f_i\|^{-1} W f_i)^\top (U_i - \|f_i\|^{-1} W f_i)
+ \lambda_V \sum_{j=1}^{M} V_j^\top V_j
+ \lambda_W \,\mathrm{Trace}(W^\top W). \tag{4.17}
\]
The gradients with respect to Ui, Vj , and the matrix W are:
\[
\begin{aligned}
\frac{\partial \ell}{\partial U_i} &= -\sum_{j=1}^{M} I_{i,j}(r_{i,j} - U_i^\top V_j)\, V_j + \lambda_U (U_i - \|f_i\|^{-1} W f_i) \\
\frac{\partial \ell}{\partial V_j} &= -\sum_{i=1}^{N} I_{i,j}(r_{i,j} - U_i^\top V_j)\, U_i + \lambda_V V_j \\
\frac{\partial \ell}{\partial W} &= -\lambda_U \sum_{i=1}^{N} (U_i - \|f_i\|^{-1} W f_i)\, \|f_i\|^{-1} f_i^\top + \lambda_W W. \tag{4.18}
\end{aligned}
\]
Note that this formulation allows us to update all the Wk simultaneously.
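One full batch update can then be written in vectorized form; this is our sketch of the updates in Equation (4.18) (names and the learning rate `lr` are illustrative), with the meta vectors pre-scaled by their norms so that all Wk move in a single matrix product.

```python
import numpy as np

def map_gradient_step(R, I, U, V, W, Fn, lam_U, lam_V, lam_W, lr):
    """One batch gradient descent step for the MAP objective.

    R, I: (N, M) ratings and observed-entry indicators. U: (d, N) user
    features, V: (d, M) item features, W: (d, d_m) meta features. Fn is
    (d_m, N) holding the pre-scaled columns f_i / ||f_i||.
    """
    E = I * (R - U.T @ V)          # residuals on observed entries only
    shift = U - W @ Fn             # U_i - ||f_i||^{-1} W f_i for all users
    dU = -V @ E.T + lam_U * shift
    dV = -U @ E + lam_V * V
    dW = -lam_U * shift @ Fn.T + lam_W * W
    return U - lr * dU, V - lr * dV, W - lr * dW
```

Updating W through one matrix product is what allows all meta features to be updated simultaneously.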
4.8 Experimental Results
Table 4.2: MovieLens. PCA used the first six principal components.
Frequency    Baseline   Meta     Relative Change   PCA Meta   Relative Change
Overall      0.9292     0.9221   −0.76%            0.9243     −0.53%
[0 − 1)      1.0006     1.0005   −0.01%            1.0039     +0.33%
[1 − 2)      1.0201     1.0094   −1.05%            1.0164     −0.36%
[2 − 3)      1.0185     1.0143   −0.41%            1.0174     −0.11%
[3 − 4)      0.9872     0.9857   −0.15%            0.9857     −0.15%
[4 − 5)      0.9585     0.9536   −0.51%            0.9577     −0.08%
[5 − 10)     0.9665     0.9604   −0.63%            0.9627     −0.39%
[10 − 25)    0.9402     0.9343   −0.63%            0.9359     −0.46%
[25 − 50)    0.9151     0.9077   −0.81%            0.9057     −0.59%
[50 − 100)   0.9053     0.8973   −0.88%            0.8998     −0.61%
100+         0.9446     0.9441   −0.05%            0.9440     −0.06%
Table 4.2 lists the test RMSE values obtained using the MovieLens 1M data, and Table 4.3 lists the
test RMSE values obtained using the Flixster data.
For both data sets, the model with meta constraints (‘Meta’) for the users outperforms the baseline
PMF model (‘Baseline’), both overall in the test set and for sets of users with different frequencies of
ratings in the training set. The relative changes are all negative, reflecting a lower test RMSE for the
model with meta constraints relative to the baseline PMF model. These relative changes are small,
but on the same order of magnitude as the relative changes for the original constrained PMF model in
Table 4.3: Flixster. PCA used the first six principal components.

| Frequency | Baseline | Meta   | Relative Change | PCA Meta | Relative Change |
|-----------|----------|--------|-----------------|----------|-----------------|
| Overall   | 0.9066   | 0.8952 | -1.26%          | 0.8953   | -1.25%          |
| [0-1)     | 1.1187   | 1.1038 | -1.33%          | 1.1093   | -0.84%          |
| [1-2)     | 1.0435   | 1.0362 | -0.70%          | 1.0389   | -0.44%          |
| [2-3)     | 0.9990   | 0.9916 | -0.74%          | 0.9940   | -0.50%          |
| [3-4)     | 0.9820   | 0.9737 | -0.85%          | 0.9763   | -0.58%          |
| [4-5)     | 0.9536   | 0.9452 | -0.88%          | 0.9467   | -0.72%          |
| [5-10)    | 0.9546   | 0.9447 | -1.04%          | 0.9455   | -0.95%          |
| [10-25)   | 0.9429   | 0.9315 | -1.21%          | 0.9324   | -1.11%          |
| [25-50)   | 0.9180   | 0.9065 | -1.25%          | 0.9062   | -1.29%          |
| [50-100)  | 0.8844   | 0.8737 | -1.21%          | 0.8736   | -1.22%          |
| 100+      | 0.7998   | 0.7840 | -1.98%          | 0.7827   | -2.14%          |
Chapter 2.
Larger gains are seen for the Flixster data set relative to the MovieLens data set. This suggests that
the auxiliary information might be more relevant for predicting ratings in the Flixster data set.
The results of the third experiment with the demographics projected onto principal components are
reported in Tables 4.2 - 4.3 under the ‘PCA Meta’ column. There are some important observations.
First, the PCA approach produces test RMSE values almost always better than the baseline model.
The sole exception is users with no ratings in the MovieLens data set. For this group of users, the
PCA approach produces a test RMSE that is 0.33% larger than the baseline.
Second, the relative changes are generally smaller in magnitude when compared to the relative
changes of the ‘Meta’ model. This is to be expected, as the PCA approach uses only the first k = 6
principal components for each data set. This dimensionality reduction from the original 30 variables for
MovieLens, and 13 variables for Flixster, produces information loss. Still, it is encouraging that a reduced set of features generated through PCA can produce favourable test results.
The overall test error for both data sets is in the left column of Figure 4.5. For MovieLens, the test
error under the baseline, meta, and PCA models starts at approximately the same numerical value. The
overall test error under the ‘Meta’ model drops at a clearly larger rate than either the baseline or the
PCA model. This rapid drop reflects the better mixing afforded by the meta constraints.
For Flixster, the effect on the overall test error is even more pronounced. The bottom-left panel in Figure 4.5
illustrates that both the ‘Meta’ and ‘PCA’ models initially have a higher overall test RMSE. However,
the test RMSE under both these models drops much more quickly than under the baseline model, and
converges to a clearly lower value. This further illustrates the benefit of the proposed model.
In the MovieLens data set, the test RMSE for cold start users drops for all three models. As
mentioned, the cold start error under MovieLens is lowest under the baseline model. Figure 4.5 (top-
right) illustrates this. However, this figure also illustrates that the ‘Meta’ model drops in test error sooner
and more rapidly than the baseline model. As the relative increase for the ‘Meta’ model relative to the
baseline model is 0.01%, the significance (both practical and statistical) of this increase is questionable.
In the Flixster data set, the baseline model is unable to improve over the MAP estimate for cold
start users. However, the ‘Meta’ model is able to improve over the MAP estimate, and over the baseline
model. This is again despite the MAP estimate for the ‘Meta’ model producing worse test error than
the baseline model. The cold start error for the PCA approach initially produces an error rate similar
to the baseline model. However, the PCA approach is able to improve on this MAP estimate error rate,
Figure 4.5: (Left) Overall test RMSE and (right) RMSE on the cold start user subset for (top) MovieLens1M and (bottom) Flixster data sets.
[Figure: panels (a) MovieLens, overall; (b) MovieLens, no ratings; (c) Flixster, overall; (d) Flixster, no ratings; each plotting test RMSE against iteration for the baseline, meta, and SVD runs.]
in contrast to the baseline model error rate which does not improve.
Similar observations hold for users with few ratings in the system. Figure 4.6 illustrates the change
in test error for users with one rating (left panel) and two ratings (right panel) for both data sets. The
relative performance and rates of decline are similar to the case of users with no ratings and the entire
test set. The plots for Flixster indicate that even one rating in the system is enough to produce a
smooth decline in test error for all three models. In particular, the test error for the baseline model now
decreases smoothly from over 1.15 to 1.1187. However, the Meta model and PCA model still outperform
the baseline, achieving a test error of 1.1038 and 1.1093, respectively.
4.9 Conclusion
In this chapter, we started with an exploratory analysis of the two-dimensional user feature vectors
obtained from MAP estimation for a data set. We performed supervised clustering of these user feature
vectors using labels generated from user demographic information. Based on the distributional differences observed, we proposed a model that uses this demographic information to capture those differences.
Our proposed model outperforms the baseline PMF model with respect to test RMSE overall, and
for users with different frequency in the system at training time. In particular, our model improves
predictive accuracy for users with no ratings in the system.
Figure 4.6: Test errors for users with few ratings in the system.
[Figure: panels (a) MovieLens, 1 rating; (b) MovieLens, 2 ratings; (c) Flixster, 1 rating; (d) Flixster, 2 ratings; each plotting test RMSE against iteration for the baseline, meta, and SVD runs.]
Mirroring existing work in the literature for similar problems, we proposed a modified model that
uses the demographic information projected onto the first k principal components. The number of
principal components was selected heuristically based on bends in scree plots. This approach showed
mixed results. For the MovieLens data set, the results were similar to the baseline model. For the
Flixster data set, the results were similar to the original ‘Meta’ model proposed. This difference in
performance may be explained by the dimensionality reduction being much larger for MovieLens than
for Flixster, in turn meaning a higher reconstruction error for MovieLens. The bend in the scree plots
suggested k = 6 principal components for both data sets, which is a reduction in variables by 80% for
MovieLens, compared to 54% for Flixster.
Future work could explore the sensitivity of the PCA approach to the number of components. In
addition, the initial EDA of the two-dimensional user feature MAP estimates suggested there may be
distributional differences in the covariance matrix. In the proposed model, only differences in the mean
were modeled using the demographics. Future work can look at similar constraints for the covariance
matrix.
Chapter 5

A Generative Model for User Network Constraints in Matrix Factorization
5.1 Introduction
At the core of recommender systems are two sets of objects, with the goal of recommending items in one
set to members of the other set. In some applications, these two sets are distinct (recommending
videos to users on YouTube, recommending apps to users on Google Play), while in other applications
the two sets coincide (recommending people to people on social networking sites such as
Facebook, LinkedIn, OkCupid, etc.). Collaborative filtering systems solve this problem by
considering the preferences of other, similar users when making recommendations.
The low-rank matrix assumption for collaborative filtering starts with a (user, item) matrix of
preferences R \in \mathbb{R}^{N \times M} and factorizes it as the product of two low-rank matrices, R = U^\top V, where
U \in \mathbb{R}^{d \times N} and V \in \mathbb{R}^{d \times M}. Each column of U is the latent feature of a user, each column of V the latent
feature of an item, and the entry r_{i,j} can be reconstructed as the inner product U_i^\top V_j, where U_i is the
ith column of U, and V_j is the jth column of V.
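As a minimal sketch of this reconstruction (the dimensions below are arbitrary):

```python
import numpy as np

d, N, M = 2, 3, 4                  # latent dimension, number of users, items
rng = np.random.default_rng(0)
U = rng.standard_normal((d, N))    # column i: latent feature of user i
V = rng.standard_normal((d, M))    # column j: latent feature of item j

R = U.T @ V                        # rank-d reconstruction of the rating matrix
# Entry r_{i,j} is the inner product of the corresponding columns.
assert np.allclose(R[1, 2], U[:, 1] @ V[:, 2])
assert np.linalg.matrix_rank(R) <= d
```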
In the probabilistic framework, r_{i,j} is modeled as a Gaussian with mean U_i^\top V_j, and each column of
U and V has an independent Gaussian prior placed on it. The Bayesian extension creates a hierarchical
model by adding a Gaussian-Wishart prior for the Gaussian mean and precision. While computationally
convenient, this model assumes prior independence between the users and the items. In many practical
applications of collaborative filtering, there exists an underlying network defined between users. These
networks have been discussed in the literature as both social networks and trust networks. In the case of
social networks, an edge between users suggests social similarity. In the case of trust networks, an edge
between users suggests a trust relation. While these are both networks between users, it is important to
highlight the differences between the two.
Mathematically, a network between N users can be defined by a graph
\mathcal{G} = (\mathcal{V}, \mathcal{E}),
\quad \text{where } \mathcal{V} = \{ v_i \mid i \in \{1, \dots, N\} \}
\text{ and } \mathcal{E} = \{ a_{i,j} \mid i, j \in \{1, \dots, N\} \}.    (5.1)
In this notation, V is a set of N vertices / nodes, and each corresponds to a user in the system. E is
an indexed collection of edges between the N users. We further define the adjacency matrix
A = (a_{i,j})_{i,j \in \{1, \dots, N\}},    (5.2)
The (i, j) entry of this matrix, a_{i,j}, indicates the relationship between users i and j in the network,
corresponding to nodes i and j in the graph.
A user network may be either directed or undirected. In the case of Facebook,
where the social tie between two users (ui, uj) is symmetric, the network is undirected. With respect
to the adjacency matrix A, this means that ai,j = aj,i. In the case of Google Plus, where a user ui
can add another user uj to a “circle” without the need for a reciprocal action, the network is directed.
Again, with respect to the adjacency matrix A, this means that ai,j and aj,i may be different. To use
colloquial terms, to “friend” someone is an undirected edge, while to “follow” someone is a directed
edge. The literature commonly assumes that trust networks are directed. For undirected networks, the
adjacency matrix is necessarily symmetric. The adjacency matrix for directed networks is not necessarily
symmetric.
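In code, this distinction is simply a symmetry check on the adjacency matrix; the small networks below are toy examples of ours:

```python
import numpy as np

# Undirected ("friend") network: the adjacency matrix is symmetric.
A_friend = np.array([[0, 1, 0],
                     [1, 0, 1],
                     [0, 1, 0]])

# Directed ("follow") network: a_{i,j} may differ from a_{j,i}.
A_follow = np.array([[0, 1, 0],
                     [0, 0, 1],
                     [0, 0, 0]])

def is_undirected(A):
    """A network is undirected iff its adjacency matrix equals its transpose."""
    return np.array_equal(A, A.T)

assert is_undirected(A_friend)
assert not is_undirected(A_follow)
```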
In addition to trust, user networks can contain distrust statements. One example would be a “block
list” on social networking sites, or marketing sites. There has been recent work on incorporating distrust
statement in the matrix factorization model [5], but this work is not in the Bayesian context. The high
level idea is that users with positive relations should have more similar features than users with negative
relations. Specifically, consider three users u_i, u_j, u_k such that u_i and u_j have a favourable relationship
in the network (ex: a trust relationship or friendship), but u_i and u_k have an unfavourable relationship
in the network (ex: u_i blocked u_k). Then for some general loss function \ell and distance metric d, the
optimization problem includes the penalty

\ell \left( d(U_i, U_j) - d(U_i, U_k) \right)    (5.3)
in addition to the log-likelihood for the ratings and the standard regularization terms for the features.
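A sketch of the penalty in Equation (5.3), taking d to be Euclidean distance and \ell a hinge loss; both choices are our illustration, since the text leaves them general:

```python
import numpy as np

def distrust_penalty(U_i, U_j, U_k, loss=lambda x: max(x, 0.0)):
    """Penalize u_i being closer to a distrusted user u_k than to a trusted u_j.

    Computes loss(d(U_i, U_j) - d(U_i, U_k)) with Euclidean distance d;
    the penalty is positive when the trusted user is the more distant one.
    """
    d_trust = np.linalg.norm(U_i - U_j)
    d_distrust = np.linalg.norm(U_i - U_k)
    return loss(d_trust - d_distrust)

# Trusted neighbour is closer than the blocked user: no penalty.
assert distrust_penalty(np.zeros(2), np.array([0.1, 0.0]), np.array([5.0, 0.0])) == 0.0
```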
Previous work has considered matrix factorization with social regularization [15]. In this framework,
the L2 norm of each user feature Ui was constrained to be simultaneously close to zero and also close
to the average of the features for the users that user i was connected to in the social network. Gradient
descent on an objective function with dual user regularization showed that this improved predictive
performance for users with few or no ratings in the system.
Inference for these gradient-based methods performed well. However, these models will not extend
easily (if at all) to a fully generative framework. To illustrate this, we will outline in this chapter a
pseudo-extension, the most natural one given the existing framework. We will demonstrate how this
pseudo-extension leads to pathological issues with inference, which results in poor test performance. As
an alternative, we propose a fully generative alternative that shares the same social network dependency.
Our experimental results demonstrate the proposed model achieves comparable test performance under
MAP estimates obtained from gradient descent. We further illustrate that the fully Bayesian extension
outperforms the MAP estimates, reinforcing the results of previous chapters that fully generative models
achieve lower test error relative to the estimates achieved from gradient descent.
To outline, our contributions in this chapter are:
• We review the existing social network based matrix factorization models which are not properly
generative. We conjecture a pseudo-extension fully generative model, and illustrate pathological
inferential issues with this pseudo-extension;
• We propose a fully generative model that mimics the same social network dependency as the
existing non-generative counterparts;
• We demonstrate that our proposed generative model achieves comparable test error with MAP
estimates obtained from gradient descent as the existing non-generative counterpart;
• We demonstrate that our proposed generative model does not suffer from pathological inferential
issues under Gibbs sampling, and also achieves lower test error relative to the MAP estimate.
5.2 Previous Work
Matrix factorization with social regularization has previously been considered in the literature.
Some work considered a joint factorization of a user-item rating matrix R and a user-user social graph
G. The low-rank user features were shared between the two factorizations, imposing the social constraint
in the model [21]. The drawback to this model is that the rating matrix and the social graph are on
different scales. Typically, the entries in the rating matrix are ordinal in some subset of the integers.
Conversely, the social graph matrix is typically binary. This forces the shared user features to reconcile
two matrices that may be on significantly different scales.
Follow up work focused on a model for the user-item rating matrix R, and no probabilistic model
for the social graph G [20]. Social regularization was introduced by modifying the predicted user-item
rating ri,j to be a convex combination of local user influence and social network influence. Using the
notation previously defined, the mean of the Gaussian for the rating is now

\mathbb{E}[r_{i,j}] = \alpha\, U_i^\top V_j + (1 - \alpha) \sum_k a_{i,k}\, U_k^\top V_j,    (5.4)
where ai,k is the entry of the user adjacency matrix, as defined in the introduction to this chapter,
Equation (5.2).
This is a convex combination between a given user’s tastes in the first term and the user’s first-
degree neighbours in the second term. The parameter α was a tuning parameter that governed the
trade-off between individual taste and social taste in the prediction. While intuitively simple, there is
a single tuning parameter α shared across all users. In practice, some users may rely more on social
recommendations than others.
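A direct transcription of the predicted mean in Equation (5.4); the dimensions and value of α below are illustrative:

```python
import numpy as np

def social_prediction(i, j, U, V, A, alpha):
    """Mean rating from Equation (5.4): a convex combination of the user's
    own taste and the adjacency-weighted tastes of other users."""
    own = U[:, i] @ V[:, j]
    social = sum(A[i, k] * (U[:, k] @ V[:, j]) for k in range(U.shape[1]))
    return alpha * own + (1.0 - alpha) * social

# With alpha = 1 the prediction reduces to the plain inner product U_i^T V_j.
rng = np.random.default_rng(0)
U, V = rng.standard_normal((2, 3)), rng.standard_normal((2, 4))
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
assert np.isclose(social_prediction(0, 1, U, V, A, 1.0), U[:, 0] @ V[:, 1])
```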
More recent work also considered using the social network G as a source of regularization for the user
features [15]. As in previous work, a product of Gaussians was used to model the distribution of
the user features U_i,
p(U \mid \mathcal{G}, \tau_U, \tau_N) \propto p(U \mid \tau_U) \times p(U \mid \mathcal{G}, \tau_N)
  = \prod_{i=1}^{N} \mathcal{N}(U_i \mid 0, \tau_U I) \prod_{i=1}^{N} \mathcal{N}\Big( U_i \,\Big|\, \|a_{i,\cdot}\|^{-1} \sum_j U_j a_{i,j}, \, \tau_N I \Big).    (5.5)
Here, we define \|a_{i,\cdot}\| = \sum_j a_{i,j}. The first Gaussian penalizes the user features towards zero, while
the second penalizes towards the average of the user features of the first-degree neighbours.
While intuitively simple, this probabilistic formulation does not correspond to a proper generative
model. The model is circular. In particular, for any given user, the other user features must already exist.
One option would be to first generate the user features in the absence of the user network. Subsequent
generation would then be from the product of the two Gaussians in Equation 5.5. For iterative inferential
methods (including gradient descent, MCMC methods, variational inference), this can be achieved by
randomly generating the user features in the first iteration, and then updating according to the product
of Gaussians.
For this model, the equivalent energy function is
E = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} I_{i,j} \left( r_{i,j} - U_i^\top V_j \right)^2
  + \frac{\lambda_U}{2} \sum_{i=1}^{N} U_i^\top U_i
  + \frac{\lambda_N}{2} \sum_{i=1}^{N} \Big( U_i - \|a_i\|^{-1} \sum_{j=1}^{N} U_j a_{i,j} \Big)^\top \Big( U_i - \|a_i\|^{-1} \sum_{j=1}^{N} U_j a_{i,j} \Big)
  + \frac{\lambda_V}{2} \sum_{j=1}^{M} V_j^\top V_j.
The gradient with respect to Ui is given by,
\frac{\partial E}{\partial U_i} = -\sum_{j=1}^{M} I_{i,j} \left( r_{i,j} - U_i^\top V_j \right) V_j + \lambda_U U_i
  + \lambda_N \Big( U_i - \|a_i\|^{-1} \sum_{j=1}^{N} U_j a_{i,j} \Big)
  - \lambda_N \sum_{j : a_{i,j} = 1} a_{i,j} \Big( U_i - \|a_i\|^{-1} \sum_{j'=1}^{N} U_{j'} a_{i,j'} \Big).    (5.6)
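For concreteness, this gradient can be transcribed as follows; the implementation is an illustrative sketch with our own variable names, not the thesis code.

```python
import numpy as np

def net_user_gradient(i, R, I, U, V, A, lam_U, lam_N):
    """Gradient of the Net-model energy w.r.t. U_i, as in Equation (5.6):
    data term, ridge term, and the neighbour-mean network penalty."""
    resid = I[i] * (R[i] - U[:, i] @ V)        # residuals on observed ratings
    grad = -V @ resid + lam_U * U[:, i]
    deg = A[i].sum()
    if deg > 0:
        m_i = (U @ A[i]) / deg                 # average of first-degree neighbours
        grad += lam_N * (U[:, i] - m_i)        # pull towards the neighbour mean
        grad -= lam_N * deg * (U[:, i] - m_i)  # cross terms, summed over links
    return grad
```

For a user whose only link is one other user (deg = 1), the two network terms cancel and the gradient reduces to the plain PMF gradient, which is the special case examined in Section 5.2.2.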
5.2.1 Pseudo-Generative Extension
As mentioned, the existing model presented is not properly generative as the generation of the user
features is circular. To generate the user feature Ui for any given user, the user features for all first-
degree neighbours must first be generated. These features in turn depend on other features to be
pre-generated, including Ui in the case of a symmetric social network G.
Abusing the generative process, a pseudo-generative model can attempt to be defined according to
the following procedure:
1. First, generate the user features U1, U2, . . . , UN for N users independently;
2. Next, generate each user feature U_i conditional on the current set of user features, initially those
from the independent generation.
Given this pseudo-generative model, we can attempt inference under Gibbs sampling, albeit misguided,
to see whether it yields any desirable performance.
When this model is placed in the Bayesian probabilistic framework, it can be shown that the Gibbs
sampling distribution for Ui is given by
(U_i \mid \cdot) \sim \mathcal{N}(U_i \mid \mu_{U_i}, \Lambda_{U_i}),
\text{where } \Lambda_{U_i} = \tau \sum_j I_{i,j} V_j V_j^\top + \Lambda_U + \Lambda_U \sum_{k \neq i} \frac{a_{k,i}^2}{\|a_k\|^2}
\mu_{U_i} = \Lambda_{U_i}^{-1} \Big[ \tau \sum_j I_{i,j} V_j r_{i,j} + \Lambda_U \mu_U + \Lambda_U \|a_i\|^{-1} \sum_j U_j a_{i,j}
  + \Lambda_U \sum_{k \neq i} \frac{a_{k,i}}{\|a_k\|} \Big( U_k - \|a_k\|^{-1} \sum_{j \neq i} U_j a_{k,j} \Big) \Big].    (5.7)
5.2.2 A Special Case
Consider the case of two users, U1, and U2, who are connected to each other but not connected to any
other user. With respect to the notation introduced, a1,2 = 1, a2,1 = 1, and a1,j = a2,j = 0 for all j > 2.
The gradient of the energy function of Section 5.2 with respect to U_1 reduces to
\frac{\partial E}{\partial U_1} = -\sum_{j=1}^{M} I_{1,j} \left( r_{1,j} - U_1^\top V_j \right) V_j + \lambda_U U_1 + \lambda_N (U_1 - U_2) - \lambda_N (U_1 - U_2)
  = -\sum_{j=1}^{M} I_{1,j} \left( r_{1,j} - U_1^\top V_j \right) V_j + \lambda_U U_1.    (5.8)
This is equivalent to the gradient for the standard PMF energy function with respect to U1. In effect,
the two users are not connected. There is no additional penalty imposed for these users.
This reduction to the standard PMF framework does not occur for the sampling distributions defined
for Gibbs sampling. The mean of the sampling distribution in Equation (5.7) for this special two-user
case reduces to:
\mu_{U_1} = \Lambda_{U_1}^{-1} \Big[ \tau \sum_j I_{1,j} V_j r_{1,j} + \Lambda_U \mu_U + \Lambda_U U_2 + \Lambda_U U_2 \Big].    (5.9)
So the mean for U_1 is determined by the likelihood term from the ratings, a penalization towards
zero, and the feature U_2. The prior is now dominated by the likelihood of the ratings and the feature
U_2. For rare users, the Gibbs updates for U_1 and U_2 will continue to cycle around each other, and the
penalization towards zero may be ignored. This can lead to overfitting, as we show in the experimental
results.
5.3 Proposed Model
The existing model is not properly generative, and we have shown that the naive generative extension of
this model leads to pathological inferential behaviour. Therefore, the existing model is limited to gradient
descent methods. Inference relying on proper generative models is not possible for these models. This
is a limitation, as there are many inferential methods more powerful than simple gradient descent that rely
on proper generative models. Indeed, this was the case for extending the constrained PMF model to the
fully Bayesian framework in Chapter 2.
To resolve this pathological behaviour, we extend the hierarchical model for the user features by an
additional layer. In the first layer, each user generates a Gaussian latent feature. Given the generated
set of individual user features, each user generates a second Gaussian feature with a mean that accounts
for the network among users. The probabilistic framework for this proposed model is given by
(r_{i,j} \mid S_i, V_j, \tau) \sim \mathcal{N}(r_{i,j} \mid S_i^\top V_j, \tau)
(U_i \mid \mu_U, \Lambda_U) \sim \mathcal{N}(U_i \mid \mu_U, \Lambda_U)
(S_i \mid \mathcal{G}, \mu_S, \Lambda_S) \sim \mathcal{N}\Big( S_i \,\Big|\, \mu_S + U_i + \|a_{i,\cdot}\|^{-1} \sum_j U_j a_{i,j}, \, \Lambda_S \Big)
(V_j \mid \mu_V, \Lambda_V) \sim \mathcal{N}(V_j \mid \mu_V, \Lambda_V).    (5.10)
Here, U_i is an individual user feature, and S_i is an additional feature, which we will refer to as a
“shifted” user feature. The shifted user feature is used for prediction of the user ratings r_{i,j} for a given
user i. The additional layer allows the model to be flexible in accounting for the network information.
By adjusting the magnitude of an individual’s feature Ui relative to the first degree neighbours, and
by adjusting the noise term ΛS , the model can give preference to individual tastes over those of the
first degree neighbours. Relating back to a model with tuning parameters [20], this flexibility allows the
model to adjust the trade-off α in a probabilistic manner between user tastes and social tastes at a user
level.
If \|a_i\| = 0, then the sum is empty, and we define the prior mean for S_i to be \mu_S + U_i.
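Ancestral sampling from Equation (5.10) is straightforward. The sketch below assumes zero prior means and scaled-identity precisions for simplicity; these are simplifying assumptions of the sketch, not part of the model.

```python
import numpy as np

def sample_shift_model(A, d, M, tau=4.0, lam_U=1.0, lam_S=1.0, lam_V=1.0, seed=0):
    """Ancestral sampling from the proposed Shift model (Equation 5.10),
    with each precision matrix taken to be a scaled identity."""
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    U = rng.normal(0.0, lam_U ** -0.5, size=(d, N))   # individual user features
    V = rng.normal(0.0, lam_V ** -0.5, size=(d, M))   # item features
    S = np.empty((d, N))
    for i in range(N):
        deg = A[i].sum()
        nbr_mean = (U @ A[i]) / deg if deg > 0 else 0.0  # empty sum if no links
        S[:, i] = rng.normal(U[:, i] + nbr_mean, lam_S ** -0.5)
    R = rng.normal(S.T @ V, tau ** -0.5)              # ratings from shifted features
    return U, S, V, R
```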
In the Bayesian context, we place standard Gaussian-Wishart priors on the feature-vector hyperparameters
(\mu_U, \Lambda_U) \sim \mathcal{N}(\mu_U \mid \mu_0, \beta_0 \Lambda_U) \cdot \mathcal{W}(\Lambda_U \mid \nu_0, \Lambda_0)
(\mu_S, \Lambda_S) \sim \mathcal{N}(\mu_S \mid \mu_0, \beta_0 \Lambda_S) \cdot \mathcal{W}(\Lambda_S \mid \nu_0, \Lambda_0)
(\mu_V, \Lambda_V) \sim \mathcal{N}(\mu_V \mid \mu_0, \beta_0 \Lambda_V) \cdot \mathcal{W}(\Lambda_V \mid \nu_0, \Lambda_0).    (5.11)
5.4 Inference
The choice of a conjugate prior leads to analytically tractable Gibbs updates. The sampling distribution for the item features V_j is identical to the PMF case. In this section, we summarize the sampling
distributions for the user features and the shift features.
For the shift features Si,
(S_i \mid \cdot) \sim \mathcal{N}(S_i \mid \mu_{S_i}, \Lambda_{S_i}),
\text{where } \Lambda_{S_i} = \Lambda_S + \tau \sum_j I_{i,j} V_j V_j^\top
\mu_{S_i} = \Lambda_{S_i}^{-1} \Big[ \Lambda_S \Big( \mu_S + U_i + \|a_i\|^{-1} \sum_j U_j a_{i,j} \Big) + \tau \sum_j I_{i,j} V_j r_{i,j} \Big].    (5.12)
For the user features U_i,

(U_i \mid \cdot) \sim \mathcal{N}(U_i \mid \mu_{U_i}, \Lambda_{U_i}),
\text{where } \Lambda_{U_i} = \Lambda_U + \Lambda_S + \Lambda_S \sum_{k \neq i} \frac{a_{k,i}^2}{\|a_k\|^2}
\mu_{U_i} = \Lambda_{U_i}^{-1} \Big[ \Lambda_U \mu_U + \Lambda_S \Big( S_i - \mu_S - \|a_i\|^{-1} \sum_j U_j a_{i,j} \Big)
  + \Lambda_S \sum_{k \neq i} \frac{a_{k,i}}{\|a_k\|} \Big( S_k - \mu_S - U_k - \|a_k\|^{-1} \sum_{j \neq i} U_j a_{k,j} \Big) \Big].    (5.13)
Unlike the previous work, our model does not suffer from the pathological behaviour outlined in
Section 5.2.2. For the special two-user case, the above simplifies to
\mu_{S_1} = \Lambda_{S_1}^{-1} \Big[ \Lambda_S (\mu_S + U_1 + U_2) + \tau \sum_j I_{1,j} V_j r_{1,j} \Big]
\mu_{S_2} = \Lambda_{S_2}^{-1} \Big[ \Lambda_S (\mu_S + U_1 + U_2) + \tau \sum_j I_{2,j} V_j r_{2,j} \Big]
\mu_{U_1} = \Lambda_{U_1}^{-1} \Big[ \Lambda_U \mu_U + \Lambda_S (S_1 + S_2 - 2\mu_S - 2U_2) \Big]
\mu_{U_2} = \Lambda_{U_2}^{-1} \Big[ \Lambda_U \mu_U + \Lambda_S (S_1 + S_2 - 2\mu_S - 2U_1) \Big].    (5.14)
So for any given user, the user feature U1 is constrained by the sum of the shift features S1 +S2 and
the other user’s feature U2. This prevents the user feature U1 from being directly dependent on the user
feature U2, avoiding the pathological behaviour.
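As one concrete example of these conjugate updates, the conditional for a shift feature in Equation (5.12) is a standard Gaussian posterior and can be sampled directly. The function below is an illustrative sketch with our own variable names.

```python
import numpy as np

def gibbs_sample_S_i(i, R, I, U, V, A, mu_S, Lam_S, tau, rng):
    """Draw S_i from its conditional in Equation (5.12)."""
    d = V.shape[0]
    obs = I[i].astype(bool)
    V_obs = V[:, obs]                              # items rated by user i
    Lam_post = Lam_S + tau * V_obs @ V_obs.T       # posterior precision
    deg = A[i].sum()
    nbr_mean = (U @ A[i]) / deg if deg > 0 else np.zeros(d)
    prior_mean = mu_S + U[:, i] + nbr_mean
    b = Lam_S @ prior_mean + tau * V_obs @ R[i, obs]
    mu_post = np.linalg.solve(Lam_post, b)
    cov = np.linalg.inv(Lam_post)                  # covariance = precision^{-1}
    return rng.multivariate_normal(mu_post, cov)
```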
5.5 Experimental Setup
The following experiments were conducted to compare the performance of our proposed model against the vanilla
PMF model and the network model with dual L2 penalization of the user features [15]. Moving forward,
we refer to the model with dual penalization as the “Net model”, and our proposed model as the “Shift
model”.
To obtain MAP estimates, batch gradient descent on the equivalent energy functions for the vanilla
PMF, existing Net model, and the proposed Shift model was performed. These MAP estimates were used
to confirm the importance of using the user network, and to validate the proposed Shift model. Different
learning rates, momentum, and penalties on the features were explored in order to obtain suitable MAP
estimates. However, no grid search over these tuning parameters was completed. The end result is that
some MAP estimates may be sub-optimal.
Appropriate MAP estimates for the proposed Shift model can be obtained naively by initializing
the user, item, and shift features at random and performing gradient descent. Alternatively, the MAP
estimate for the user and item features from the baseline PMF model can be used as a starting point
for gradient descent of the Shift model.
These MAP estimates obtained were then used to initialize Gibbs samplers for all models. Inference
for all three models was performed on one set of data to highlight the pathological behaviour for the
special case outlined previously.
Four data sets were used for experimentation: Epinions, Flixster, Ciao, and Filmtrust. Summary
statistics of these four data sets are given in Table 5.1. Epinions is a data set of users rating generic
products. The other three are data sets of users rating movies. The four data sets range in orders of
magnitude for both user base and item base. On average, users have more ratings in each data set than
outgoing links, though this is not true when looking at the median per user. Flixster is an exception:
half the users have fewer than four ratings, but fewer than five outgoing links.
The notion of using user networks in the collaborative filtering framework is built on the idea that
incorporating the user network into the prior will help predictive performance. Typically, more active
users will tend to have more ratings in the system and more outgoing links. A high number of outgoing
links for users with few or no ratings may suggest a noisy network. Table 5.1 also reports the Spearman
correlation between the number of ratings and the outdegree of the users in each data set. Note that
this is highest at nearly 0.6 for Ciao, dropping to a low of 0.1 for Filmtrust. Note also that the Ciao user
network is an order of magnitude denser than the others. In the context of the data set, this means
that users tend to trust other product raters, but do not necessarily review many products themselves. This
could be an indication that the Ciao user network may be noisy.
The training / test split of the ratings was varied across data sets to ensure sparsity, in particular,
to ensure there was a sufficient number of users with no ratings and O(1) ratings in the training set.
We report the results using the train / test splits indicated in Table 5.1. Other train / test splits were
explored, with the general observation that the performance difference between the two models decreased
as the amount of data to train on increased. In any case, using the true network did not result in the
proposed model producing statistically worse results than the baseline PMF model.
For gradient descent, the 80% training set was subsequently split into a 70% training, 30% validation
set.
Similar to the results from previous chapters, we report the overall test RMSE, as well as test RMSE
for subsets of users with different frequency in the training set.
5.6 Experimental Results
Figure 5.1 illustrates the test RMSE for users of different frequency in the training set for the four data
sets. For Epinions, Flixster, and Filmtrust, the trend is for the test RMSE to decrease as the number
of ratings increases. This trend is not as strong in the Ciao test set, where there is little change as the
number of ratings increases. The trend for the RMSE to decrease as the number of ratings increases
is present in the training set, where the binned RMSE values range from 3% to 18% lower than they do
on the test set. These two observations suggest overfitting, despite the use of a
validation set to terminate gradient descent.
For all data sets, both the Net model and the proposed Shift model consistently result in lower test
Table 5.1: Summary statistics on the four data sets used for experimentation.

|                             | Epinions | Flixster | Ciao    | Filmtrust |
|-----------------------------|----------|----------|---------|-----------|
| Users                       | 49,289   | 109,218  | 7,375   | 1,508     |
| Items                       | 139,738  | 42,173   | 106,797 | 2,071     |
| Rating Matrix Sparsity      | 0.0097%  | 0.1331%  | 0.0361% | 1.1366%   |
| Mean Ratings / User         | 13       | 56       | 39      | 24        |
| Median Ratings / User       | 4        | 4        | 18      | 16        |
| User Network Sparsity       | 0.0201%  | 0.0113%  | 0.2055% | 0.0718%   |
| Mean Outdegree / User       | 10       | 12       | 15      | 1         |
| Median Outdegree / User     | 1        | 5        | 3       | 0         |
| Rating / Degree Correlation | 0.4607   | 0.2056   | 0.5910  | 0.0996    |
| Training                    | 80%      | 15%      | 50%     | 70%       |
RMSE than the baseline PMF model. The comparison between the Net model and the Shift model is
less clear. For some subsets of users in some data sets (users with two ratings in Epinions), the Net
model outperforms our proposed Shift model, while the converse is true for other subsets of users in
other data sets (ex: users with no ratings in Filmtrust). These results confirm the benefit of including
user networks as constraints in the model.
5.6.1 Pathological Network Behaviour
We initialize a Gibbs sampler using the MAP estimates for each of the three models on the Epinions data
set in order to highlight the inferential difficulties for users with low outdegree and low rating frequency.
With respect to the overall test RMSE, the Net model performs better both in mixing and final error
rate than the standard PMF model. This is illustrated in Figure 5.2 (a), where the overall test RMSE
over iterations is plotted for all three models. However, our results show that the Net model actually
performs worse for rare users, on average, than the standard PMF model after running a Gibbs sampler.
This is illustrated in Figure 5.2 (b), where the final test RMSE for users of different frequency is plotted
for the three models. In particular, the Net model now performs worse than the model with no user
network for users with O(1) ratings.
The shift features used for predicting the ratings incorporate the network dependence, and do not
suffer from this performance loss for rare users. Figure 5.2 (b) illustrates that the proposed model
outperforms the PMF model with no network for users of all frequency, even users with no ratings.
Figure 5.2 (a) shows that the added complexity of the hierarchical model has nearly the same convergence
rate as the Net model in the initial samples. In other words, the initial mixing rate of the sampler does
not appear to be affected by the extra layer of features for the users.
This illustrates that the increase in test RMSE in the Net model is driven by a subset of users: those with
a low number of ratings in the training set. In addition, there is unusual behaviour in the estimates of
the user features for users with a low number of ratings in the training set and low outdegree. This
is the special case we previously highlighted in Section 5.2.2. Let \|r_i\| = \sum_j I_{i,j} denote the frequency
of user i in the training set, and recall that \|a_i\| was defined as the outdegree of user i. Consider a
modified geometric mean of these two terms:
\sqrt{(\|r_i\| + 1) \cdot (\|a_i\| + 1)}.    (5.15)
This is the geometric mean of user outdegree and user frequency, with each incremented by 1.

Figure 5.1: Test RMSE for users of different frequency for the features learned from MAP estimation.
[Figure: panels (a) Epinions, (b) Flixster, (c) Filmtrust, (d) Ciao; each plotting test RMSE against training-set frequency bins [0-1) through [10-25) for the PMF, Net, and Shift models.]

This modified geometric mean is plotted against the user feature norms sampled in the final run for the Net
model in Figure 5.3 (a) and against the shift feature norms sampled in the final run for the proposed
Shift model in Figure 5.3 (b). With the Net model, L2 norms in excess of 20 are found for users where
this modified geometric mean is small. These L2 norms correspond to users with low outdegree and
low frequency in the training set. These L2 norms decrease by an order of magnitude for more frequent
users (as defined by this geometric mean). Such a relationship is unexpected, and is counter to the
prior. Indeed, the prior states that users with no ratings and no connections should have a Gaussian
distribution centered around zero. With few ratings, the model prior should dominate and these features
should have distributions centered close to zero. These large L2 norms for these relatively inactive users
are not probable under the model prior. Tied with the increase in test RMSE in Figure 5.2, this is
suggestive of poor generalization.
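For concreteness, the metric in (5.15) is a one-liner; the function name below is ours, not from the text:

```python
import math

def modified_geometric_mean(num_ratings, outdegree):
    # sqrt((||r_i|| + 1) * (||a_i|| + 1)): each count is incremented by 1 so
    # that users with no ratings or no links still get a finite positive score.
    return math.sqrt((num_ratings + 1) * (outdegree + 1))

# A user with no ratings and no outward links sits at the minimum value of 1.
print(modified_geometric_mean(0, 0))  # 1.0
print(modified_geometric_mean(3, 7))  # sqrt(32), about 5.66
```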
This issue of large norms is not the case for the proposed Shift model. Figure 5.3 (b) illustrates
that the norms do not have excessive magnitude for the same set of users. The vertical scale for the
Shift feature norm is approximately 0− 3, and is approximately 0− 30 for the norm of the user features
sampled under the Net model. Figure 5.3 (c) plots the absolute change in user test RMSE between the
proposed Shift model and the Net model against this same metric. This plot is symmetric about 0 when
the modified geometric mean is greater than 5, suggesting neither model provides consistently better
RMSE for these users. When it is less than five, there is a tendency for this quantity to be negative.
Figure 5.2: Test RMSE under Gibbs sampling for the Epinions data set under the PMF, Net, and proposed Shift model. (a) Overall test RMSE against iteration; (b) final test RMSE against number of ratings at training time, from [0-1) through 100+. [Plots; legends: PMF/Net/Shift in (a), Baseline/Net/Shift in (b).]
Figure 5.3: (a) User feature L2 norm under the Net model, (b) shift feature L2 norm under the Shift model, and (c) absolute change in user test RMSE in the Epinions data set, each plotted against the modified geometric mean of rating frequency and outdegree. Note that (a) and (b) are not on the same vertical scale. [Scatter plots against the modified geometric mean.]
This reflects the decrease in test RMSE for these rare users under the proposed Shift model relative to
the existing Net model.
5.6.2 Shift Model Performance
Using the MAP estimates obtained for Epinions, Flixster, and Filmtrust, Gibbs samplers were run for
500 iterations for the baseline PMF model and the proposed Shift model. The Gibbs samplers for
Ciao, when initialized from the MAP estimates obtained, were found to perform worse than starting
from random. In particular, the baseline PMF model was trapped in a local optimum, and showed no
decrease in test RMSE for several hundred iterations. This was not found for the proposed Shift model,
whose test RMSE started to decrease immediately. To present a less extreme comparison of the models for this data
set, we present results for Ciao with the Gibbs samplers started with random latent features.
In overall test RMSE, our proposed Shift model tends to have a faster convergence rate than the
vanilla PMF model. Figure 5.4 illustrates the overall error rate of our proposed Shift model and the
PMF model for the four data sets considered.
Similar gains are seen for users of different rating frequency in the training set. Figure 5.5 plots
Figure 5.4: Overall test RMSE under Gibbs sampling for the four data sets considered. The proposed Shift model consistently outperforms the baseline PMF model. (a) Epinions, (b) Flixster, (c) Ciao, (d) Filmtrust. [Plots of test RMSE against iteration for the PMF and Shift models.]
the relative change over samples in the test RMSE in the Shift model from the PMF model for users of
different frequency in the four different data sets. Here, a negative relative change means the Shift model
has a lower test RMSE for that set of users in that data set for that sample when compared to the PMF
model. The strongest results are seen in Epinions and Ciao, with consistent but smaller improvements
for Flixster and Filmtrust. The final relative gains achieved by the Shift model relative to the baseline
PMF model for users of different frequency are listed in Table 5.2. The relative changes are generally in
favour of the Shift model. The few exceptions (e.g., users with one or two ratings in Flixster) are small
(less than 0.10% for these users) and not practically significant. The largest counterexample is for
users with at least 100 ratings in the Ciao data set (a relative increase in test RMSE of 0.88%), but this is minor in
comparison to the 2.3% relative drop in test RMSE seen for much less frequent users.
5.6.3 Fake Networks
Most user networks are noisy, with user links not necessarily conveying taste similarity. To investigate
how our proposed model performs when the user network contains less than perfect information, we
look at the performance of the MAP estimate under the following cases. First, a partially observed user
network where each link is included with probability 0.5. Second, a user network with links generated at
Table 5.2: Test Error for the four data sets by user frequency

(a) Epinions

Frequency   Baseline  Shift   Relative Change
[0-1)       1.1701    1.1634  -0.57%
[1-2)       1.1313    1.1223  -0.80%
[2-3)       1.0879    1.0750  -1.19%
[3-4)       1.0739    1.0607  -1.23%
[4-5)       1.1047    1.0958  -0.81%
[5-10)      1.0763    1.0635  -1.19%
[10-25)     1.0738    1.0558  -1.68%
[25-50)     1.0715    1.0424  -2.72%
[50-100)    1.0652    1.0293  -3.37%
100+        1.0562    1.0056  -4.79%

(b) Flixster

Frequency   Baseline  Shift   Relative Change
[0-1)       1.1106    1.1075  -0.28%
[1-2)       1.0345    1.0353   0.08%
[2-3)       0.9908    0.9910   0.02%
[3-4)       0.9747    0.9740  -0.07%
[4-5)       0.9468    0.9462  -0.06%
[5-10)      0.9466    0.9454  -0.13%
[10-25)     0.9346    0.9331  -0.16%
[25-50)     0.9079    0.9077  -0.02%
[50-100)    0.8743    0.8753   0.11%
100+        0.8406    0.8395  -0.13%

(c) Ciao

Frequency   Baseline  Shift   Relative Change
[0-1)       0.8734    0.8727  -0.08%
[1-2)       0.9752    0.9399  -3.62%
[2-3)       0.9941    0.9698  -2.44%
[3-4)       1.0060    0.9783  -2.75%
[4-5)       1.0094    0.9807  -2.84%
[5-10)      1.0041    0.9719  -3.21%
[10-25)     1.0353    0.9955  -3.84%
[25-50)     1.0201    0.9785  -4.08%
[50-100)    1.0423    0.9934  -4.69%
100+        0.8043    0.8114   0.88%

(d) Filmtrust

Frequency   Baseline  Shift   Relative Change
[0-1)       0.8686    0.8657  -0.33%
[1-2)       0.8094    0.8097   0.04%
[2-3)       0.9946    0.9839  -1.08%
[3-4)       0.7002    0.6911  -1.30%
[4-5)       0.7122    0.7127   0.07%
[5-10)      0.8040    0.8039  -0.01%
[10-25)     0.7420    0.7409  -0.15%
[25-50)     0.7727    0.7711  -0.21%
[50-100)    0.8736    0.8718  -0.21%
Figure 5.5: Change in test RMSE over Gibbs samples for the proposed Shift model relative to the baseline PMF model. Negative values correspond to the proposed Shift model performing better than the baseline PMF model. (a) Epinions, (b) Flixster, (c) Ciao, (d) Filmtrust. [Plots of relative test RMSE against iteration, one curve per frequency bin: [0-1), [1-2), [2-3), [3-4), [4-5), [5-10).]
random. Many random networks were attempted. The results presented here use random networks
generated according to Algorithm 5.1.
Gradient descent on the energy functions with these networks was performed, and the resulting error
rates for users of different frequency are considered. Figure 5.6 plots the test RMSE relative to the
standard PMF model for the Shift model with random network, partially observed network, and fully
observed network. We note that the random network offers no overall advantage relative to the standard
PMF model, and performs worse for rare users.
With these MAP estimates, we run a Gibbs sampler and compare the resulting test performance.
Figure 5.8 illustrates the resulting test error for the models with no network, a partial network, the full
network, and a random network for the four data sets. There are several observations to make.
1. In three cases (Epinions, Ciao, and Filmtrust), the partially observed network performs better
than the model with no network, but worse than the full network;
2. For the fourth data set (Flixster), the partially observed and random networks are both worse than
the model with no network;
Algorithm 5.1: Generation of Random Networks

Compute P(∑j ai,j = 0), the probability of a user having no links;
Compute E[‖ai‖], the average outdegree per user;
Compute ∑i ai,j for j = 1, . . . , N, the indegree of each user;
for user i = 1, . . . , N do
    Sample p ∼ Unif(0, 1);
    if p > P(∑j ai,j = 0) then
        Sample links for the current user with probability proportional to the indegree of all other users;
    else
        Generate no links for user i;
    end
end
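A runnable sketch of Algorithm 5.1 might look as follows. The number of links drawn for each linked user (the rounded average outdegree) and the +1 smoothing of zero indegrees are our assumptions; the algorithm as stated leaves these details open.

```python
import random

def random_network(adj, seed=0):
    # Sketch of Algorithm 5.1 for a directed network given as a dict
    # mapping each user to the set of users they follow.
    rng = random.Random(seed)
    users = list(adj)
    n = len(users)
    p_zero = sum(1 for u in users if not adj[u]) / n         # P(no links)
    avg_out = max(1, round(sum(len(adj[u]) for u in users) / n))
    indegree = {u: 0 for u in users}
    for u in users:
        for v in adj[u]:
            indegree[v] += 1
    fake = {}
    for u in users:
        if rng.random() > p_zero:
            # sample links with probability proportional to indegree
            others = [v for v in users if v != u]
            weights = [indegree[v] + 1 for v in others]      # +1 avoids zero weights
            fake[u] = set(rng.choices(others, weights=weights,
                                      k=min(avg_out, len(others))))
        else:
            fake[u] = set()
    return fake

real = {1: {2, 3}, 2: set(), 3: {1}, 4: {1}}
fake = random_network(real)
print(sorted(fake))  # [1, 2, 3, 4]
```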
Figure 5.6: Test RMSE relative to the PMF model for the random, partially observed, and fully observed network. [Plot against rating-frequency bins [0-1) through 100+; legend: Partial Net, Full Net, Fake Net.]
3. In three cases (again: Epinions, Ciao, and Filmtrust), the random network performs better than
no network, but worse than any real network.
The first is expected. Partial new information is being introduced. It is expected that this would
improve over the model with no network, but carry less information than the full network.
The second is suggestive of overfitting. The relative performance of the four models has changed
order between the training set and the test set. The partial and random networks, achieving the best error
rates in the training set, are achieving the worst error rates in the test set (worse than no network).

The third is not expected. There is no reason a priori to believe that random noisy links will
improve over a model with no links. There are a few possible explanations for this, which we explore in
the following section.
Random Links
The experimental results indicated that completely random networks perform better than no network
in several data sets. There are some possible explanations for this.
First, there is a positivity bias in the ratings for all of these data sets, as is common with recommender
systems [24, 23]. In other words, users tend to provide explicit (positive) feedback for those items that
they are interested in and enjoy. Given this, there is a weak correlation over ratings in the entire data
set. This weak correlation may be encoded in the random network. Although not as predictive as the
real network, the random network is providing information to the Gibbs sampler.
Second, the model proposed constrains any two pairs of users. To illustrate, suppose we have four
users, and the adjacency graph is cyclic. That is, the adjacency graph is given by
\begin{align}
A = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}. \tag{5.16}
\end{align}
In this case, the features are pairwise constrained. That is, U1 is constrained to U2, which is con-
strained to U3, which is in turn constrained to U4, and back to U1. Random networks can include, for
instance,
\begin{align}
A_1 = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix},\quad
A_2 = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 \end{pmatrix},\quad
A_3 = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}. \tag{5.17}
\end{align}
Each of these three random networks is probable under Algorithm 5.1. However, each of these
networks is also formed from second, third, or fourth degree connections in the real network. The
transitivity of the constraints under the real model means, for instance, that the constraint between U1
and U4 imposed in A1 is weakly present in the real network A. This may be another possible explanation
why the random networks are performing better than no network.
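This transitivity can be checked mechanically: treating the adjacency matrix as boolean, the k-th power of A marks k-th degree connections. For the cyclic network of (5.16) (0-indexed below), the link U1 → U4 imposed by A1 is a third-degree connection in A:

```python
def bool_matmul(A, B):
    # boolean matrix product: result[i][j] is True iff A[i][k] and B[k][j] for some k
    n = len(A)
    return [[any(A[i][k] and B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Cyclic network of (5.16): user 1 -> 2 -> 3 -> 4 -> 1 (0-indexed here)
A = [[0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1],
     [1, 0, 0, 0]]
A_pow2 = bool_matmul(A, A)       # second-degree connections
A_pow3 = bool_matmul(A_pow2, A)  # third-degree connections

print(A_pow3[0][3])  # True: U1 reaches U4 in exactly three hops
```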
Third, a much more ambitious explanation is that the sampler is able to “learn” what the right
constraints should be from the data. Recall that the user shift features used for rating prediction are
modeled as Gaussians, with mean

E[Si] = Ui + ‖ai‖−1 ∑j Uj ai,j.
Following the same notation, suppose we have two networks:

1. A real network A with links ai,j;

2. A random network Ã with links ãi,j.

The difference between the shift under the true network and the random network is

‖ai‖−1 ∑j Uj ai,j − ‖ãi‖−1 ∑j Uj ãi,j. (5.18)
The first term is the shift that should occur under the true network, while the second term is the
shift that will occur with the random network being observed. For the Gibbs run with the random
network, we compute this term for each user in each Gibbs sample. The L2 norms are computed for
each user i at each iteration t, and normalized by the norm of the shift feature Si(t). For each iteration t, quantiles
of these differences are computed, and we plot the value of these quantiles against iteration in Figure
5.9 for the four data sets. For two data sets where the random network model outperforms the vanilla
PMF model with no network (Epinions and Ciao), these norms drop quickly over samples. For a third
(Filmtrust), these remain large, but the difference between the random network and the vanilla PMF
model in test error is small (Figure 5.7). For the data set where the random network performs worse than
the vanilla PMF model, these quantiles are larger than the corresponding quantiles under the Epinions and Ciao data sets.
This suggests that the model may be able to infer the correct shift needed based on the observed data.
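The per-user quantity just described can be sketched as follows; the data structures and names are illustrative:

```python
import math

def shift_difference_norms(U, real_net, fake_net, shifts):
    # For each user, the L2 norm of the difference (5.18) between the
    # network-induced shift under the real and the random network,
    # normalized by the norm of the sampled shift feature S_i.
    # U and shifts map user -> feature vector; the nets map user -> set
    # of followed users.
    def net_shift(u, net):
        dim = len(U[u])
        if not net[u]:
            return [0.0] * dim
        s = [0.0] * dim
        for v in net[u]:
            for d in range(dim):
                s[d] += U[v][d]
        return [x / len(net[u]) for x in s]

    norms = {}
    for u in U:
        real = net_shift(u, real_net)
        fake = net_shift(u, fake_net)
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(real, fake)))
        denom = math.sqrt(sum(x * x for x in shifts[u])) or 1.0
        norms[u] = diff / denom
    return norms

U = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
real = {1: {2}, 2: {3}, 3: {1}}
fake = {1: {3}, 2: {3}, 3: {2}}
norms = shift_difference_norms(U, real, fake, shifts=U)
print(norms[2])  # 0.0 -- user 2 has the same links in both networks
```

Per-iteration quantiles of these values (e.g., via `statistics.quantiles`) give the curves plotted in Figure 5.9.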
It should be noted from Table 5.1 that the data set where the test error for the random network
is not better than that of the model with no network, Flixster, is also the one with the least dense user network. It also has the
largest user base. In addition, Table 5.1 indicates that the Flixster data set has the lowest correlation
between the number of ratings a user has and the outdegree. The low density and large user base means
that the random network generated, which simulates the density and properties of the real network, will
be less likely to have second and third degree connections. This supports the second theory. The low
correlation between the number of ratings and the outdegree means that the network may be dominating
over the ratings in the model.
Figure 5.7: Training RMSE under Gibbs sampling for the four data sets under the Shift model with fully observed user network, partially observed user network, and completely random user network. The test RMSE for the baseline PMF model is included. (a) Epinions, (b) Flixster, (c) Ciao, (d) Filmtrust. [Plots of RMSE against iteration; legend: Baseline, Half Truth, Full Truth, False.]
Figure 5.8: Test RMSE under Gibbs sampling for the four data sets under the Shift model with fully observed user network, partially observed user network, and completely random user network. The test RMSE for the baseline PMF model is included. (a) Epinions, (b) Flixster, (c) Ciao, (d) Filmtrust. [Plots of test RMSE against iteration; legend: Baseline, Half Truth, Full Truth, False.]
Two Asymmetric User Sets
To test the theory of weak correlation among all ratings submitted, the data can be modified to create
two user sets with asymmetric tastes. Consider the following experiment. For each (user, item, rating)
triplet, we randomly decide to keep the rating as is, or to “flip” the rating to the opposite (a
5 becomes a 1, a 4 becomes a 2, etc.). If the rating is flipped, we replace the user index u with 2u. This
effectively duplicates each user by creating one with the opposite rating patterns. Each link in the user
network (u1, u2) was modified to (u1, 2u2). Effectively, this has each user following the duplicate user
with completely opposite rating patterns. This represents an extreme case where there are two subsets
of users with asymmetric tastes. User links exist between the two user subsets, but not within
them. There is a positivity bias among one set of users, but a negativity bias among the other set of
users.
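The construction of the duplicated, taste-flipped data can be sketched as follows (names are ours; ratings are assumed to lie on a 1-5 scale, and user indices are assumed chosen so that 2u does not collide with an existing user):

```python
import random

def duplicate_with_flips(ratings, links, max_rating=5, seed=0):
    # Create two user subsets with asymmetric tastes: each rating is kept
    # as is with probability 1/2, or flipped (5 -> 1, 4 -> 2, ...) and
    # reassigned to the duplicate user 2u. Every network link (u1, u2) is
    # redirected to the flipped copy of u2.
    rng = random.Random(seed)
    new_ratings = []
    for (u, i, r) in ratings:
        if rng.random() < 0.5:
            new_ratings.append((2 * u, i, max_rating + 1 - r))
        else:
            new_ratings.append((u, i, r))
    new_links = [(u1, 2 * u2) for (u1, u2) in links]
    return new_ratings, new_links

ratings = [(1, 10, 5), (3, 11, 2)]
links = [(1, 3)]
new_ratings, new_links = duplicate_with_flips(ratings, links)
print(new_links)  # [(1, 6)]: user 1 now follows the flipped copy of user 3
```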
Under this pathological user network, the shift model performs worse than the baseline overall.
Figure 5.10 (a) illustrates the test error over sampling runs. The performance of the model with no
Figure 5.9: Quantiles of the relative L2 magnitude of the difference in the network-induced shift factor in the fake network case from what it would be if the true network was used. (a) Epinions, (b) Flixster, (c) Ciao, (d) Filmtrust. [Plots of the normalized shift difference against iteration; quantile curves at 0.05, 0.25, 0.50, 0.75, 0.95.]
network is also included. The relative difference between the two does shrink marginally over samples,
but the model with no network still performs better in test RMSE.
The same performance is present when we examine subsets of users with different frequency in
the training set. Figure 5.10 (b) illustrates the test RMSE for users with different number of ratings
at training time. The model with no network outperforms the model with the pathological network
between the two contrived sets of users for all frequencies. In particular, the relative difference is most
notable for the least frequent users, shrinking as the user becomes more frequent. This is expected, given
that the rating data will dominate over the data from the user network as the number of ratings increases.
5.7 Conclusion
We have reviewed existing work on matrix factorization models that make use of user networks as
constraints in probabilistic priors. This model uses the user network to modify the prior mean for each
user feature. We reviewed the form of gradient descent update of the user features under the equivalent
Figure 5.10: (a) Overall test RMSE and (b) test RMSE binned by user frequency in the case of two asymmetric user sets. Here, the Shift model performs worse than the baseline PMF model, with the largest difference coming from rare users. [Plots; legend: Baseline, Shift with User Duplication.]
energy function, and highlighted how the updates reduce to the baseline PMF model in the special case
of two intra-connected users who have no outward connections. Placing the existing model in the Gibbs
sampling framework, we highlighted how the existing model is prone to overfitting for such users, and
validated this analytical result with experimental results using the Epinions data set. We proposed
an alternative model based on a two-level hierarchy of features for users that avoids this issue. We
validated the performance of our proposed model against the baseline PMF model using Gibbs sampling
for multiple data sets. Our proposed model consistently has a higher drop in test RMSE over iterations
than the baseline PMF model, and frequently converges to a lower test RMSE than the baseline PMF
model.
Testing the performance of our proposed model in the presence of partial network information shows
only minor performance degradation, as measured by an increase in test RMSE over Gibbs samples,
compared to in the presence of full network truth. The performance under partial network information
was still superior to the performance under the baseline PMF model. Additional simulations with
completely random user networks show increasing performance degradation, as measured by test RMSE
over samples, compared to partial and full network truth. However, our proposed model under completely
random networks still performs significantly better than the baseline PMF model. The exact reason for
this is an open question and the focus of future work.
Chapter 6
Conclusion
Originally known best for video streaming services, recommender systems have evolved into a tool with
general applications in preference matching and information retrieval. These applications extend to
friend suggestion, music suggestion, news aggregation, online dating, and general
preference matching. Collaborative-based systems have emerged as a common implementation as they
are scalable, efficient to learn, and are suited to a mix of media.
Many variants of the Gaussian probabilistic matrix factorization model have been proposed in the
literature, almost always reporting test performance superior to the baseline PMF model. Very little work
has addressed the issue of performance with respect to the amount of information a user provides. Indeed,
if a user is very active in the system, the predictions of these PMF models can be trusted. However,
much simpler models can perform nearly as well with large amounts of ratings for a particular user.
It can also be argued that these frequent users should not be the target of improved recommendation.
They are already active and committed to the system. More focus should be given to rare users in
the system, those with few to no ratings in the system. Not only do these users compose the majority
of users in the system, they are the ones that need to be given useful suggestions in order to increase
activity and engagement.
In Chapter 2, we reviewed an existing Constrained PMF model. We also reviewed two approaches
to inference common in the literature: Gibbs sampling and variational inference. We extended this
constrained PMF model to the fully Bayesian framework, and demonstrated that Gibbs sampling under
the fully Bayesian model uniformly outperforms MAP estimation for users of different frequency in the
system. In comparing Gibbs sampling and variational inference, we found cause to advocate for Gibbs
sampling in order to avoid overfitting and issues with tuning parameters.
In Chapter 3, we reviewed existing work on heteroskedastic PMF models. We demonstrated that the
gains previously reported were not coming from rare users at all. When using variational inference to
learn the model, we discovered overfitting and poor generalization. We proposed a truncated precision
model to overcome this overfitting issue in the variational context, and illustrated how the existing
heteroskedastic model and the baseline PMF model arise as limiting cases. We compared the performance
of the truncated model to both, and illustrated that sensible bounds can improve upon both.
In Chapter 4, demographic and other personal attributes were introduced as constraints for user
features. This was based on work from an experiment with a different context with an analogous goal
(predicting personal attributes from public information on Facebook). We demonstrated that these
models perform nearly uniformly better than the baseline PMF model for users of different frequency.
Given the sparse nature of the demographics, we proposed a PCA-based approach that reduces the
number of parameters in the model and still achieves superior performance.
In Chapter 5, we reviewed recent work on Social Recommendation Systems. These are models that
make use of an existing user network in the recommendation framework. We focused on a matrix
factorization model in particular, inline with the other models discussed. This model was demonstrated
to improve prediction for users, but was not generative, and so could not easily be extended to the
Bayesian framework. We proposed a fully Bayesian model in the same spirit as the existing model,
demonstrated that similar performance is achieved with MAP estimates, and demonstrated that Gibbs
sampling improves upon the MAP estimates, and over the baseline. Further, we illustrated the ability
of the model to outperform the baseline PMF model in the presence of contaminated user networks.
Questions still remain for most of these extensions. Among them:
• Is there a sensible way to adapt the bounds of the precisions in the heteroskedastic model, and
will this improve performance? In our experimental results, we demonstrated that, for bounds of
(1/n, n) for integer n, sensible n can be selected based on the scale of the ratings. Can this be
algorithmically selected?
• What demographics are the most predictive of underlying similarity in tastes, and can these be
learned in a supervised, partially supervised, or unsupervised manner? We found the gain for the
Flixster data set was larger than for the MovieLens data set. The only difference between the two
was the bins selected for the age range, and the inclusion of the user’s occupation in the MovieLens
data set.
• How does the tradeoff between contamination in the user network and gains in test prediction work?
We illustrated an extreme case of two subsets of users with opposite tastes, where the inclusion of
a user network results in performance worse than the baseline model. However, partially observed
and fully contaminated networks were still often better than no network at all. While there are
possible explanations for this, there is no certainty.
Future work can address these questions. In addition, how can these methods be combined? Ensemble
methods were proven to be successful in the Netflix competition, and would be the most direct way to
combine the predictions. However, it is possible to imagine a model combining the ideas of Chapter 4 and
Chapter 5 where demographics shift individual user features, with an additional layer of user features
being shifted based on a user network.
In addition, recent advances in deep learning have popularized the field, and deep learning methods
are now being applied to a variety of problems. Indeed, this thesis began with a review of current
approaches, including some (R)BM variants, which are “building blocks” of deep generative models.
For more media-rich domains (e.g., music, photos, videos), deep learning models are commonly used for
feature extraction. The feature output of such models can be used as auxiliary information and as
inputs into matrix factorization models.
Appendices
Appendix A
Ancillary Results and Derivations
In this appendix, we provide a series of derivations and ancillary results needed to derive the given
results. Primarily, they are needed to obtain the variational lower bound and the conditionals for the
variables of interest.
A.1 Squared Error Term
In the derivation of the conditionals of the feature vectors, it was necessary to expand the squared error
term (ri,j − r̂i,j)² and rewrite it as constants plus a quadratic in terms of Ui, Vj, and Wk. We give these
three derivations here. For notational convenience, we suppress the bias terms, absorbing both γi and
ηj into ri,j.
A.1.1 Quadratic with Respect to User Features
In terms of the user feature vectors,

\begin{align}
(r_{i,j} - \hat{r}_{i,j})^2
&= \bigg[r_{i,j} - \Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)^{\top} V_j\bigg]^2 \nonumber\\
&= \bigg[\Big(r_{i,j} - \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k^{\top} V_j\Big) - \delta_U U_i^{\top} V_j\bigg]^2 \nonumber\\
&= \Big(r_{i,j} - \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k^{\top} V_j\Big)^2
- 2\delta_U \Big(r_{i,j} - \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k^{\top} V_j\Big) V_j^{\top} U_i
+ \delta_U^2\, U_i^{\top} V_j V_j^{\top} U_i, \tag{A.1}
\end{align}

where the last line follows as $V_j^{\top} U_i U_i^{\top} V_j = U_i^{\top} V_j V_j^{\top} U_i$.
A.1.2 Quadratic with Respect to Item Features
In terms of the item feature vectors,

\begin{align}
(r_{i,j} - \hat{r}_{i,j})^2
&= r_{i,j}^2 - 2 r_{i,j}\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)^{\top} V_j \nonumber\\
&\quad + V_j^{\top}\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)^{\top} V_j. \tag{A.2}
\end{align}
A.1.3 Quadratic with Respect to Side Features
Finally, in terms of the side feature vector $W_m$, we have

\begin{align}
(r_{i,j} - \hat{r}_{i,j})^2
= \bigg[r_{i,j} - \Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k \neq m} I_{i,k} W_k\Big)^{\top} V_j - \frac{\delta_W}{n_i} I_{i,m} W_m^{\top} V_j\bigg]^2. \tag{A.3}
\end{align}

Let $\hat{r}_{i,j,-W_m} = \big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k \neq m} I_{i,k} W_k\big)^{\top} V_j$ denote the prediction made without $W_m$. Then,

\begin{align}
(r_{i,j} - \hat{r}_{i,j})^2
&= \Big[(r_{i,j} - \hat{r}_{i,j,-W_m}) - \frac{\delta_W}{n_i} I_{i,m} W_m^{\top} V_j\Big]^2 \nonumber\\
&= (r_{i,j} - \hat{r}_{i,j,-W_m})^2
- 2\frac{\delta_W}{n_i} I_{i,m}\,(r_{i,j} - \hat{r}_{i,j,-W_m})\, V_j^{\top} W_m
+ \frac{\delta_W^2}{n_i^2}\, I_{i,m}^2\, W_m^{\top} V_j V_j^{\top} W_m. \tag{A.4}
\end{align}
A.2 Expectation of Certain Forms
A.2.1 Expectation of Quadratic Forms
Let $x$ be a random vector with mean $\mu$ and covariance matrix $\Sigma$, and let $\Lambda$ be a symmetric matrix. Then

\begin{align}
\mathbb{E}[x^{\top}\Lambda x] = \operatorname{tr}(\Lambda\Sigma) + \mu^{\top}\Lambda\mu. \tag{A.5}
\end{align}
Combined with iterated expectation, this is used to find some expectations in the variational lower
bound. An alternative is to expand the quadratic, which we give an example of below using the user
feature quadratic form.
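Identity (A.5) can be checked by simulation; the particular $\mu$, $\Sigma$, and $\Lambda$ below are arbitrary test values:

```python
import math
import random

random.seed(0)
mu = (1.0, 2.0)
# Sigma = [[1, 0.5], [0.5, 2]], with lower-triangular Cholesky factor L
L = ((1.0, 0.0), (0.5, math.sqrt(1.75)))
lam = (2.0, 1.0)  # Lambda = diag(2, 1), a symmetric matrix

n = 200_000
acc = 0.0
for _ in range(n):
    z0, z1 = random.gauss(0, 1), random.gauss(0, 1)
    x0 = mu[0] + L[0][0] * z0                   # x = mu + L z  ~  N(mu, Sigma)
    x1 = mu[1] + L[1][0] * z0 + L[1][1] * z1
    acc += lam[0] * x0 * x0 + lam[1] * x1 * x1  # x^T Lambda x for diagonal Lambda
mc = acc / n

# Closed form: tr(Lambda Sigma) + mu^T Lambda mu = (2 + 2) + (2 + 4) = 10
print(mc)
```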
A.2.2 User Quadratic Form
In computing the variational lower bound, we need to consider the expectation of quadratic forms such as
\[
\mathbb{E}_Q[(U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U)]
= \mathbb{E}_Q[U_i^\top \Lambda_U U_i] - 2\,\mathbb{E}_Q[U_i^\top \Lambda_U \mu_U] + \mathbb{E}_Q[\mu_U^\top \Lambda_U \mu_U],
\tag{A.6}
\]
which appear from the priors placed on the user, item, and side features. We compute the expectation term by term.

For the first term,
\[
\begin{aligned}
\mathbb{E}_Q[U_i^\top \Lambda_U U_i]
&= \mathbb{E}_Q\big[\mathbb{E}_Q[U_i^\top \Lambda_U U_i \mid \Lambda_U]\big] \\
&= \mathbb{E}_Q\big[\operatorname{tr}(\Lambda_U \Lambda_{U_i}^{-1}) + \mu_{U_i}^\top \Lambda_U \mu_{U_i}\big] \\
&= \operatorname{tr}\big(\mathbb{E}_Q[\Lambda_U]\,\Lambda_{U_i}^{-1}\big) + \mu_{U_i}^\top \mathbb{E}_Q[\Lambda_U]\,\mu_{U_i} \\
&= \bar\nu_U \operatorname{tr}(\bar W_U \Lambda_{U_i}^{-1}) + \bar\nu_U\, \mu_{U_i}^\top \bar W_U \mu_{U_i}.
\end{aligned}
\]
For the second term,
\[
\begin{aligned}
\mathbb{E}_Q[U_i^\top \Lambda_U \mu_U]
&= \mathbb{E}_Q\big[\mathbb{E}_Q[U_i^\top \Lambda_U \mu_U \mid \Lambda_U, \mu_U]\big] \\
&= \mathbb{E}_Q\big[\mathbb{E}_Q[U_i]^\top \Lambda_U \mu_U\big] \\
&= \mu_{U_i}^\top \mathbb{E}_Q\big[\mathbb{E}_Q[\Lambda_U \mu_U \mid \Lambda_U]\big] \\
&= \mu_{U_i}^\top \mathbb{E}_Q[\Lambda_U]\,\bar\mu_U \\
&= \bar\nu_U\, \mu_{U_i}^\top \bar W_U \bar\mu_U.
\end{aligned}
\]
For the final term,
\[
\begin{aligned}
\mathbb{E}_Q[\mu_U^\top \Lambda_U \mu_U]
&= \mathbb{E}_Q\big[\mathbb{E}_Q[\mu_U^\top \Lambda_U \mu_U \mid \Lambda_U]\big] \\
&= \mathbb{E}_Q\big[\operatorname{tr}(\Lambda_U \bar\Lambda_U^{-1}) + \bar\mu_U^\top \Lambda_U \bar\mu_U\big] \\
&= \operatorname{tr}\big(\mathbb{E}_Q[\Lambda_U]\,\bar\Lambda_U^{-1}\big) + \bar\mu_U^\top \mathbb{E}_Q[\Lambda_U]\,\bar\mu_U \\
&= \bar\nu_U \operatorname{tr}(\bar W_U \bar\Lambda_U^{-1}) + \bar\nu_U\, \bar\mu_U^\top \bar W_U \bar\mu_U.
\end{aligned}
\]
Together, the three terms give
\[
\mathbb{E}_Q[(U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U)]
= \bar\nu_U\Big[(\mu_{U_i} - \bar\mu_U)^\top \bar W_U (\mu_{U_i} - \bar\mu_U)
+ \operatorname{tr}\big(\bar W_U(\Lambda_{U_i}^{-1} + \bar\Lambda_U^{-1})\big)\Big],
\tag{A.7}
\]
where bars denote the parameters of the variational Normal-Wishart distribution for $(\mu_U, \Lambda_U)$. Similar expressions hold for the items and the side features.
A.2.3 Gamma Random Variable Expectation
If $X \sim \mathcal{G}(\alpha, \beta)$, with pdf $f_X(x \mid \alpha, \beta) \propto x^{\alpha-1} e^{-\beta x}$, then
\[
\mathbb{E}[\log X] = \psi(\alpha) - \log\beta,
\]
where $\psi(\cdot) = \frac{d}{dx}\log\Gamma(x)$ is the digamma function. This result is important in computing the contribution to the variational lower bound from the user, item, and global precisions.
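This expectation can be checked by Monte Carlo; the following Python sketch (illustrative only; note that NumPy's gamma sampler takes a scale, i.e. $1/\beta$) compares the closed form against a sample average:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(1)
alpha, beta = 3.5, 2.0   # shape alpha, rate beta (pdf proportional to x^{alpha-1} e^{-beta x})

# Closed form: E[log X] = psi(alpha) - log(beta).
exact = digamma(alpha) - np.log(beta)

# Monte Carlo: numpy parametrizes the Gamma by a scale parameter, i.e. 1/beta.
x = rng.gamma(shape=alpha, scale=1.0 / beta, size=500_000)
mc = np.log(x).mean()

assert abs(mc - exact) < 0.01
```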
A.2.4 Wishart Random Variable Expectation
If $X \sim \mathcal{W}(n, V)$ is a $p \times p$ Wishart random matrix, with pdf $f_X(X \mid n, V) \propto |X|^{(n-p-1)/2} e^{-\operatorname{tr}(V^{-1}X)/2}$, then
\[
\mathbb{E}[\log|X|] = \sum_{i=1}^{p} \psi\Big(\frac{n+1-i}{2}\Big) + p\log 2 + \log|V|.
\]
Like the last result, this is necessary to compute the variational lower bound, as it appears from the conjugate Normal-Wishart priors.
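The same kind of Monte Carlo check works here. The following Python/SciPy sketch (illustrative only) compares the closed-form log-determinant expectation against a sample average over Wishart draws:

```python
import numpy as np
from scipy.special import digamma
from scipy.stats import wishart

p, n = 3, 7
V = np.diag([1.0, 2.0, 0.5])   # scale matrix

# Closed form: E[log|X|] = sum_i psi((n+1-i)/2) + p log 2 + log|V|.
exact = sum(digamma((n + 1 - i) / 2) for i in range(1, p + 1)) \
        + p * np.log(2) + np.log(np.linalg.det(V))

# Monte Carlo over X ~ W(n, V).
X = wishart(df=n, scale=V).rvs(size=20_000, random_state=2)
mc = np.mean(np.linalg.slogdet(X)[1])

assert abs(mc - exact) < 0.05
```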
Appendix B
Constrained PMF
In this appendix, we derive the conditional distributions of the features given the observed rating data in the presence of the side information from Chapter 2. The inclusion of side information into the model shifts the mean of the user features, so the conditional for the user features is also rederived. The conditional for the item features follows by substituting the combination of user and side features for the user features in the original derivation from [31].

For notational convenience, we suppress the offsets, absorbing $\gamma_i$ and $\eta_j$ into $r_{i,j}$.
B.0.5 Conditional Posterior for Side Feature
The inclusion of the side information $W_m$ complicates the log-likelihood contribution to the log posterior. The square in the exponent of the Gaussian for $r_{i,j}$ becomes
\[
\sum_{i=1}^{N}\sum_{j=1}^{M} -\frac{I_{i,j}\alpha_i\beta_j\tau}{2}
\Big[r_{i,j} - \Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)^{\!\top} V_j\Big]^2,
\tag{B.1}
\]
where $n_i = \sum_{k=1}^{M} I_{i,k}$. Using the properties of the transpose and expanding the square yields
\[
\sum_{i=1}^{N}\sum_{j=1}^{M} -\frac{I_{i,j}\alpha_i\beta_j\tau}{2}
\Big[r_{i,j}^2 - 2 r_{i,j} V_j^\top\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)
+ V_j^\top\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)^{\!\top} V_j\Big].
\tag{B.2}
\]
Expanding the quadratic in the final term and dropping terms independent of $W_m$, we obtain
\[
\sum_{i=1}^{N}\sum_{j=1}^{M} -\frac{I_{i,j}\alpha_i\beta_j\tau}{2}
\Big[-2 r_{i,j}\delta_W \frac{I_{i,m}}{n_i} V_j^\top W_m
+ 2\delta_U\delta_W \frac{I_{i,m}}{n_i} V_j^\top U_i\, W_m^\top V_j
+ \delta_W^2 V_j^\top \Big(\frac{I_{i,m} W_m}{n_i} + \frac{\sum_{k\neq m} I_{i,k} W_k}{n_i}\Big)\Big(\frac{I_{i,m} W_m}{n_i} + \frac{\sum_{k\neq m} I_{i,k} W_k}{n_i}\Big)^{\!\top} V_j\Big].
\tag{B.3}
\]
Note the sum over $W_k$ has been separated into the term involving $W_m$ and the sum over the other $W_k$, $k \neq m$.

Rearranging vectors to isolate the terms linear and quadratic in $W_m$,
\[
\sum_{i=1}^{N}\sum_{j=1}^{M} -\frac{I_{i,j}\alpha_i\beta_j\tau}{2}
\Big[-2\delta_W r_{i,j}\frac{I_{i,m}}{n_i} V_j^\top W_m
+ 2\delta_U\delta_W \frac{I_{i,m}}{n_i} U_i^\top V_j V_j^\top W_m
+ \delta_W^2 \Big(\frac{I_{i,m}}{n_i}\Big)^{\!2} W_m^\top V_j V_j^\top W_m
+ 2\delta_W^2 \frac{I_{i,m}}{n_i}\Big(\frac{\sum_{k\neq m} I_{i,k} W_k}{n_i}\Big)^{\!\top} V_j V_j^\top W_m\Big].
\tag{B.4}
\]
Adding the log prior $-\tfrac{1}{2}(W_m - \mu_w)^\top \Lambda_w (W_m - \mu_w)$ and grouping terms linear and quadratic in $W_m$, we obtain the system
\[
\begin{aligned}
\Lambda_{W_m} &= \Lambda_w + \delta_W^2 \tau \sum_{i=1}^{N}\sum_{j=1}^{M} \frac{I_{i,j} I_{i,m}\,\alpha_i\beta_j}{n_i^2}\, V_j V_j^\top \\
\mu_{W_m} &= \Lambda_{W_m}^{-1}\Big[\Lambda_w \mu_w
+ \tau \sum_{i=1}^{N}\sum_{j=1}^{M} I_{i,j}\alpha_i\beta_j
\Big(\delta_W \frac{I_{i,m}}{n_i}\, r_{i,j} V_j - \delta_U\delta_W \frac{I_{i,m}}{n_i}\, V_j V_j^\top U_i
- \delta_W^2 \frac{I_{i,m}}{n_i^2}\, V_j V_j^\top \sum_{k\neq m} I_{i,k} W_k\Big)\Big].
\end{aligned}
\tag{B.5}
\]
Rewriting,
\[
\mu_{W_m} = \Lambda_{W_m}^{-1}\Big[\Lambda_w \mu_w + \delta_W \tau \sum_{\substack{(i,j):\\ I_{i,j} I_{i,m} = 1}} \frac{\alpha_i\beta_j}{n_i}\, V_j
\Big((r_{i,j} - \delta_U V_j^\top U_i) - \delta_W V_j^\top \frac{\sum_{k\neq m} I_{i,k} W_k}{n_i}\Big)\Big].
\tag{B.6}
\]
This can be re-expressed in a more compact form by defining the prediction made without $W_m$ as
\[
\hat r_{i,j,-W_m} = \Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k\neq m} I_{i,k} W_k\Big)^{\!\top} V_j.
\]
The resulting term $V_j (r_{i,j} - \hat r_{i,j,-W_m})$ can be interpreted as the $j$th item feature vector weighted by the prediction error made when the $m$th side feature is omitted.

This shorthand allows Equation (B.6) to be expressed as
\[
\mu_{W_m} = \Lambda_{W_m}^{-1}\Big[\Lambda_w \mu_w + \delta_W \tau \sum_{\substack{(i,j):\\ I_{i,j} I_{i,m} = 1}} \frac{\alpha_i\beta_j}{n_i}\, V_j\,(r_{i,j} - \hat r_{i,j,-W_m})\Big].
\tag{B.7}
\]
B.0.6 Conditional Posterior for User Feature
With the inclusion of side features $W_k$, the log posterior for $U_i$ becomes
\[
\log p(U_i \mid \cdots) = -\frac{\tau\alpha_i}{2}\sum_{j=1}^{M} I_{i,j}\beta_j (r_{i,j} - \hat r_{i,j})^2
- \frac{1}{2}(U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U) + \text{const}.
\tag{B.8}
\]
Expanding the squared term as in Section A.1.1 yields
\[
(r_{i,j} - \hat r_{i,j})^2
= \Big(r_{i,j} - \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k^\top V_j\Big)^{\!2}
- 2\delta_U \Big(r_{i,j} - \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k^\top V_j\Big) V_j^\top U_i
+ \delta_U^2\, U_i^\top V_j V_j^\top U_i.
\tag{B.9}
\]
Plugging into Equation (B.8) and dropping terms not involving $U_i$ yields
\[
\log p(U_i \mid \cdots) = -\frac{\tau\alpha_i}{2}\sum_{j=1}^{M} I_{i,j}\beta_j
\Big[-2\delta_U \Big(r_{i,j} - \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k^\top V_j\Big) V_j^\top U_i
+ \delta_U^2\, U_i^\top V_j V_j^\top U_i\Big]
- \frac{1}{2}(U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U) + \text{const}.
\tag{B.10}
\]
This shows that the conditional posterior for $U_i$ is Gaussian with parameters
\[
\begin{aligned}
\Lambda_{U_i} &= \Lambda_U + \delta_U^2\, \tau\alpha_i \sum_{j=1}^{M} I_{i,j}\beta_j V_j V_j^\top \\
\mu_{U_i} &= \Lambda_{U_i}^{-1}\Big[\Lambda_U \mu_U + \delta_U\, \tau\alpha_i \sum_{j=1}^{M} I_{i,j}\beta_j V_j
\Big(r_{i,j} - \delta_W V_j^\top \frac{\sum_{k=1}^{M} I_{i,k} W_k}{n_i}\Big)\Big].
\end{aligned}
\tag{B.11}
\]
Note that the inclusion of side information affects only the mean, not the precision.
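The Gibbs update in Equation (B.11) can be sanity-checked numerically: the posterior mean must be the stationary point of the log conditional density. A Python/NumPy sketch (illustrative only; the vector `c` stands in for the precomputed side-feature term $\delta_W V_j^\top (\sum_k I_{i,k} W_k)/n_i$, and all inputs are randomly generated):

```python
import numpy as np

rng = np.random.default_rng(3)
d, M = 4, 6
tau, alpha_i = 1.3, 0.8
delta_u = 0.7
I = rng.integers(0, 2, size=M).astype(float); I[0] = 1.0   # at least one rating
beta = rng.gamma(2.0, 1.0, size=M)
V = rng.standard_normal((M, d))
r = rng.standard_normal(M)
c = rng.standard_normal(M)       # precomputed side-feature contribution per item
mu_u = rng.standard_normal(d)
Lam_u = 2.0 * np.eye(d)

# Conditional posterior parameters from Equation (B.11).
Lam_post = Lam_u + delta_u**2 * tau * alpha_i * np.einsum('j,j,jp,jq->pq', I, beta, V, V)
mu_post = np.linalg.solve(
    Lam_post,
    Lam_u @ mu_u + delta_u * tau * alpha_i * ((I * beta * (r - c))[:, None] * V).sum(0))

# The mean must zero the gradient of the log conditional density.
def log_post(u):
    resid = r - delta_u * V @ u - c
    return (-0.5 * tau * alpha_i * np.sum(I * beta * resid**2)
            - 0.5 * (u - mu_u) @ Lam_u @ (u - mu_u))

eps = 1e-5
grad = np.array([(log_post(mu_post + eps * e) - log_post(mu_post - eps * e)) / (2 * eps)
                 for e in np.eye(d)])
assert np.allclose(grad, 0.0, atol=1e-5)
```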
Appendix C
Distributional form of the
Variational Approximation
In this appendix, we derive the optimal variational distributions under the mean field approximation of Equation (2.23). The sections are as follows:

• In Section C.1, we derive the optimal variational distribution for the user features;
• In Section C.2, we derive the optimal variational distribution for the user offsets;
• In Section C.3, we derive the optimal variational distribution for the item features;
• In Section C.4, we derive the optimal variational distribution for the side features;
• In Section C.5, we derive the optimal variational distribution for the user, item, and global precisions;
• In Section C.6, we derive the optimal variational distribution for the user hyperparameters. By symmetry, the results for the item and side hyperparameters follow immediately.
C.1 User Feature Vectors
For the user feature vectors, the terms involving $U_i$ are the conditional likelihood of the ratings $r_{i,j}$ and the prior for the feature vector $U_i$. We then have
\[
\begin{aligned}
&\sum_{j=1}^{M} I_{i,j} \log p(r_{i,j} \mid U_i, V_j, W_{1:M}, \alpha_i, \beta_j, \tau) + \log p(U_i \mid \mu_U, \Lambda_U) \\
&= -\frac{\tau\alpha_i}{2}\sum_{j=1}^{M} I_{i,j}\beta_j (r_{i,j} - \hat r_{i,j})^2
+ \frac{1}{2}\log|\Lambda_U| - \frac{1}{2}(U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U) + \text{const} \\
&= -\frac{\tau\alpha_i}{2}\sum_{j=1}^{M} I_{i,j}\beta_j
\Big(-2\delta_U\Big(r_{i,j} - \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k^\top V_j\Big) V_j^\top U_i
+ \delta_U^2\, U_i^\top V_j V_j^\top U_i\Big) \\
&\qquad - \frac{1}{2}(U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U) + \text{const}.
\end{aligned}
\tag{C.1}
\]
This shows the variational distribution for $U_i$ is Gaussian, with parameters
\[
\begin{aligned}
\mu_{U_i} &= \Lambda_{U_i}^{-1}\Big[\Lambda_U \mu_U + \delta_U\, \tau\alpha_i \sum_{j=1}^{M} I_{i,j}\beta_j V_j
\Big(r_{i,j} - \delta_W V_j^\top \frac{\sum_{k=1}^{M} I_{i,k} W_k}{n_i}\Big)\Big] \\
\Lambda_{U_i} &= \Lambda_U + \delta_U^2\, \tau\alpha_i \sum_{j=1}^{M} I_{i,j}\beta_j V_j V_j^\top.
\end{aligned}
\tag{C.2}
\]
C.2 User Offset
If we include a user offset $\gamma_i$, then the relevant terms are
\[
\begin{aligned}
&\sum_{j=1}^{M} I_{i,j} \log p(r_{i,j} \mid \cdots) + \log p(\gamma_i \mid \mu_\gamma, \lambda_\gamma) \\
&= -\frac{\tau\alpha_i}{2}\sum_{j=1}^{M} I_{i,j}\beta_j (r_{i,j} - \hat r_{i,j})^2
+ \frac{1}{2}\log\lambda_\gamma - \frac{\lambda_\gamma}{2}(\gamma_i - \mu_\gamma)^2 + \text{const}.
\end{aligned}
\tag{C.3}
\]
The quadratic $(r_{i,j} - \hat r_{i,j})^2$ can be rewritten as
\[
\begin{aligned}
(r_{i,j} - \hat r_{i,j})^2 &= (r_{i,j} - \gamma_i - \eta_j - S_i^\top V_j)^2 \\
&= \gamma_i^2 - 2\gamma_i (r_{i,j} - \eta_j - S_i^\top V_j) + (r_{i,j} - \eta_j - S_i^\top V_j)^2,
\end{aligned}
\]
where $S_i$ denotes the combined user and side feature vector. Inserting into Equation (C.3), taking expectations, and retaining only the terms involving $\gamma_i$, we find that the optimal distribution for $\gamma_i$ is univariate Gaussian, with parameters
\[
\begin{aligned}
\lambda_{\gamma_i} &= \tau\alpha_i \sum_{j=1}^{M} I_{i,j}\beta_j + \lambda_\gamma \\
\mu_{\gamma_i} &= \lambda_{\gamma_i}^{-1}\Big[\lambda_\gamma \mu_\gamma + \tau\alpha_i \sum_{j=1}^{M} I_{i,j}\beta_j (r_{i,j} - \eta_j - S_i^\top V_j)\Big].
\end{aligned}
\tag{C.4}
\]
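The offset update in Equation (C.4) admits the same stationarity check as the feature updates. A Python/NumPy sketch (illustrative; `sv` stands in for the precomputed inner products $S_i^\top V_j$):

```python
import numpy as np

rng = np.random.default_rng(4)
M = 8
tau, alpha_i = 1.5, 0.9
I = rng.integers(0, 2, size=M).astype(float); I[0] = 1.0
beta = rng.gamma(2.0, 1.0, size=M)
r = rng.standard_normal(M)
eta = rng.standard_normal(M)     # item offsets
sv = rng.standard_normal(M)      # precomputed S_i' V_j for each item j
mu_g, lam_g = 0.1, 2.0           # prior mean / precision for gamma_i

# Updates from Equation (C.4).
lam_post = tau * alpha_i * np.sum(I * beta) + lam_g
mu_post = (lam_g * mu_g + tau * alpha_i * np.sum(I * beta * (r - eta - sv))) / lam_post

# mu_post must maximize the gamma_i-dependent part of the expected log joint.
def obj(g):
    return (-0.5 * tau * alpha_i * np.sum(I * beta * (r - g - eta - sv)**2)
            - 0.5 * lam_g * (g - mu_g)**2)

eps = 1e-6
grad = (obj(mu_post + eps) - obj(mu_post - eps)) / (2 * eps)
assert abs(grad) < 1e-6
assert lam_post > lam_g   # observing ratings can only increase the precision
```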
C.3 Item Feature Vectors
By symmetry, the terms involving $V_j$ are
\[
\begin{aligned}
&\sum_{i=1}^{N} I_{i,j} \log p(r_{i,j} \mid U_i, V_j, W_{1:M}, \alpha_i, \beta_j, \tau) + \log p(V_j \mid \mu_V, \Lambda_V) \\
&= -\frac{\tau\beta_j}{2}\sum_{i=1}^{N} I_{i,j}\alpha_i (r_{i,j} - \hat r_{i,j})^2
+ \frac{1}{2}\log|\Lambda_V| - \frac{1}{2}(V_j - \mu_V)^\top \Lambda_V (V_j - \mu_V) + \text{const} \\
&= -\frac{\tau\beta_j}{2}\sum_{i=1}^{N} I_{i,j}\alpha_i
\Big[-2 r_{i,j}\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)^{\!\top} V_j
+ V_j^\top \Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)^{\!\top} V_j\Big] \\
&\qquad - \frac{1}{2}(V_j - \mu_V)^\top \Lambda_V (V_j - \mu_V) + \text{const}.
\end{aligned}
\tag{C.5}
\]
This shows the variational distribution for $V_j$ is Gaussian, with parameters
\[
\begin{aligned}
\mu_{V_j} &= \Lambda_{V_j}^{-1}\Big[\Lambda_V \mu_V + \tau\beta_j \sum_{i=1}^{N} I_{i,j}\alpha_i\, r_{i,j}\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)\Big] \\
\Lambda_{V_j} &= \Lambda_V + \tau\beta_j \sum_{i=1}^{N} I_{i,j}\alpha_i \Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)^{\!\top}.
\end{aligned}
\tag{C.6}
\]
C.4 Side Feature Vectors
The terms involving $W_m$ are
\[
\begin{aligned}
&\sum_{i=1}^{N}\sum_{j=1}^{M} I_{i,j} \log p(r_{i,j} \mid U_i, V_j, W_{1:M}, \alpha_i, \beta_j, \tau) + \log p(W_m \mid \mu_W, \Lambda_W) \\
&= -\frac{\tau}{2}\sum_{i=1}^{N}\sum_{j=1}^{M} I_{i,j}\alpha_i\beta_j (r_{i,j} - \hat r_{i,j})^2
- \frac{1}{2}(W_m - \mu_W)^\top \Lambda_W (W_m - \mu_W) + \text{const} \\
&= -\frac{\tau}{2}\sum_{i=1}^{N}\sum_{j=1}^{M} I_{i,j}\alpha_i\beta_j
\Big(-2\frac{\delta_W}{n_i} I_{i,m}\, (r_{i,j} - \hat r_{i,j,-W_m})\, V_j^\top W_m
+ \frac{\delta_W^2}{n_i^2} I_{i,m}\, W_m^\top V_j V_j^\top W_m\Big) \\
&\qquad - \frac{1}{2}(W_m - \mu_W)^\top \Lambda_W (W_m - \mu_W) + \text{const},
\end{aligned}
\tag{C.7}
\]
where $\hat r_{i,j,-W_m} = \big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k\neq m} I_{i,k} W_k\big)^\top V_j$ denotes the prediction made without $W_m$. This shows the variational distribution for $W_m$ is Gaussian with parameters
\[
\begin{aligned}
\mu_{W_m} &= \Lambda_{W_m}^{-1}\Big[\Lambda_W \mu_W + \delta_W \tau \sum_{\substack{(i,j):\\ I_{i,j} I_{i,m}=1}} \frac{\alpha_i\beta_j}{n_i}\, V_j\, (r_{i,j} - \hat r_{i,j,-W_m})\Big] \\
\Lambda_{W_m} &= \Lambda_W + \delta_W^2 \tau \sum_{\substack{(i,j):\\ I_{i,j} I_{i,m}=1}} \frac{\alpha_i\beta_j}{n_i^2}\, V_j V_j^\top.
\end{aligned}
\tag{C.8}
\]
Note the product of the two indicators $I_{i,j} I_{i,m}$: the sums range over the users $i$ who rated item $m$ and, for each such user, over the items $j$ that they rated.
C.5 Precisions
The terms involving $\alpha_i$ are
\[
\begin{aligned}
&\sum_{j=1}^{M} I_{i,j} \log p(r_{i,j} \mid U_i, V_j, W_{1:M}, \alpha_i, \beta_j, \tau) + \log p(\alpha_i \mid a_U, b_U) \\
&= \frac{1}{2}\sum_{j=1}^{M} I_{i,j} \log\alpha_i
- \frac{\tau}{2}\sum_{j=1}^{M} I_{i,j}\beta_j (r_{i,j} - \hat r_{i,j})^2\, \alpha_i
+ (a_U - 1)\log\alpha_i - b_U\alpha_i + \text{const}.
\end{aligned}
\tag{C.9}
\]
This shows the variational distribution for $\alpha_i$ is Gamma, with parameters
\[
\begin{aligned}
a_{U_i} &= a_U + \frac{1}{2}\sum_{j=1}^{M} I_{i,j} \\
b_{U_i} &= b_U + \frac{\tau}{2}\sum_{j=1}^{M} I_{i,j}\beta_j (r_{i,j} - \hat r_{i,j})^2.
\end{aligned}
\tag{C.10}
\]
Identical derivations show the variational distributions for $\beta_j$ and $\tau$ are Gamma, with parameters
\[
\begin{aligned}
a_{V_j} &= a_V + \frac{1}{2}\sum_{i=1}^{N} I_{i,j}, &
b_{V_j} &= b_V + \frac{\tau}{2}\sum_{i=1}^{N} I_{i,j}\alpha_i (r_{i,j} - \hat r_{i,j})^2, \\
\tilde a_\tau &= a_\tau + \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{M} I_{i,j}, &
\tilde b_\tau &= b_\tau + \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{M} I_{i,j}\alpha_i\beta_j (r_{i,j} - \hat r_{i,j})^2.
\end{aligned}
\tag{C.11}
\]
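The precision updates are purely mechanical; a small Python/NumPy sketch of the per-user update in Equation (C.10) (illustrative function and variable names):

```python
import numpy as np

def gamma_precision_update(a_u, b_u, tau, I, beta, sq_err):
    """Variational Gamma update for a per-user precision alpha_i (Equation C.10).

    I       : (M,) 0/1 indicator of which items user i rated
    beta    : (M,) current per-item precision estimates
    sq_err  : (M,) expected squared residuals E[(r_ij - rhat_ij)^2]
    """
    a_post = a_u + 0.5 * I.sum()
    b_post = b_u + 0.5 * tau * np.sum(I * beta * sq_err)
    return a_post, b_post

I = np.array([1.0, 0.0, 1.0, 1.0])
beta = np.array([0.5, 2.0, 1.0, 1.5])
sq = np.array([0.2, 9.9, 0.1, 0.4])   # the unrated item's residual is ignored
a_post, b_post = gamma_precision_update(a_u=2.0, b_u=1.0, tau=2.0, I=I, beta=beta, sq_err=sq)

assert a_post == 2.0 + 1.5            # three observed ratings contribute 3/2
assert np.isclose(b_post, 1.0 + 0.5 * 2.0 * (0.5*0.2 + 1.0*0.1 + 1.5*0.4))
```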
C.6 User / Item / Side Feature Hyperparameters
The terms involving the user hyperparameters $(\mu_U, \Lambda_U)$ are
\[
\begin{aligned}
&\sum_{i=1}^{N} \log p(U_i \mid \mu_U, \Lambda_U) + \log p(\mu_U \mid \mu_0, \beta_0\Lambda_U) + \log p(\Lambda_U \mid \nu_0, W_0) \\
&= \frac{N}{2}\log|\Lambda_U| - \frac{1}{2}\sum_{i=1}^{N} (U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U) \\
&\quad + \frac{1}{2}\log|\Lambda_U| - \frac{\beta_0}{2}(\mu_U - \mu_0)^\top \Lambda_U (\mu_U - \mu_0) \\
&\quad + \frac{\nu_0 - d - 1}{2}\log|\Lambda_U| - \frac{1}{2}\operatorname{tr}\big(W_0^{-1}\Lambda_U\big) + \text{const}.
\end{aligned}
\tag{C.12}
\]
Using completion-of-the-square derivations present in the literature [6], the quadratic terms can be rearranged as
\[
\begin{aligned}
&\sum_{i=1}^{N} (U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U) + \beta_0(\mu_U - \mu_0)^\top \Lambda_U (\mu_U - \mu_0) \\
&= \operatorname{tr}\Big(\Big[\sum_{i=1}^{N} (U_i - \mu_U)(U_i - \mu_U)^\top + \beta_0(\mu_U - \mu_0)(\mu_U - \mu_0)^\top\Big]\Lambda_U\Big) \\
&= \operatorname{tr}\Big(\Big[N(\bar U - \mu_U)(\bar U - \mu_U)^\top + \sum_{i=1}^{N} (U_i - \bar U)(U_i - \bar U)^\top + \beta_0(\mu_U - \mu_0)(\mu_U - \mu_0)^\top\Big]\Lambda_U\Big) \\
&= \operatorname{tr}\Big(\Big[(N + \beta_0)(\mu_U - \bar\mu_U)(\mu_U - \bar\mu_U)^\top + \frac{N\beta_0}{N+\beta_0}(\bar U - \mu_0)(\bar U - \mu_0)^\top + \sum_{i=1}^{N} (U_i - \bar U)(U_i - \bar U)^\top\Big]\Lambda_U\Big),
\end{aligned}
\tag{C.13}
\]
where we have defined
\[
\bar\mu_U = \frac{N\bar U + \beta_0\mu_0}{N + \beta_0}.
\tag{C.14}
\]
We can now write the $(\mu_U, \Lambda_U)$ terms as
\[
\begin{aligned}
&\sum_{i=1}^{N} \log p(U_i \mid \mu_U, \Lambda_U) + \log p(\mu_U \mid \mu_0, \beta_0\Lambda_U) + \log p(\Lambda_U \mid \nu_0, W_0) \\
&= \frac{1}{2}\log|\Lambda_U| - \frac{1}{2}(\mu_U - \bar\mu_U)^\top \big[(N+\beta_0)\Lambda_U\big](\mu_U - \bar\mu_U) \\
&\quad + \frac{N + \nu_0 - d - 1}{2}\log|\Lambda_U| \\
&\quad - \frac{1}{2}\operatorname{tr}\Big(\Big[W_0^{-1} + \frac{N\beta_0}{N+\beta_0}(\bar U - \mu_0)(\bar U - \mu_0)^\top + \sum_{i=1}^{N} (U_i - \bar U)(U_i - \bar U)^\top\Big]\Lambda_U\Big) + \text{const}.
\end{aligned}
\tag{C.15}
\]
This shows the variational distribution for $(\mu_U, \Lambda_U)$ is a Normal-Wishart with parameters
\[
\begin{aligned}
\bar\mu_U &= \frac{N\bar U + \beta_0\mu_0}{N + \beta_0} \\
\bar\Lambda_U &= (N + \beta_0)\Lambda_U \\
\bar\nu_U &= N + \nu_0 \\
\bar W_U^{-1} &= W_0^{-1} + \frac{N\beta_0}{N+\beta_0}(\bar U - \mu_0)(\bar U - \mu_0)^\top + \sum_{i=1}^{N} (U_i - \bar U)(U_i - \bar U)^\top.
\end{aligned}
\tag{C.16}
\]
Analogous statements (with the appropriate sample sizes, feature averages, etc.) hold for the item and side feature vectors.
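The parameter updates in Equation (C.16) translate directly into code. A Python/NumPy sketch (illustrative function and variable names; `U` holds the current user feature means as rows):

```python
import numpy as np

def normal_wishart_update(U, mu0, beta0, nu0, W0):
    """Variational Normal-Wishart update for (mu_U, Lambda_U), Equation (C.16).

    U : (N, d) matrix whose rows are the current user feature means.
    Returns (mu_bar, beta_bar, nu_bar, W_bar_inv), where beta_bar = N + beta0
    is the multiplier on Lambda_U in the conditional precision of mu_U.
    """
    N, d = U.shape
    Ubar = U.mean(axis=0)
    S = (U - Ubar).T @ (U - Ubar)                      # scatter about the sample mean
    mu_bar = (N * Ubar + beta0 * mu0) / (N + beta0)
    beta_bar = N + beta0
    nu_bar = N + nu0
    diff = (Ubar - mu0)[:, None]
    W_bar_inv = np.linalg.inv(W0) + (N * beta0 / (N + beta0)) * (diff @ diff.T) + S
    return mu_bar, beta_bar, nu_bar, W_bar_inv

rng = np.random.default_rng(5)
U = rng.standard_normal((50, 3))
mu_bar, beta_bar, nu_bar, Winv = normal_wishart_update(
    U, mu0=np.zeros(3), beta0=2.0, nu0=4.0, W0=np.eye(3))

assert nu_bar == 54.0 and beta_bar == 52.0
assert np.allclose(Winv, Winv.T)                       # posterior scale stays symmetric
assert np.all(np.linalg.eigvalsh(Winv) > 0)            # and positive definite
```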
Appendix D
Derivation of the Variational Lower
Bound
Chapter 2 defined the basic matrix factorization model for collaborative filtering and discussed the extension of the constrained PMF model to the Bayesian framework. We outlined inference for the relevant parameters under both Gibbs sampling and a variational mean field approximation. Chapter 3 extended this to a collection of heteroskedastic models. In this appendix, we derive the variational lower bound for these models.

From Section 2.4.2, the lower bound takes the form
\[
\mathbb{E}_Q[\log p(\theta, \mathcal{D})] + H(Q),
\tag{D.1}
\]
where $\mathbb{E}_Q$ denotes the expectation under the variational approximation $Q$ (the first term is known as the expected complete log-likelihood), and $H(Q)$ denotes the entropy of the distribution.
D.1 Complete Log-Likelihood
From the definition of the model, the expected complete log-likelihood is
\[
\begin{aligned}
\mathbb{E}_Q[\log p(\theta, \mathcal{D})]
&= \mathbb{E}_Q[\log p(U_{1:N}, V_{1:M}, W_{1:M}, \alpha_{1:N}, \beta_{1:M}, \tau, \gamma_{1:N}, \eta_{1:M}, \mu_U, \Lambda_U, \mu_V, \Lambda_V, \mu_W, \Lambda_W, R)] \\
&= \sum_{i=1}^{N}\sum_{j=1}^{M} I_{i,j}\,\mathbb{E}_Q[\log p(r_{i,j} \mid U, V, W, \gamma, \eta, \alpha, \beta, \tau)] \\
&\quad + \sum_{i=1}^{N} \mathbb{E}_Q[\log p(U_i \mid \mu_U, \Lambda_U)]
+ \sum_{j=1}^{M} \mathbb{E}_Q[\log p(V_j \mid \mu_V, \Lambda_V)]
+ \sum_{k=1}^{M} \mathbb{E}_Q[\log p(W_k \mid \mu_W, \Lambda_W)] \\
&\quad + \sum_{i=1}^{N} \mathbb{E}_Q[\log p(\gamma_i \mid \mu_\gamma, \lambda_\gamma)]
+ \sum_{j=1}^{M} \mathbb{E}_Q[\log p(\eta_j \mid \mu_\eta, \lambda_\eta)] \\
&\quad + \sum_{i=1}^{N} \mathbb{E}_Q[\log p(\alpha_i \mid a_U, b_U)]
+ \sum_{j=1}^{M} \mathbb{E}_Q[\log p(\beta_j \mid a_V, b_V)]
+ \mathbb{E}_Q[\log p(\tau \mid a_\tau, b_\tau)] \\
&\quad + \mathbb{E}_Q[\log p(\mu_U, \Lambda_U \mid \mu_0, \beta_0, \nu_0, W_0)]
+ \mathbb{E}_Q[\log p(\mu_V, \Lambda_V \mid \mu_0, \beta_0, \nu_0, W_0)]
+ \mathbb{E}_Q[\log p(\mu_W, \Lambda_W \mid \mu_0, \beta_0, \nu_0, W_0)].
\end{aligned}
\tag{D.2}
\]
We will analyze each of the expectations in Equation (D.2) individually in the appropriately named sections that follow.
D.1.1 Rating
For the conditional density of the rating,
\[
\begin{aligned}
\mathbb{E}_Q\big[I_{i,j} \log p(r_{i,j} \mid r_{-(i,j)}, U_i, V_j, W_{1:M}, \alpha_i, \beta_j, \tau)\big]
&= \frac{I_{i,j}}{2}\,\mathbb{E}_Q\big[\log\alpha_i + \log\beta_j + \log\tau - \tau\alpha_i\beta_j (r_{i,j} - \hat r_{i,j})^2\big] \\
&= \frac{I_{i,j}}{2}\Big(\mathbb{E}_Q[\log\alpha_i] + \mathbb{E}_Q[\log\beta_j] + \mathbb{E}_Q[\log\tau]
- \mathbb{E}_Q[\tau]\,\mathbb{E}_Q[\alpha_i]\,\mathbb{E}_Q[\beta_j]\,\mathbb{E}_Q\big[(r_{i,j} - \hat r_{i,j})^2\big]\Big).
\end{aligned}
\tag{D.3}
\]
Expanding the quadratic,
\[
\mathbb{E}_Q[(r_{i,j} - \hat r_{i,j})^2]
= \mathbb{E}_Q[r_{i,j}^2 - 2 r_{i,j}\hat r_{i,j} + \hat r_{i,j}^2]
= r_{i,j}^2 - 2 r_{i,j}\,\mathbb{E}_Q[\hat r_{i,j}] + \mathbb{E}_Q[\hat r_{i,j}^2].
\tag{D.4}
\]
Linearity of expectation and the independence assumptions of the variational approximation give a simple result for the first-order term,
\[
\begin{aligned}
-2 r_{i,j}\,\mathbb{E}_Q[\hat r_{i,j}]
&= -2 r_{i,j}\,\mathbb{E}_Q\Big[\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)^{\!\top} V_j\Big] \\
&= -2 r_{i,j}\Big(\delta_U\,\mathbb{E}_Q[U_i]^\top \mathbb{E}_Q[V_j] + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k}\,\mathbb{E}_Q[W_k]^\top \mathbb{E}_Q[V_j]\Big) \\
&= -2 r_{i,j}\Big(\delta_U\,\mu_{U_i}^\top \mu_{V_j} + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k}\,\mu_{W_k}^\top \mu_{V_j}\Big).
\end{aligned}
\tag{D.5}
\]
For the second moment, expanding the square leads to three additional terms,
\[
\mathbb{E}_Q[\hat r_{i,j}^2]
= \delta_U^2\,\mathbb{E}_Q\big[V_j^\top U_i U_i^\top V_j\big]
+ 2\frac{\delta_U\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k}\,\mathbb{E}_Q\big[V_j^\top U_i W_k^\top V_j\big]
+ \Big(\frac{\delta_W}{n_i}\Big)^{\!2}\,\mathbb{E}_Q\Big[\Big(V_j^\top \sum_{k=1}^{M} I_{i,k} W_k\Big)\Big(\sum_{\ell=1}^{M} I_{i,\ell} W_\ell^\top V_j\Big)\Big].
\tag{D.6}
\]
For the first, involving only the user and item features,
\[
\begin{aligned}
\delta_U^2\,\mathbb{E}_Q\big[V_j^\top U_i U_i^\top V_j\big]
&= \delta_U^2\,\operatorname{tr}\big(\mathbb{E}_Q[V_j V_j^\top U_i U_i^\top]\big)
= \delta_U^2\,\operatorname{tr}\big(\mathbb{E}_Q[V_j V_j^\top]\,\mathbb{E}_Q[U_i U_i^\top]\big) \\
&= \delta_U^2\,\operatorname{tr}\big((\operatorname{Var}_Q[V_j] + \mathbb{E}_Q[V_j]\mathbb{E}_Q[V_j]^\top)(\operatorname{Var}_Q[U_i] + \mathbb{E}_Q[U_i]\mathbb{E}_Q[U_i]^\top)\big) \\
&= \delta_U^2\,\operatorname{tr}\big((\Lambda_{V_j}^{-1} + \mu_{V_j}\mu_{V_j}^\top)(\Lambda_{U_i}^{-1} + \mu_{U_i}\mu_{U_i}^\top)\big).
\end{aligned}
\tag{D.7}
\]
For the term involving user, item, and side features,
\[
\begin{aligned}
2\frac{\delta_U\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k}\,\mathbb{E}_Q\big[V_j^\top U_i W_k^\top V_j\big]
&= 2\frac{\delta_U\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k}\,\operatorname{tr}\big(\mathbb{E}_Q[V_j V_j^\top U_i W_k^\top]\big) \\
&= 2\frac{\delta_U\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k}\,\operatorname{tr}\big(\mathbb{E}_Q[V_j V_j^\top]\,\mathbb{E}_Q[U_i]\,\mathbb{E}_Q[W_k]^\top\big) \\
&= 2\frac{\delta_U\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k}\,\operatorname{tr}\big((\Lambda_{V_j}^{-1} + \mu_{V_j}\mu_{V_j}^\top)\,\mu_{U_i}\mu_{W_k}^\top\big).
\end{aligned}
\tag{D.8}
\]
For the final term, involving only item and side features,
\[
\begin{aligned}
\Big(\frac{\delta_W}{n_i}\Big)^{\!2}\,\mathbb{E}_Q\Big[\Big(V_j^\top \sum_{k=1}^{M} I_{i,k} W_k\Big)\Big(\sum_{\ell=1}^{M} I_{i,\ell} W_\ell^\top V_j\Big)\Big]
&= \Big(\frac{\delta_W}{n_i}\Big)^{\!2}\,\mathbb{E}_Q\Big[V_j^\top\Big(\sum_{k=1}^{M} I_{i,k} W_k W_k^\top + \sum_{k\neq\ell} I_{i,k} I_{i,\ell} W_k W_\ell^\top\Big)V_j\Big] \\
&= \Big(\frac{\delta_W}{n_i}\Big)^{\!2}\Big(\sum_{k=1}^{M} I_{i,k}\,\operatorname{tr}\big(\mathbb{E}_Q[V_j V_j^\top]\,\mathbb{E}_Q[W_k W_k^\top]\big)
+ \sum_{k\neq\ell} I_{i,k} I_{i,\ell}\,\operatorname{tr}\big(\mathbb{E}_Q[V_j V_j^\top]\,\mathbb{E}_Q[W_k]\,\mathbb{E}_Q[W_\ell]^\top\big)\Big) \\
&= \Big(\frac{\delta_W}{n_i}\Big)^{\!2}\Big(\sum_{k=1}^{M} I_{i,k}\,\operatorname{tr}\big((\Lambda_{V_j}^{-1} + \mu_{V_j}\mu_{V_j}^\top)(\Lambda_{W_k}^{-1} + \mu_{W_k}\mu_{W_k}^\top)\big)
+ \sum_{k\neq\ell} I_{i,k} I_{i,\ell}\,\operatorname{tr}\big((\Lambda_{V_j}^{-1} + \mu_{V_j}\mu_{V_j}^\top)\,\mu_{W_k}\mu_{W_\ell}^\top\big)\Big).
\end{aligned}
\tag{D.9}
\]
Combining Equations (D.7), (D.8), and (D.9) with the first-order term in Equation (D.5) yields
\[
\begin{aligned}
\mathbb{E}_Q[(r_{i,j} - \hat r_{i,j})^2]
&= r_{i,j}^2
- 2 r_{i,j}\Big(\delta_U\,\mu_{U_i}^\top \mu_{V_j} + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k}\,\mu_{W_k}^\top \mu_{V_j}\Big)
+ \delta_U^2\,\operatorname{tr}\big((\Lambda_{V_j}^{-1} + \mu_{V_j}\mu_{V_j}^\top)(\Lambda_{U_i}^{-1} + \mu_{U_i}\mu_{U_i}^\top)\big) \\
&\quad + 2\frac{\delta_U\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k}\,\operatorname{tr}\big((\Lambda_{V_j}^{-1} + \mu_{V_j}\mu_{V_j}^\top)\,\mu_{U_i}\mu_{W_k}^\top\big) \\
&\quad + \Big(\frac{\delta_W}{n_i}\Big)^{\!2}\Big[\sum_{k=1}^{M} I_{i,k}\,\operatorname{tr}\big((\Lambda_{V_j}^{-1} + \mu_{V_j}\mu_{V_j}^\top)(\Lambda_{W_k}^{-1} + \mu_{W_k}\mu_{W_k}^\top)\big)
+ \sum_{k\neq\ell} I_{i,k} I_{i,\ell}\,\operatorname{tr}\big((\Lambda_{V_j}^{-1} + \mu_{V_j}\mu_{V_j}^\top)\,\mu_{W_k}\mu_{W_\ell}^\top\big)\Big].
\end{aligned}
\tag{D.10}
\]
Combining with the precision factors in Equation (D.3) and summing over all ratings yields
\[
\begin{aligned}
\sum_{i=1}^{N}\sum_{j=1}^{M}\mathbb{E}_Q\big[I_{i,j} \log p(r_{i,j} \mid r_{-(i,j)}, U_i, V_j, W_{1:M}, \alpha_i, \beta_j, \tau)\big]
= \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{M} I_{i,j}\Big(&\big[\psi(a_{U_i}) - \log b_{U_i} + \psi(a_{V_j}) - \log b_{V_j} + \psi(\tilde a_\tau) - \log\tilde b_\tau\big] \\
&- \frac{\tilde a_\tau}{\tilde b_\tau}\,\frac{a_{U_i}}{b_{U_i}}\,\frac{a_{V_j}}{b_{V_j}}\,\mathbb{E}_Q[(r_{i,j} - \hat r_{i,j})^2]\Big),
\end{aligned}
\tag{D.11}
\]
with $\mathbb{E}_Q[(r_{i,j} - \hat r_{i,j})^2]$ given by Equation (D.10).
D.1.2 User Features
For the conditional density of the user latent features,
\[
\mathbb{E}_Q[\log p(U_i \mid \mu_U, \Lambda_U)]
= \frac{1}{2}\,\mathbb{E}_Q[\log|\Lambda_U|] - \frac{1}{2}\,\mathbb{E}_Q[(U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U)] + \text{const}.
\tag{D.12}
\]
For the quadratic form, we use conditional expectation, as $(\mu_U, \Lambda_U)$ is jointly Normal-Wishart under the variational approximation and hence the two are not independent:
\[
\mathbb{E}_Q[(U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U)]
= \mathbb{E}_Q[U_i^\top \Lambda_U U_i] - 2\,\mathbb{E}_Q[U_i]^\top\mathbb{E}_Q[\Lambda_U \mu_U] + \mathbb{E}_Q[\mu_U^\top \Lambda_U \mu_U].
\tag{D.13}
\]
Using the trace on the first term yields
\[
\begin{aligned}
\mathbb{E}_Q[U_i^\top \Lambda_U U_i]
&= \operatorname{tr}\big(\mathbb{E}_Q[\Lambda_U U_i U_i^\top]\big)
= \operatorname{tr}\big(\mathbb{E}_Q[\Lambda_U]\,\mathbb{E}_Q[U_i U_i^\top]\big) \\
&= \operatorname{tr}\big(\mathbb{E}_Q[\Lambda_U]\,(\operatorname{Var}_Q[U_i] + \mathbb{E}_Q[U_i]\mathbb{E}_Q[U_i]^\top)\big)
= \operatorname{tr}\big(\bar\nu_U \bar W_U\,(\Lambda_{U_i}^{-1} + \mu_{U_i}\mu_{U_i}^\top)\big).
\end{aligned}
\tag{D.14}
\]
Iterated expectation on the second gives
\[
\mathbb{E}_Q[U_i]^\top \mathbb{E}_Q[\Lambda_U \mu_U]
= \mathbb{E}_Q[U_i]^\top \mathbb{E}_Q\big[\Lambda_U\,\mathbb{E}_Q[\mu_U \mid \Lambda_U]\big]
= \bar\nu_U\, \mu_{U_i}^\top \bar W_U \bar\mu_U,
\tag{D.15}
\]
while both techniques applied to the third yield
\[
\begin{aligned}
\mathbb{E}_Q[\mu_U^\top \Lambda_U \mu_U]
&= \operatorname{tr}\big(\mathbb{E}_Q[\Lambda_U \mu_U \mu_U^\top]\big)
= \operatorname{tr}\big(\mathbb{E}_Q\big[\Lambda_U\,\mathbb{E}_Q[\mu_U\mu_U^\top \mid \Lambda_U]\big]\big) \\
&= \operatorname{tr}\big(\mathbb{E}_Q\big[\Lambda_U\,(\operatorname{Var}_Q[\mu_U \mid \Lambda_U] + \bar\mu_U\bar\mu_U^\top)\big]\big)
= \bar\nu_U\,\bar\mu_U^\top \bar W_U \bar\mu_U + \frac{d}{N + \beta_0},
\end{aligned}
\tag{D.16}
\]
since $\operatorname{Var}_Q[\mu_U \mid \Lambda_U] = \big((N+\beta_0)\Lambda_U\big)^{-1}$. This simplifies to
\[
\mathbb{E}_Q[(U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U)]
= \bar\nu_U\Big[(\mu_{U_i} - \bar\mu_U)^\top \bar W_U (\mu_{U_i} - \bar\mu_U)
+ \operatorname{tr}\big(\bar W_U \Lambda_{U_i}^{-1}\big)\Big] + \frac{d}{N+\beta_0}.
\tag{D.17}
\]
The log-precision expectation gives
\[
\mathbb{E}_Q[\log|\Lambda_U|] = \sum_{i=1}^{d} \psi\Big(\frac{\bar\nu_U + 1 - i}{2}\Big) + d\log 2 + \log|\bar W_U|.
\tag{D.18}
\]
Combining Equations (D.13)–(D.18) and dividing by two gives the contribution to the variational lower bound from the user features,
\[
\mathbb{E}_Q[\log p(U_i \mid \mu_U, \Lambda_U)]
= \frac{1}{2}\Big[\sum_{i=1}^{d} \psi\Big(\frac{\bar\nu_U + 1 - i}{2}\Big) + d\log 2 + \log|\bar W_U|
- \bar\nu_U\Big((\mu_{U_i} - \bar\mu_U)^\top \bar W_U (\mu_{U_i} - \bar\mu_U)
+ \operatorname{tr}\big(\bar W_U \Lambda_{U_i}^{-1}\big)\Big)\Big] + \text{const}.
\tag{D.19}
\]
D.1.3 User Precision
For the conditional density of the user precision,
\[
\begin{aligned}
\mathbb{E}_Q[\log p(\alpha_i \mid a_U, b_U)]
&= \mathbb{E}_Q[a_U \log b_U - \log\Gamma(a_U) + (a_U - 1)\log\alpha_i - b_U\alpha_i] \\
&= C + (a_U - 1)\,\mathbb{E}_Q[\log\alpha_i] - b_U\,\mathbb{E}_Q[\alpha_i] \\
&= C + (a_U - 1)\big(\psi(a_{U_i}) - \log b_{U_i}\big) - b_U\,\frac{a_{U_i}}{b_{U_i}},
\end{aligned}
\tag{D.20}
\]
where $\psi(\cdot) = \frac{d}{dx}\log\Gamma(x)$ is the digamma function.
D.1.4 User Bias
For the user bias $\gamma_i$, the contribution to the variational lower bound is
\[
\begin{aligned}
\mathbb{E}_Q[\log p(\gamma_i)]
&= \frac{1}{2}\,\mathbb{E}_Q[\log\lambda_\gamma] - \frac{\lambda_\gamma}{2}\,\mathbb{E}_Q[(\gamma_i - \mu_\gamma)^2] \\
&= \frac{1}{2}\log\lambda_\gamma - \frac{\lambda_\gamma}{2}\big[\operatorname{Var}_Q[\gamma_i] + (\mathbb{E}_Q[\gamma_i] - \mu_\gamma)^2\big] \\
&= \frac{1}{2}\log\lambda_\gamma - \frac{\lambda_\gamma}{2}\big[\lambda_{\gamma_i}^{-1} + (\mu_{\gamma_i} - \mu_\gamma)^2\big].
\end{aligned}
\tag{D.21}
\]
The item bias contributions are analogous.
D.1.5 User Hyperparameters

For the conditional density of the user hyperparameters $(\mu_U, \Lambda_U)$, the contribution to the variational lower bound is
\[
\mathbb{E}_Q[\log p(\mu_U, \Lambda_U)]
= \mathbb{E}_Q[\log p(\mu_U \mid \mu_0, \beta_0\Lambda_U)] + \mathbb{E}_Q[\log p(\Lambda_U \mid \nu_0, W_0)].
\tag{D.22}
\]
The first term contains a factor of $\log|\Lambda_U|$, whose expectation was derived in Equation (D.18), and the quadratic with respect to $\mu_U$. For the quadratic term, we rearrange under the trace to obtain
\[
\begin{aligned}
\mathbb{E}_Q[(\mu_U - \mu_0)^\top \beta_0\Lambda_U (\mu_U - \mu_0)]
&= \beta_0\,\mathbb{E}_Q\big[\operatorname{tr}\big(\Lambda_U (\mu_U - \mu_0)(\mu_U - \mu_0)^\top\big)\big] \\
&= \beta_0\,\operatorname{tr}\big(\mathbb{E}_Q[\Lambda_U]\,\mathbb{E}_Q[(\mu_U - \mu_0)(\mu_U - \mu_0)^\top]\big) \\
&= \beta_0\,\operatorname{tr}\big(\mathbb{E}_Q[\Lambda_U]\,\big(\operatorname{Var}_Q[\mu_U] + (\mathbb{E}_Q[\mu_U] - \mu_0)(\mathbb{E}_Q[\mu_U] - \mu_0)^\top\big)\big) \\
&= \beta_0\,\operatorname{tr}\big(\bar\nu_U \bar W_U\,\big(\bar\Lambda_U^{-1} + (\bar\mu_U - \mu_0)(\bar\mu_U - \mu_0)^\top\big)\big) \\
&= \bar\nu_U\beta_0\big(\operatorname{tr}(\bar W_U \bar\Lambda_U^{-1}) + (\bar\mu_U - \mu_0)^\top \bar W_U (\bar\mu_U - \mu_0)\big).
\end{aligned}
\tag{D.23}
\]
Subtracting Equation (D.23) from Equation (D.18) and dividing by two gives the contribution to the lower bound from the conditional distribution of the user latent feature mean, the first term in Equation (D.22):
\[
\mathbb{E}_Q[\log p(\mu_U \mid \mu_0, \beta_0\Lambda_U)]
= \frac{1}{2}\Big[\sum_{i=1}^{d} \psi\Big(\frac{\bar\nu_U + 1 - i}{2}\Big) + d\log 2 + \log|\bar W_U|
- \bar\nu_U\beta_0\big(\operatorname{tr}(\bar W_U \bar\Lambda_U^{-1}) + (\bar\mu_U - \mu_0)^\top \bar W_U (\bar\mu_U - \mu_0)\big)\Big] + \text{const}.
\tag{D.24}
\]
For the second term in Equation (D.22), the Wishart on the user precision matrix, we have
\[
\begin{aligned}
\mathbb{E}_Q[\log p(\Lambda_U \mid W_0, \nu_0)]
&= \frac{\nu_0 - d - 1}{2}\,\mathbb{E}_Q[\log|\Lambda_U|] - \frac{1}{2}\operatorname{tr}\big(W_0^{-1}\,\mathbb{E}_Q[\Lambda_U]\big) + \text{const} \\
&= \frac{\nu_0 - d - 1}{2}\Big[\sum_{i=1}^{d} \psi\Big(\frac{\bar\nu_U + 1 - i}{2}\Big) + d\log 2 + \log|\bar W_U|\Big]
- \frac{\bar\nu_U}{2}\operatorname{tr}\big(W_0^{-1}\bar W_U\big) + \text{const}.
\end{aligned}
\tag{D.25}
\]
Combining Equations (D.24) and (D.25) yields the contribution of interest, Equation (D.22), as
\[
\begin{aligned}
\mathbb{E}_Q[\log p(\mu_U, \Lambda_U)]
&= \frac{1}{2}\Big[\sum_{i=1}^{d} \psi\Big(\frac{\bar\nu_U + 1 - i}{2}\Big) + d\log 2 + \log|\bar W_U|
- \bar\nu_U\beta_0\big(\operatorname{tr}(\bar W_U \bar\Lambda_U^{-1}) + (\bar\mu_U - \mu_0)^\top \bar W_U (\bar\mu_U - \mu_0)\big)\Big] \\
&\quad + \frac{\nu_0 - d - 1}{2}\Big[\sum_{i=1}^{d} \psi\Big(\frac{\bar\nu_U + 1 - i}{2}\Big) + d\log 2 + \log|\bar W_U|\Big]
- \frac{\bar\nu_U}{2}\operatorname{tr}\big(W_0^{-1}\bar W_U\big) + \text{const}.
\end{aligned}
\tag{D.26}
\]
D.2 Entropy

Given the mean field approximation, the entropy factorizes into a series of terms,
\[
\begin{aligned}
H[Q] &= -\mathbb{E}_Q[\log Q(\tau, \alpha_{1:N}, \beta_{1:M}, U_{1:N}, V_{1:M}, W_{1:M}, \gamma_{1:N}, \eta_{1:M}, \mu_U, \Lambda_U, \mu_V, \Lambda_V, \mu_W, \Lambda_W)] \\
&= -\mathbb{E}_Q[\log Q(\tau)]
- \sum_{i=1}^{N}\mathbb{E}_Q[\log Q(U_i)]
- \sum_{i=1}^{N}\mathbb{E}_Q[\log Q(\alpha_i)]
- \sum_{i=1}^{N}\mathbb{E}_Q[\log Q(\gamma_i)] \\
&\quad - \sum_{j=1}^{M}\mathbb{E}_Q[\log Q(V_j)]
- \sum_{j=1}^{M}\mathbb{E}_Q[\log Q(\beta_j)]
- \sum_{j=1}^{M}\mathbb{E}_Q[\log Q(\eta_j)]
- \sum_{k=1}^{M}\mathbb{E}_Q[\log Q(W_k)] \\
&\quad - \mathbb{E}_Q[\log Q(\mu_U, \Lambda_U)]
- \mathbb{E}_Q[\log Q(\mu_V, \Lambda_V)]
- \mathbb{E}_Q[\log Q(\mu_W, \Lambda_W)].
\end{aligned}
\tag{D.27}
\]
We will analyze each of the expectations in Equation (D.27) individually in the appropriately named sections that follow.
D.2.1 Feature Vectors
We derive the contribution from a single user feature:
\[
\begin{aligned}
\mathbb{E}_Q[\log Q(U_i)]
&= \frac{1}{2}\,\mathbb{E}_Q[\log|\Lambda_{U_i}|] - \frac{1}{2}\,\mathbb{E}_Q[(U_i - \mu_{U_i})^\top \Lambda_{U_i} (U_i - \mu_{U_i})] + \text{const} \\
&= \frac{1}{2}\log|\Lambda_{U_i}| - \frac{d}{2} + \text{const}.
\end{aligned}
\tag{D.28}
\]
The first term involves only a variational parameter, so its expectation is immediate, while the second term is constant: by Section A.2.1, the expectation of a quadratic form centered at its own mean and scaled by the matching precision is $\operatorname{tr}(\Lambda_{U_i}\Lambda_{U_i}^{-1}) = d$.
D.2.2 Precision Terms
We derive the contribution from the global precision factor; the other precisions follow analogously.
\[
\begin{aligned}
\mathbb{E}_Q[\log Q(\tau)]
&= \tilde a_\tau \log\tilde b_\tau - \log\Gamma(\tilde a_\tau) + (\tilde a_\tau - 1)\,\mathbb{E}_Q[\log\tau] - \tilde b_\tau\,\mathbb{E}_Q[\tau] \\
&= \tilde a_\tau \log\tilde b_\tau - \log\Gamma(\tilde a_\tau) + (\tilde a_\tau - 1)\big(\psi(\tilde a_\tau) - \log\tilde b_\tau\big) - \tilde b_\tau\,\frac{\tilde a_\tau}{\tilde b_\tau} \\
&= \log\tilde b_\tau - \log\Gamma(\tilde a_\tau) + (\tilde a_\tau - 1)\,\psi(\tilde a_\tau) - \tilde a_\tau.
\end{aligned}
\tag{D.29}
\]
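Equation (D.29) is minus the entropy of a Gamma distribution, which gives an easy check against a library implementation. A Python/SciPy sketch (illustrative; note SciPy parametrizes the Gamma by a scale, i.e. $1/\tilde b_\tau$):

```python
import numpy as np
from scipy.special import digamma, gammaln
from scipy.stats import gamma

a, b = 4.2, 1.7   # variational shape / rate for tau

# E_Q[log Q(tau)] from Equation (D.29).
e_logq = np.log(b) - gammaln(a) + (a - 1) * digamma(a) - a

# It must equal minus the Gamma entropy.
assert np.isclose(-e_logq, gamma(a, scale=1.0 / b).entropy())
```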
D.2.3 User Bias
For the user bias $\gamma_i$, the contribution to the entropy is
\[
\begin{aligned}
\mathbb{E}_Q[\log Q(\gamma_i)]
&= \frac{1}{2}\,\mathbb{E}_Q[\log\lambda_{\gamma_i}] - \frac{\lambda_{\gamma_i}}{2}\,\mathbb{E}_Q[(\gamma_i - \mu_{\gamma_i})^2] \\
&= \frac{1}{2}\log\lambda_{\gamma_i} - \frac{\lambda_{\gamma_i}}{2}\operatorname{Var}_Q[\gamma_i]
= \frac{1}{2}\log\lambda_{\gamma_i} - \frac{\lambda_{\gamma_i}}{2}\,\frac{1}{\lambda_{\gamma_i}}
= \frac{1}{2}\log\lambda_{\gamma_i} - \frac{1}{2}.
\end{aligned}
\tag{D.30}
\]
The item bias contributions are analogous.
D.2.4 Hyperparameters

We derive the contribution from the user hyperparameters $(\mu_U, \Lambda_U)$,
\[
\begin{aligned}
\mathbb{E}_Q[\log Q(\mu_U, \Lambda_U)]
&= \mathbb{E}_Q[\log Q(\mu_U \mid \Lambda_U)] + \mathbb{E}_Q[\log Q(\Lambda_U)] \\
&= \mathbb{E}_Q\Big[\frac{1}{2}\log|\Lambda_U| - \frac{N+\beta_0}{2}(\mu_U - \bar\mu_U)^\top \Lambda_U (\mu_U - \bar\mu_U)\Big] \\
&\quad + \mathbb{E}_Q\Big[-\frac{\bar\nu_U}{2}\log|\bar W_U| + \frac{\bar\nu_U - d - 1}{2}\log|\Lambda_U| - \frac{1}{2}\operatorname{tr}\big(\bar W_U^{-1}\Lambda_U\big)\Big] + \text{const}.
\end{aligned}
\tag{D.31}
\]
The expectation of the quadratic form is the constant $d/2$ (as in Section D.2.1), while $\log|\bar W_U|$ is a variational parameter. The remaining terms contribute
\[
\begin{aligned}
\mathbb{E}_Q[\log Q(\mu_U, \Lambda_U)]
&= \frac{\bar\nu_U - d}{2}\,\mathbb{E}_Q[\log|\Lambda_U|] - \frac{d}{2} - \frac{\bar\nu_U}{2}\log|\bar W_U|
- \frac{1}{2}\operatorname{tr}\big(\bar W_U^{-1}\,\bar\nu_U \bar W_U\big) + \text{const} \\
&= \frac{\bar\nu_U - d}{2}\Big[\sum_{i=1}^{d} \psi\Big(\frac{\bar\nu_U + 1 - i}{2}\Big) + d\log 2\Big]
- \frac{d}{2}\log|\bar W_U| - \frac{d}{2} - \frac{\bar\nu_U d}{2} + \text{const}.
\end{aligned}
\tag{D.32}
\]
Appendix E
Meta Constrained PMF
In this appendix, we derive the sampling distribution of the meta-feature vectors, or side features, for
the meta-constrained PMF model of Chapter 4. In addition, we derive the sampling distribution of the
user features in the presence of the meta-features.
E.1 Meta Features
Under the model, the terms involved in the sampling distribution for $W_k$ are the prior for $W_k$ and the priors of all user features,
\[
\begin{aligned}
\log p(W_k \mid \cdots)
&= \sum_{i=1}^{N} \log p(U_i \mid \cdots) + \log p(W_k) \\
&= \text{const} - \frac{1}{2}\sum_{i}\Big(U_i - \mu_U - \|f_i\|^{-1}\sum_{j} W_j f_{j,i}\Big)^{\!\top}\Lambda_U\Big(U_i - \mu_U - \|f_i\|^{-1}\sum_{j} W_j f_{j,i}\Big)
- \frac{1}{2}(W_k - \mu_W)^\top \Lambda_W (W_k - \mu_W) \\
&= \text{const} - \frac{1}{2}\sum_{i}\Big[\Big(U_i - \mu_U - \|f_i\|^{-1}\sum_{j\neq k} W_j f_{j,i}\Big)^{\!\top}\Lambda_U\Big(U_i - \mu_U - \|f_i\|^{-1}\sum_{j\neq k} W_j f_{j,i}\Big) \\
&\qquad\qquad - 2\|f_i\|^{-1} f_{k,i}\, W_k^\top \Lambda_U\Big(U_i - \mu_U - \|f_i\|^{-1}\sum_{j\neq k} W_j f_{j,i}\Big)
+ \|f_i\|^{-2} f_{k,i}^2\, W_k^\top \Lambda_U W_k\Big] \\
&\quad - \frac{1}{2}\big[W_k^\top \Lambda_W W_k - 2 W_k^\top \Lambda_W \mu_W + \mu_W^\top \Lambda_W \mu_W\big].
\end{aligned}
\tag{E.1}
\]
This expression is quadratic in $W_k$. Identifying the terms linear and quadratic in $W_k$ allows the distribution to be determined:
\[
\begin{aligned}
\text{Quadratic:}&\quad \Lambda_W + \Lambda_U \sum_{i=1}^{N} \|f_i\|^{-2} f_{k,i}^2 \\
\text{Linear:}&\quad \Lambda_U \sum_{i=1}^{N} \|f_i\|^{-1} f_{k,i}\Big(U_i - \mu_U - \|f_i\|^{-1}\sum_{j\neq k} W_j f_{j,i}\Big) + \Lambda_W \mu_W.
\end{aligned}
\tag{E.2}
\]
Therefore, the sampling distribution of $W_k$ is Gaussian, with parameters
\[
\begin{aligned}
\Lambda_{W_k} &= \Lambda_W + \Lambda_U \sum_{i=1}^{N} \|f_i\|^{-2} f_{k,i}^2 \\
\mu_{W_k} &= \Lambda_{W_k}^{-1}\Big[\Lambda_U \sum_{i=1}^{N} \|f_i\|^{-1} f_{k,i}\Big(U_i - \mu_U - \|f_i\|^{-1}\sum_{j\neq k} W_j f_{j,i}\Big) + \Lambda_W \mu_W\Big].
\end{aligned}
\tag{E.3}
\]
E.2 User Features
Under the model, the terms involved in the sampling distribution for $U_i$ are the prior for $U_i$ and the likelihood of all ratings $r_{i,j}$ for user $i$,
\[
\begin{aligned}
\log p(U_i \mid \cdots)
&= \log p(U_i \mid \mu_U, \Lambda_U, W_{1:K}) + \sum_{j} I_{i,j} \log p(r_{i,j} \mid \cdots) \\
&= -\frac{1}{2}\Big(U_i - \mu_U - \|f_i\|^{-1}\sum_{k} W_k f_{k,i}\Big)^{\!\top}\Lambda_U\Big(U_i - \mu_U - \|f_i\|^{-1}\sum_{k} W_k f_{k,i}\Big)
- \frac{\tau}{2}\sum_{j} I_{i,j}(r_{i,j} - U_i^\top V_j)^2 + \text{const} \\
&= -\frac{1}{2}\Big[U_i^\top \Lambda_U U_i - 2 U_i^\top \Lambda_U\Big(\mu_U + \|f_i\|^{-1}\sum_{k} W_k f_{k,i}\Big)
+ \Big(\mu_U + \|f_i\|^{-1}\sum_{k} W_k f_{k,i}\Big)^{\!\top}\Lambda_U\Big(\mu_U + \|f_i\|^{-1}\sum_{k} W_k f_{k,i}\Big)\Big] \\
&\quad - \frac{\tau}{2}\sum_{j} I_{i,j}\big[r_{i,j}^2 - 2 r_{i,j}\, U_i^\top V_j + U_i^\top V_j V_j^\top U_i\big] + \text{const}.
\end{aligned}
\tag{E.4}
\]
This expression is quadratic in $U_i$. Identifying the terms linear and quadratic in $U_i$ allows the distribution to be determined:
\[
\begin{aligned}
\text{Quadratic:}&\quad \Lambda_U + \tau\sum_{j} I_{i,j} V_j V_j^\top \\
\text{Linear:}&\quad \tau\sum_{j} I_{i,j}\, r_{i,j} V_j + \Lambda_U\Big(\mu_U + \|f_i\|^{-1}\sum_{k} W_k f_{k,i}\Big).
\end{aligned}
\tag{E.5}
\]
Therefore, the sampling distribution of $U_i$ is Gaussian, with parameters
\[
\begin{aligned}
\Lambda_{U_i} &= \Lambda_U + \tau\sum_{j} I_{i,j} V_j V_j^\top \\
\mu_{U_i} &= \Lambda_{U_i}^{-1}\Big[\tau\sum_{j} I_{i,j}\, r_{i,j} V_j + \Lambda_U\Big(\mu_U + \|f_i\|^{-1}\sum_{k} W_k f_{k,i}\Big)\Big].
\end{aligned}
\tag{E.6}
\]
Note that the precision of the user feature is unaffected by the inclusion of the side features.
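As with the earlier conditionals, Equation (E.6) can be checked numerically: the posterior mean must be the stationary point of the log conditional. A Python/NumPy sketch with randomly generated, illustrative inputs:

```python
import numpy as np

rng = np.random.default_rng(6)
d, M, K = 3, 5, 4
tau = 1.2
I = rng.integers(0, 2, size=M).astype(float); I[0] = 1.0
V = rng.standard_normal((M, d))
r = rng.standard_normal(M)
mu_u = rng.standard_normal(d)
Lam_u = 2.0 * np.eye(d)
W = rng.standard_normal((K, d))   # meta-feature vectors
f = rng.random(K)                  # user i's meta-feature weights f_{k,i}
prior_mean = mu_u + W.T @ f / np.linalg.norm(f)   # mu_U + ||f_i||^{-1} sum_k W_k f_{k,i}

# Sampling-distribution parameters from Equation (E.6).
Lam_post = Lam_u + tau * np.einsum('j,jp,jq->pq', I, V, V)
mu_post = np.linalg.solve(Lam_post, tau * (I * r) @ V + Lam_u @ prior_mean)

# mu_post must be the mode of the log conditional density.
def log_post(u):
    return (-0.5 * tau * np.sum(I * (r - V @ u)**2)
            - 0.5 * (u - prior_mean) @ Lam_u @ (u - prior_mean))

eps = 1e-5
grad = np.array([(log_post(mu_post + eps*e) - log_post(mu_post - eps*e)) / (2*eps)
                 for e in np.eye(d)])
assert np.allclose(grad, 0.0, atol=1e-6)
```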
Appendix F
User Features in Matrix
Factorization with User Networks
Vanilla Bayesian PMF models the ratings as Gaussian conditional on user and item features. Hierarchically, these features are given Gaussian-Wishart priors:
\[
\begin{aligned}
(r_{i,j} \mid U_i, V_j, \tau) &\sim \mathcal{N}(r_{i,j} \mid \gamma_i + \eta_j + U_i^\top V_j,\ \tau) \\
(U_i \mid \mu_U, \Lambda_U) &\sim \mathcal{N}(U_i \mid \mu_U, \Lambda_U) \\
(V_j \mid \mu_V, \Lambda_V) &\sim \mathcal{N}(V_j \mid \mu_V, \Lambda_V) \\
(\mu_U, \Lambda_U) &\sim \mathcal{N}(\mu_U \mid \mu_0, \beta_0\Lambda_U)\,\mathcal{W}(\Lambda_U \mid \nu_0, W_0) \\
(\mu_V, \Lambda_V) &\sim \mathcal{N}(\mu_V \mid \mu_0, \beta_0\Lambda_V)\,\mathcal{W}(\Lambda_V \mid \nu_0, W_0).
\end{aligned}
\tag{F.1}
\]
When user networks are present, we have considered a modified version of the above:
\[
\begin{aligned}
(r_{i,j} \mid S_i, V_j, \tau) &\sim \mathcal{N}(r_{i,j} \mid S_i^\top V_j,\ \tau) \\
(U_i \mid \mu_U, \Lambda_U) &\sim \mathcal{N}(U_i \mid \mu_U, \Lambda_U) \\
(S_i \mid \mu_S, \Lambda_S) &\sim \mathcal{N}\Big(S_i \ \Big|\ \mu_S + U_i + \|a_i\|^{-1}\sum_{j\neq i} U_j a_{i,j},\ \Lambda_S\Big) \\
(V_j \mid \mu_V, \Lambda_V) &\sim \mathcal{N}(V_j \mid \mu_V, \Lambda_V) \\
(\mu_U, \Lambda_U) &\sim \mathcal{N}(\mu_U \mid \mu_0, \beta_0\Lambda_U)\,\mathcal{W}(\Lambda_U \mid \nu_0, W_0) \\
(\mu_S, \Lambda_S) &\sim \mathcal{N}(\mu_S \mid \mu_0, \beta_0\Lambda_S)\,\mathcal{W}(\Lambda_S \mid \nu_0, W_0) \\
(\mu_V, \Lambda_V) &\sim \mathcal{N}(\mu_V \mid \mu_0, \beta_0\Lambda_V)\,\mathcal{W}(\Lambda_V \mid \nu_0, W_0),
\end{aligned}
\tag{F.2}
\]
where $a_{i,j} \in \{0, 1\}$ indicates the presence of an edge between users $i$ and $j$. In this appendix, we derive the posterior sampling distributions for $U_i$ and $S_i$, conditional on the rating data and the user network.
F.1 Sampling Distribution for Ui
The conditional posterior distribution for $U_i$ involves three sets of terms: the prior for $U_i$, the prior for $S_i$, and the priors for $S_k$ for all users $k$ that are connected to $i$. Since each of these terms is Gaussian, the conditional posterior is Gaussian. In what follows, we expand each term individually, recognize the linear and quadratic factors, and finally combine the three sets to obtain the required distribution.
F.1.1 Prior for Ui
The log-prior for $U_i$ contributes
\[
\log p(U_i) \propto -\frac{1}{2}(U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U)
= -\frac{1}{2}\big[U_i^\top \Lambda_U U_i - 2 U_i^\top \Lambda_U \mu_U + \mu_U^\top \Lambda_U \mu_U\big],
\tag{F.3}
\]
giving the linear and quadratic terms
\[
\text{Linear: } \Lambda_U \mu_U, \qquad \text{Quadratic: } \Lambda_U.
\tag{F.4}
\]
F.1.2 Prior for Sk
The log-priors for the $S_k$ contribute
\[
\log p(S_i) + \sum_{k\neq i} \log p(S_k).
\tag{F.5}
\]
It is necessary to distinguish the cases $k = i$ and $k \neq i$, as the $U_i$ term appears differently in each.

For $k = i$,
\[
\begin{aligned}
\log p(S_i) &\propto -\frac{1}{2}\Big(S_i - \mu_S - U_i - \|a_i\|^{-1}\sum_{j} U_j a_{i,j}\Big)^{\!\top}\Lambda_S\Big(S_i - \mu_S - U_i - \|a_i\|^{-1}\sum_{j} U_j a_{i,j}\Big) \\
&= -\frac{1}{2}\Big[\Big(S_i - \mu_S - \|a_i\|^{-1}\sum_{j} U_j a_{i,j}\Big)^{\!\top}\Lambda_S\Big(S_i - \mu_S - \|a_i\|^{-1}\sum_{j} U_j a_{i,j}\Big) \\
&\qquad\quad - 2 U_i^\top \Lambda_S\Big(S_i - \mu_S - \|a_i\|^{-1}\sum_{j} U_j a_{i,j}\Big) + U_i^\top \Lambda_S U_i\Big].
\end{aligned}
\tag{F.6}
\]
This contributes the linear and quadratic terms
\[
\text{Linear: } \Lambda_S\Big(S_i - \mu_S - \|a_i\|^{-1}\sum_{j} U_j a_{i,j}\Big), \qquad \text{Quadratic: } \Lambda_S.
\tag{F.7}
\]
For $k \neq i$,
\[
\begin{aligned}
\log p(S_k) &\propto -\frac{1}{2}\Big(S_k - \mu_S - U_k - \|a_k\|^{-1}\sum_{j} U_j a_{k,j}\Big)^{\!\top}\Lambda_S\Big(S_k - \mu_S - U_k - \|a_k\|^{-1}\sum_{j} U_j a_{k,j}\Big) \\
&= -\frac{1}{2}\Big[\Big(S_k - \mu_S - U_k - \|a_k\|^{-1}\sum_{j\neq i} U_j a_{k,j}\Big)^{\!\top}\Lambda_S\Big(S_k - \mu_S - U_k - \|a_k\|^{-1}\sum_{j\neq i} U_j a_{k,j}\Big) \\
&\qquad\quad - 2\|a_k\|^{-1} a_{k,i}\, U_i^\top \Lambda_S\Big(S_k - \mu_S - U_k - \|a_k\|^{-1}\sum_{j\neq i} U_j a_{k,j}\Big)
+ \|a_k\|^{-2} a_{k,i}^2\, U_i^\top \Lambda_S U_i\Big].
\end{aligned}
\tag{F.8}
\]
This contributes the linear and quadratic terms
\[
\text{Linear: } \Lambda_S \sum_{k\neq i} \|a_k\|^{-1} a_{k,i}\Big(S_k - \mu_S - U_k - \|a_k\|^{-1}\sum_{j\neq i} U_j a_{k,j}\Big), \qquad
\text{Quadratic: } \Lambda_S \sum_{k\neq i} \|a_k\|^{-2} a_{k,i}^2.
\tag{F.9}
\]
The posterior sampling distribution for $U_i$ is therefore Gaussian, with precision equal to the sum of the three quadratic terms in Equations (F.4), (F.7), and (F.9), and mean equal to the inverse of that precision multiplied by the sum of the three linear terms in Equations (F.4), (F.7), and (F.9).
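The assembly of the three contributions above can be sketched in code. The following Python/NumPy example is illustrative only: it generates a random network, takes $\|a_k\|$ to be the number of neighbours of user $k$ (an assumption about the normalization), and checks basic properties of the resulting Gaussian parameters:

```python
import numpy as np

rng = np.random.default_rng(7)
d, N = 3, 6
Lam_U = 2.0 * np.eye(d)
Lam_S = 1.5 * np.eye(d)
mu_U = rng.standard_normal(d)
mu_S = rng.standard_normal(d)
U = rng.standard_normal((N, d))      # current user features
S = rng.standard_normal((N, d))      # current social features
A = (rng.random((N, N)) < 0.5).astype(float)
np.fill_diagonal(A, 0.0)
A = np.maximum(A, A.T)               # symmetric friendship network, a_{i,j} in {0,1}
A[0, 1] = A[1, 0] = 1.0              # ensure user 0 has at least one friend
i = 0
deg = A.sum(axis=1)                  # ||a_k||, taken here as the neighbour count

# Contribution (F.4): prior on U_i.
lin = Lam_U @ mu_U
quad = Lam_U.copy()

# Contribution (F.7): prior on S_i (the k = i case).
lin += Lam_S @ (S[i] - mu_S - (A[i] @ U) / deg[i])
quad += Lam_S

# Contribution (F.9): priors on S_k for neighbours k != i.
for k in range(N):
    if k == i or A[k, i] == 0.0:
        continue
    others = A[k].copy(); others[i] = 0.0
    resid = S[k] - mu_S - U[k] - (others @ U) / deg[k]
    lin += (A[k, i] / deg[k]) * (Lam_S @ resid)
    quad += (A[k, i]**2 / deg[k]**2) * Lam_S

# The conditional for U_i is Gaussian with precision `quad` and mean `quad^{-1} lin`.
mu_post = np.linalg.solve(quad, lin)
assert np.allclose(quad, quad.T)
assert np.all(np.linalg.eigvalsh(quad) > 0)   # the precision stays positive definite
assert mu_post.shape == (d,)
```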
F.1.3 Sampling Distribution for Si
The sampling distribution for $S_i$ follows from the standard Gaussian matrix factorization model with only user and item features $U_i, V_j$, with $S_i$ in place of $U_i$.
Bibliography
[1] R.P. Adams, G.E. Dahl, and I. Murray. Incorporating Side Information in Probabilistic Matrix
Factorization with Gaussian Processes. Arxiv preprint arXiv:1003.4944, 2010.
[2] S. Baluja, R. Seth, D. Sivakumar, Y. Jing, J. Yagnik, S. Kumar, D. Ravichandran, and M. Aly.
Video suggestion and discovery for youtube: taking random walks through the view graph. In
Proceeding of the 17th international conference on World Wide Web, pages 895–904, 2008.
[3] P. Damien and S.G. Walker. Sampling truncated normal, beta, and gamma densities. Journal of
Computational and Graphical Statistics, 10(2):206–215, 2001.
[4] K. Drakakis, S. Rickard, R. de Frin, and A. Cichocki. Analysis of financial data using non-negative
matrix factorization. In International Mathematical Forum, volume 3, pages 1853–1870, 2008.
[5] Rana Forsati, Mehrdad Mahdavi, Mehrnoush Shamsfard, and Mohamed Sarwat. Matrix factoriza-
tion with explicit trust and distrust relationships. arXiv preprint arXiv:1408.0325, 2014.
[6] Chris Fraley and Adrian E. Raftery. Bayesian Regularization for Normal Mixture Estimation and
Model-Based Clustering. Technical Report 486, Department of Statistics, 2005.
[7] Sheetal Girase, Debajyoti Mukhopadhyay, et al. Role of Matrix Factorization Model in Collaborative
Filtering Algorithm: A Survey. arXiv preprint arXiv:1503.07475, 2015.
[8] W.T. Glaser, T.B. Westergren, J.P. Stearns, and J.M. Kraft. Consumer item matching method and
system, February 21 2006. US Patent 7,003,515.
[9] Jennifer Golbeck. FilmTrust: Movie Recommendations from Semantic Web-based Social Networks.
In ISWC2005 Posters & Demonstrations, pages PID–72, 2005. Printed proceedings only.
[10] Prem Gopalan, Jake M Hofman, and David M Blei. Scalable recommendation with poisson factor-
ization. arXiv preprint arXiv:1311.1704, 2013.
[11] Prem Gopalan, Francisco J Ruiz, Rajesh Ranganath, and David M Blei. Bayesian Nonparametric
Poisson Factorization for Recommendation Systems. In AISTATS, pages 275–283, 2014.
[12] Asela Gunawardana and Christopher Meek. A unified approach to building hybrid recommender
systems. In Proceedings of the third ACM conference on Recommender systems, RecSys ’09, pages
117–124, New York, NY, USA, 2009. ACM.
[13] Guibing Guo, Jie Zhang, and Neil Yorke-Smith. A Novel Bayesian Similarity Measure for Recom-
mender Systems. In IJCAI, 2013.
[14] F Maxwell Harper and Joseph A Konstan. The MovieLens Datasets: History and Context. ACM
Transactions on Interactive Intelligent Systems (TiiS), 5(4):1–9, 2015.
[15] SeyedMohsen Jamali. Probabilistic Models for Recommendation in Social Networks. PhD thesis,
Applied Sciences: School of Computing Science, Simon Fraser University, 2013.
[16] Michal Kosinski, David Stillwell, and Thore Graepel. Private traits and attributes are pre-
dictable from digital records of human behavior. Proceedings of the National Academy of Sciences,
110(15):5802–5805, 2013.
[17] B. Lakshminarayanan, G. Bouchard, and C. Archambeau. Robust Bayesian Matrix Factorisation.
Journal of Machine Learning Research, 15, 2011.
[18] Joonseok Lee, Mingxuan Sun, and Guy Lebanon. A comparative study of collaborative filtering
algorithms. arXiv preprint arXiv:1205.3193, 2012.
[19] Yew Jin Lim and Yee Whye Teh. Variational Bayesian approach to movie rating prediction. In
Proceedings of KDD cup and workshop, volume 7, pages 15–21, 2007.
[20] Hao Ma, Irwin King, and Michael R Lyu. Learning to recommend with social trust ensemble.
In Proceedings of the 32nd international ACM SIGIR conference on Research and development in
information retrieval, pages 203–210. ACM, 2009.
[21] Hao Ma, Haixuan Yang, Michael R Lyu, and Irwin King. Sorec: social recommendation using
probabilistic matrix factorization. In Proceedings of the 17th ACM conference on Information and
knowledge management, pages 931–940. ACM, 2008.
[22] Benjamin M Marlin. Modeling User Rating Profiles For Collaborative Filtering. In NIPS, pages
627–634, 2003.
[23] B.M. Marlin and R.S. Zemel. Collaborative prediction and ranking with non-random missing data.
In Proceedings of the third ACM conference on Recommender systems, page 512. ACM, 2009.
[24] B.M. Marlin, R.S. Zemel, S. Roweis, and M. Slaney. Collaborative filtering and the missing at
random assumption. In Uncertainty in Artificial Intelligence: Proceedings of the 23rd Conference,
volume 47, pages 50–54, 2007.
[25] Paolo Massa and Paolo Avesani. Trust-aware recommender systems. In RecSys ’07: Proceedings
of the 2007 ACM conference on Recommender systems, pages 17–24, New York, NY, USA, 2007.
ACM.
[26] Paolo Massa, Kasper Souren, Martino Salvetti, and Danilo Tomasoni. Trustlet, open research on
trust metrics. Scalable Computing: Practice and Experience, 9(4), 2008.
[27] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
[28] Saralees Nadarajah and Samuel Kotz. R Programs for Computing Truncated Distributions. Journal
of Statistical Software, 16(Code Snippet 2), 2006.
[29] Ulrich Paquet, Blaise Thomson, and Ole Winther. A hierarchical model for ordinal matrix factor-
ization. Statistics and Computing, 22(4):945–957, 2012.
[30] I. Porteous, A. Asuncion, and M. Welling. Bayesian matrix factorization with side information and
Dirichlet process mixtures. In AAAI Conference on Artificial Intelligence, pages 563–568, 2010.
[31] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov
chain Monte Carlo. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors, ICML,
volume 307 of ACM International Conference Proceeding Series, pages 880–887. ACM, 2008.
[32] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic Matrix Factorization. In Advances in Neural
Information Processing Systems, volume 20, 2008.
[33] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. Restricted Boltzmann machines for
collaborative filtering. In ICML ’07: Proceedings of the 24th international conference on Machine
learning, pages 791–798, New York, NY, USA, 2007. ACM.
[34] United States Securities and Exchange Commission. Registration No. 333-179287, 2012.
[35] Xiaoyuan Su and Taghi M Khoshgoftaar. A survey of collaborative filtering techniques. Advances
in Artificial Intelligence, 2009.
[36] Jiliang Tang, Huiji Gao, and Huan Liu. mTrust: discerning multi-faceted trust in a connected
world. In Proceedings of the fifth ACM international conference on Web search and data mining,
pages 93–102. ACM, 2012.
[37] Jiliang Tang, Huiji Gao, Huan Liu, and Atish Das Sarma. eTrust: Understanding trust evolution in
an online world. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 253–261. ACM, 2012.
[38] Truyen Tran, Dinh Q. Phung, and Svetha Venkatesh. Thurstonian Boltzmann Machines: Learning
from Multiple Inequalities. In ICML (2), volume 28 of JMLR Proceedings, pages 46–54. JMLR.org,
2013.
[39] Martin J. Wainwright and Michael I. Jordan. Graphical Models, Exponential Families, and Varia-
tional Inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
[40] Jason Weston, Chong Wang, Ron J. Weiss, and Adam Berenzweig. Latent Collaborative Retrieval.
In ICML. icml.cc / Omnipress, 2012.
[41] YouTube. YouTube Statistics.
[42] Yisong Yue and Carlos Guestrin. Linear submodular bandits and their application to diversified
retrieval. In Advances in Neural Information Processing Systems, pages 2483–2491, 2011.