Augmenting Probabilistic Matrix Factorization Models for Rare Users
by
Cody Severinski
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Graduate Department of Statistical Sciences
University of Toronto
© Copyright 2016 by Cody Severinski
Abstract
Augmenting Probabilistic Matrix Factorization Models for Rare Users
Cody Severinski
Doctor of Philosophy
Graduate Department of Statistical Sciences
University of Toronto
2016
Recommender systems are used for user preference prediction in a variety of contexts. Most commonly
known for movie suggestion from the Netflix competition, these systems have evolved to cover generic
product recommendation, friend suggestion, and even online dating. Matrix factorization models are
commonly employed for several reasons: they scale well, are easily learned, and can be adapted to
different contexts.
Many extensions of the baseline Probabilistic Matrix Factorization model have been proposed in the
literature and, as expected, all report better test results than the baseline. We review
several of these extensions, notably: constraints based on similar rating patterns among users, allowing
for nonconstant variance / precision in the model, introducing personal information on the users as
constraints, and including user networks in the model. These models are extended to the Bayesian
framework where necessary. We illustrate how these extensions perform overall, and for sets of users
defined by different numbers of ratings at training time. In particular, we highlight the benefit of many
of these extensions for infrequent users (those with few or no ratings in the system). This is particularly
important as these users are the most common in the recommendation framework.
In the case of user networks, we additionally study the robustness of the model in the presence of
random links. This reflects the true state of user networks in applications such as Facebook, where social
ties may not convey similar taste in preferences.
In addition, we provide the first direct comparison of the performance of the models learned from
Gibbs sampling and variational inference. Limitations of the variational algorithm are outlined for
multiple models, with proposals given for alleviating overfitting.
Acknowledgements
There are many people I wish to thank, as they have all been crucial to my success:
• My supervisor, Professor Ruslan Salakhutdinov. Your support, guidance, and encouragement have
been critical to this. You have offered support in all of modeling, theory, inference, and coding /
computation issues. I am grateful for all of this;
• My committee members: Professor Nancy Reid and Professor Jeff Rosenthal. You both have been
instrumental in guiding me along the way, reviewing work, and offering guidance for the next steps;
• My external: Professor Mu Zhu from the University of Waterloo. You raised some very important
questions in your report, which allowed me to make higher level connections between various
aspects of the thesis work;
• The Chair of the department: Professor James Stafford. You encouraged and supported me with
my first teaching role at the University of Toronto (Statistics 261), assisted with the organization
and planning of campus wide Research Day events, and took a sincere interest in my work and
development during the degree;
• My family, and in particular my parents: Leanne and Gary Severinski. If there are two people in
my life that deserve anything, it is them. Thank you Mom and Dad, for always being a phone call,
train, flight, or transit ride away;
• The Graduate Administrator in my department: Andrea Carter. You have also been key to the
completion of my degree, and to several other aspects of my time at the University of Toronto. You
have ensured that paperwork was completed on time, assisted with logistical / teaching matters
when I served as a course instructor, and coordinated everything with multiple deadlines. Most
importantly, you were always there to talk, and also to listen;
• Two other administrative members in the department: Christine Bulguryemez and Angela Fleury.
During my PhD, I helped organize campus wide Research Day events. You both were there, behind
the scenes, making sure the speaker’s flights, accommodations, and event logistics were arranged
for;
• My personal friend, Professor of Mathematics at UBC, and President of the UBC Faculty Associa-
tion: Professor Mark Mac Lean. You have been supportive, both academically and personally. You
have consistently provided an unbiased estimate of my work and development during my degree.
As a Statistician, I am grateful for your consistent and unbiased estimates;
• A faculty at the University of Toronto: Professor Alison Gibbs. You have been my mentor over
the years in developing my teaching abilities. In many ways, you oversaw the development of my
teaching while my supervisor oversaw the development of my research. Your name as a teaching
reference has been crucial with having secured sessional offers from multiple other universities;
• Fellow PhD: Alex Shestopaloff. You have provided external critiques and reviews of my work over
time, and you were there to help me keep everything in perspective.
• Friends who supported me in one way or another: Eric Peng, Uyen Hoang, Sian Hoe Cheong,
Erwin Alexander Ketterer, Anton Babadjanov, Suzanne Wasmund, Patrick Halina, Theri Kay,
Stefan Attig, Juveria Ghare, and Zita Baryonyx Poon;
• NSERC, for providing funding through the NSERC CGS program.
Contents

1 Introduction
  1.1 Framework
  1.2 Hierarchy of Models
    1.2.1 Content Based Systems
    1.2.2 Collaborative Filtering
    1.2.3 Evaluation Metrics
  1.3 Connection to Thesis Work
  1.4 Data Sets Used in this Thesis

2 Matrix Factorization
  2.1 Probabilistic Matrix Factorization
  2.2 Bayesian Probabilistic Matrix Factorization
  2.3 Constrained Bayesian PMF
  2.4 Inference
    2.4.1 Gibbs Sampling
    2.4.2 Variational Inference
  2.5 Predictions
    2.5.1 Gibbs Sampling
    2.5.2 Variational Inference
  2.6 Experimental Setup
  2.7 Experimental Results
    2.7.1 Variational Inference
    2.7.2 Gibbs Sampling
    2.7.3 Side Features
  2.8 Conclusion

3 Precision Models for Matrix Factorization
  3.1 Existing Noise Models for BPMF
  3.2 Truncated Precisions
  3.3 Inference
    3.3.1 Gibbs Sampling
    3.3.2 Variational Inference
  3.4 Prediction
  3.5 Experimental Setup
  3.6 Experimental Results
    3.6.1 Variational Inference
    3.6.2 Gibbs Sampling
    3.6.3 Truncated Precisions
    3.6.4 Overfitting in the Robust Model
    3.6.5 Side Features with Precisions
  3.7 Conclusion

4 Meta-Constrained Latent User Features
  4.1 Introduction
  4.2 Exploratory Analysis
  4.3 Model
  4.4 Inference
  4.5 Experimental Setup
  4.6 User Meta Information
    4.6.1 PCA for User Meta Information
  4.7 MAP Estimate
  4.8 Experimental Results
  4.9 Conclusion

5 A Generative Model for User Network Constraints in Matrix Factorization
  5.1 Introduction
  5.2 Previous Work
    5.2.1 Pseudo-Generative Extension
    5.2.2 A Special Case
  5.3 Proposed Model
  5.4 Inference
  5.5 Experimental Setup
  5.6 Experimental Results
    5.6.1 Pathological Network Behaviour
    5.6.2 Shift Model Performance
    5.6.3 Fake Networks
  5.7 Conclusion

6 Conclusion

Appendices

A Ancillary Results and Derivations
  A.1 Squared Error Term
    A.1.1 Quadratic with Respect to User Features
    A.1.2 Quadratic with Respect to Item Features
    A.1.3 Quadratic with Respect to Side Features
  A.2 Expectation of Certain Forms
    A.2.1 Expectation of Quadratic Forms
    A.2.2 User Quadratic Form
    A.2.3 Gamma Random Variable Expectation
    A.2.4 Wishart Random Variable Expectation

B Constrained PMF
  B.0.5 Conditional Posterior for Side Feature
  B.0.6 Conditional Posterior for User Feature

C Distributional Form of the Variational Approximation
  C.1 User Feature Vectors
  C.2 User Offset
  C.3 Item Feature Vectors
  C.4 Side Feature Vectors
  C.5 Precisions
  C.6 User / Item / Side Feature Hyperparameters

D Derivation of the Variational Lower Bound
  D.1 Complete Log-Likelihood
    D.1.1 Rating
    D.1.2 User Features
    D.1.3 User Precision
    D.1.4 User Bias
    D.1.5 User Hyperparameters
  D.2 Entropy
    D.2.1 Feature Vectors
    D.2.2 Precision Terms
    D.2.3 User Bias

E Meta Constrained PMF
  E.1 Meta Features
  E.2 User Features

F User Features in Matrix Factorization with User Networks
  F.1 Sampling Distribution for U_i
    F.1.1 Prior for U_i
    F.1.2 Prior for S_k
    F.1.3 Sampling Distribution for S_i

Bibliography
Chapter 1
Introduction
1.1 Framework
When confronted with a limited selection of options, humans are typically able to quickly review and
select a preferred option. They may even be able to rank such options. The simplest such decision is
a binary one: for instance, choosing whether to watch a movie in theatre one or in theatre two.
Current technology makes it easy to categorize, store, and present to a user a wealth of possible
choices. There are over 42 million Facebook pages [34] and hundreds of millions of hours of video on
YouTube [41, 27]. This presents new challenges. The most apparent is that the number of options to present to a user
is large. It is not possible for a user to review a representative list of all options. A naive solution would
be to present a subset of the most universally liked options. The presentation of these globally popular
items removes the personalized aspect of the recommendation, while also creating a self-reinforcement
of the popularity of these items.
The goal of a Recommender System is to remove the need for manual discovery of preferred items by
predicting the individual tastes of a user for new and unknown items. A good recommender system will
yield personalized recommendations with high precision. Personalized recommendations are unique to
the user’s interest, and not necessarily globally popular items. Given the large corpus of items typically
available, a good recommender system must also scale to large sets of users and items, and must also be
computationally efficient.
1.2 Hierarchy of Models
A main division in the design of recommender systems is the use of user and item level features. Systems
that extract features from the users and items are content-based recommender systems. Systems that
do not extract features from the users and items are collaborative based recommender systems.
1.2.1 Content Based Systems
Content based recommender systems share characteristics with many supervised learning models: given
a set of features (in this case, features on users and items), extract a collection of features to predict
new preferences for users. These features are also referred to as “domain knowledge”, reflecting the fact
that the features can change from domain to domain. The features used for music recommendation, for
instance, will be different from the features used for movie recommendation.
To give a concrete example, the Music Genome Project, patented by Pandora Media [8], is a
recommendation system for music. Songs are characterized by approximately 150 features. Typical
features include the "gender of lead vocalist, level of distortion on the electric guitar, type of background
vocals, etc." With these features, similarities between songs can be computed, and recommendations
can be made to users based on listening history.
Contrast this to the case of content based recommendation of movies. Typical systems will use
features such as genre, lead actor, director, and other character-based features for recommendation.
This reflects the fact that viewers tend to follow particular genres, actors, and directors. As mentioned,
these features are different from the set of features for music. This difference in domain knowledge can
make cross-domain recommendation difficult in the context of content based recommendation. Different
features, and likely different recommendation systems, will be needed.
1.2.2 Collaborative Filtering
Collaborative filtering methods contrast with content based recommendation systems in that they do
not use domain knowledge of the users and the items. Collaborative filtering methods rely on a user-item
rating matrix. The ratings may be given explicitly by the user or obtained implicitly through indirect
means. Examples of explicitly given ratings would be:
• Movie ratings on Netflix or the Internet Movie Database;
• Video ratings on YouTube;
• Liking a Facebook page;
• Rating a business on Yelp;
While examples of implicitly given ratings are:
• Satisfaction with a streamed movie, inferred from whether the time the user viewed the video exceeds a threshold;
• Interest in an email campaign for products based on email click-through-rate (CTR);
• Repeated and related search queries for a topic;
The user-item rating matrix is a sparse matrix. The matrix is sparse since most users typically rate
only O(1) or O(10) items. For instance, the median number of ratings for a typical user in the data set
from the Netflix competition is 96 (average is 209), while the median number of likes for a Facebook
page from a prior study was 68 (average is 170) [16].
However, the user-item rating matrix is not uniformly sparse. Some users rate far more often than
others. A movie critic, for instance, can be expected to rate many more movies than the average
user. The top 10% of users in the Netflix data set rate over 500 movies. Similarly, some Facebook pages
will have many more likes than others. One illustration of this imbalance is the Hillary Clinton page,
which has approximately 1.6 million likes as of October 2015.
The user-item rating matrix is also partially observed. The feedback received from user i for item j
marks the (i, j) entry of the rating matrix as observed. These entries with explicit feedback form the
observed set of ratings:
\Omega = \{(i, j) \mid R_{i,j} \neq 0\}.
The complement of this set, the cells with zero entries, is properly modeled as missing values in
most contexts. These values are not missing at random, and this has been shown to generate a bias in the
observed ratings [23, 24]. The act of providing explicit feedback means that user has an interest in the
item, and this interest is generally more favourable than what would have been observed under random
sampling. Similarly, the exclusion of a rating does not mean the user has no interest in the item. In
many applications, a zero (i, j) entry simply means that user i is not aware of item j. The closely
related problem of recommending a set of items that is both of interest to the user and diverse in
coverage has also been studied [42].
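As a sketch, the observed set Ω can be extracted directly from a rating matrix stored with the zero-for-missing convention used above (the toy matrix below is purely illustrative):

```python
import numpy as np

# Toy 3-user x 4-item rating matrix; zeros denote unobserved entries.
R = np.array([
    [5, 0, 3, 0],
    [0, 4, 0, 0],
    [1, 0, 0, 2],
])

# Omega = {(i, j) : R[i, j] != 0}, the set of observed (user, item) pairs.
users, items = np.nonzero(R)
omega = list(zip(users.tolist(), items.tolist()))
print(omega)  # [(0, 0), (0, 2), (1, 1), (2, 0), (2, 3)]
```

In practice the rating matrix is far too sparse to store densely, and the same set of pairs would come from the coordinate lists of a sparse matrix format.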
Memory Based CF Systems
Neighbourhood methods are a very common memory based CF system, and are also some of the most
intuitive and well-known [18, 35]. A distance metric is defined for two users (alternatively, two items),
corresponding to two rows (alternatively, two columns) of the user-item rating matrix. When the distance
is defined between two users, they are known as user-user similarity systems. When the distance is defined
between two items, they are known as item-item similarity systems.
To make recommendations in a user-user system, the distance from the target user to all other users
is computed. Other users can be ranked based on distance, and the items these users rated are aggregated.
The items already considered by the target user are removed from the list, and the remaining items can
be ranked for recommendation purposes.
To make recommendations in an item-item system, the distance from items rated by the target user
is computed for other items. The unobserved items can be ranked based on distance to the observed
items, and presented to the user.
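The user-user procedure just described can be sketched as follows. This is a minimal illustration, not a production method: cosine similarity stands in for the generic distance metric, and the toy matrix and function name are hypothetical.

```python
import numpy as np

def recommend_user_user(R, target, top_k=2):
    """Rank unrated items for `target` by aggregating ratings of similar users.

    R: dense (users x items) rating matrix, 0 = unrated.
    Cosine similarity plays the role of the distance metric.
    """
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    X = R / np.maximum(norms, 1e-12)      # row-normalize each user's ratings
    sims = X @ X[target]                  # cosine similarity of every user to target
    sims[target] = 0.0                    # exclude the target itself
    scores = sims @ R                     # similarity-weighted aggregate of ratings
    scores[R[target] > 0] = -np.inf       # remove items the target already rated
    return np.argsort(-scores)[:top_k]    # highest-scoring unrated items first

R = np.array([
    [5, 4, 0, 0],
    [5, 4, 1, 0],
    [0, 0, 4, 5],
], dtype=float)

print(recommend_user_user(R, target=0))  # user 1 is most similar to user 0
```

User 0's recommendations are dominated by user 1, who shares both of user 0's ratings, rather than by user 2, who shares none.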
Any two given users typically have few ratings in common, since each user rates only a fraction of a
percent of the items on average. This makes it difficult to compute a reliable distance between two users.
Item-item based methods are also prone to this sparsity problem.
Memory-based systems include graph based recommendation algorithms. Common instances are
bipartite graphs of users and items, with directed edges from users to items if a user expressed a rating
for an item. Industry research has been published outlining the use of these methods for YouTube [2].
Memory-based CF systems are fast to train [35]. In particular, user-user and item-item similarity
methods can easily be distributed, allowing the model to expand and maintain fast computation time
by adding more processors. In addition, these memory-based systems are nonparametric, avoiding the
iterative inference typically required to fit a parametric model. However, the simplicity of these systems
tends to lead to larger errors than the more complex approaches discussed below.
Model Based CF Systems
One approach to CF systems is to assume a parametric model that can be learned using historical
data and used to predict future data. Model based systems include regression methods, belief nets,
Boltzmann-machines, LDA-style models, and factor analysis / matrix factorization models. These are
probabilistic models, often placed in a Bayesian framework. In this section, these models are briefly
reviewed. Being the focus of this work, a more detailed discussion of matrix factorization models is in
Chapter 2.
To make recommendations, model based CF systems are first trained to learn model parameters.
Once trained, the learned set of parameters can be used to make predictions of new ratings. Training
for these models is often relatively slow compared to memory based CF systems. In many cases, the
objective cannot be directly solved and iterative approximate inference methods must be used. Two
common methods in the literature are variational methods [22, 19] and Markov Chain Monte Carlo
[31, 10].
Gaussian Matrix Factorization
A common approach is to assume a low-rank approximation to the user-item matrix [7, 4, 1, 30]. The
N × M rating matrix R is modeled as the product of a K × N matrix of user features U and a K × M
matrix of item features V, such that R = U^T V. This approach has an interpretation in terms of latent
user and latent item features. Each user u_i is represented by a latent K-vector U_i, the ith column of
the matrix U. Each item v_j is represented by a latent K-vector V_j, and the rating given by user u_i to
item v_j is the inner product U_i^T V_j. The features U_i, V_j are modeled as samples from i.i.d. Gaussian
distributions:

U_i \sim \mathcal{N}(U_i \mid 0, \lambda_U I), \qquad V_j \sim \mathcal{N}(V_j \mid 0, \lambda_V I).
The unbounded, real-valued nature of the features implies that the predicted rating is also unbounded
and real-valued. This does not match the bounded, discrete ratings observed in practice, but it is often
a good approximation and provides sufficiently accurate results for industry applications.
The dimensionality of the feature vectors is limited by computing resources. In addition, the
performance of these PMF models trained using gradient descent is prone to overfitting as K increases,
which demands more tuning of the regularization parameters λ_U, λ_V. The Bayesian extension, discussed
further in Chapter 2, is not susceptible to overfitting as K increases [31].
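As a minimal sketch of how such a factorization can be fit by gradient descent on a regularized squared-error (MAP-style) objective — the regularization weight, step size, and synthetic data below are illustrative assumptions, not settings used in this thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 30, 40, 5
lam, lr = 0.05, 0.02                  # illustrative regularization weight and step size

# Synthetic sparsely observed ratings: R = U^T V with ~30% of entries kept.
U_true = rng.normal(size=(K, N))
V_true = rng.normal(size=(K, M))
mask = rng.random((N, M)) < 0.3
R = (U_true.T @ V_true) * mask

def rmse_observed(U, V):
    E = (U.T @ V - R) * mask
    return np.sqrt((E ** 2).sum() / mask.sum())

U = 0.1 * rng.normal(size=(K, N))
V = 0.1 * rng.normal(size=(K, M))
before = rmse_observed(U, V)
for _ in range(500):
    E = (U.T @ V - R) * mask          # residuals on observed entries only
    U -= lr * (V @ E.T + lam * U)     # gradient of squared error + Gaussian prior
    V -= lr * (U @ E + lam * V)
after = rmse_observed(U, V)
print(before > after)                 # training RMSE decreases
```

The mask restricts the loss to the observed set Ω; the λ terms are the log-prior contributions of the zero-mean Gaussian features.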
Hierarchical Poisson Factorization
Recent work has proposed a matrix factorization model where the entries of the rating matrix are
modeled by a Poisson distribution [10]. Similar to Gaussian matrix factorization, the mean of r_{i,j} in
Poisson factorization is modeled by the inner product of a latent user feature θ_i and a latent item feature
β_j. Independent Gamma priors are placed on each user component θ_{i,k} and each item component β_{j,k},
with hyper-priors placed on the rate parameters for each. This model is conjugate, and fast variational
inference methods have been derived.
This model has several properties that are desirable in comparison to the Gaussian matrix
factorization model. In brief, the Poisson model better captures sparsity and skewness. In real data
sets, some users / items are best represented by lower-dimensional features than others (corresponding
to fewer latent features for those users / items), and the Gamma priors encourage this sparsity. The
skewness in the number of ratings per user and per item is also modeled more readily by Poisson
matrix factorization, as verified by Bayesian posterior predictive checks [10].
Hierarchical Poisson Factorization has been extended to a nonparametric framework by allowing the
dimensionality K of the features θ_i and β_j to have arbitrary support [11]. This nonparametric model was
found to have comparable or superior recall and precision across multiple data sets including MovieLens
and Netflix.
(R)BM Variants
With the rise of interest in deep learning methods, Boltzmann-machine style models have been
increasingly used. One proposed model for the Netflix competition data set was based on Restricted
Boltzmann Machines (RBMs) [33]. RBMs are an undirected graphical model with a single layer of visible units and
a single layer of hidden units. There are connections between the hidden and visible layers, but no
connections between hidden units and no connections between visible units. In this approach, each user
in the system was modeled by an RBM, with a set of visible units for the movies the user rated. The
parameters for the connectivity weights and the biases were shared between the users.
Other work has looked at the use of general Boltzmann machines for binary rating prediction [12].
A Boltzmann machine (BM) consists of a single layer of fully connected hidden units. Similar to the
RBM models, each user was represented by a single BM, with the layer of hidden units representing the
set of movies to be recommended. For training purposes, ordinal five star ratings were dichotomized to
correspond to intuitive like / dislike. The weights of the BM were tied across users, and parametrized
by item content. It was shown that the parametrization by item content improved prediction accuracy,
again analogous to the low-rank factorization of the weight matrix in the RBM model referenced above
[33].
Ordinal Models
More complicated models have been developed that take into account the ordinal nature of the ratings.
These models generate a latent real-valued variable, which is then discretized into an ordinal ranking
based on a partition of the real line [29]. In particular, for R ranked values this model defines a partition
of the real line
-\infty = b_1 < b_2 < \cdots < b_{R+1} = +\infty,

with a real-valued prediction f_{i,j} for user i on item j corresponding to rank r_{i,j} = r if b_r \le f_{i,j} < b_{r+1}.
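This thresholding step can be sketched with NumPy's `digitize`; the cut points and predictions below are illustrative values for a 5-rank scale:

```python
import numpy as np

# Interior cut points b_2 .. b_R for R = 5 ranks; b_1 = -inf, b_{R+1} = +inf.
cuts = np.array([1.5, 2.5, 3.5, 4.5])

f = np.array([-0.3, 1.7, 2.5, 3.49, 9.0])   # real-valued predictions f_ij
ranks = np.digitize(f, cuts) + 1            # rank r such that b_r <= f < b_{r+1}
print(ranks)  # [1 2 3 3 5]
```

`digitize` counts how many cut points each prediction has passed, which is exactly the rank assignment once shifted to start at 1.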
Other work has looked at more general constraint-based models [38]. These models also rely on the
concept of having ordinal (or other) ratings generated from latent continuous random variables. These
models are more general than the ordinal matrix factorization models in that they allow for general
constraints on the observations, of which the ordinal inequalities are one. Other possible constraints
include censored and binary observations.
1.2.3 Evaluation Metrics
Earlier recommendation system work focused on using root mean square error (RMSE),
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{(i,j) \in I} \left(r_{i,j} - \hat{r}_{i,j}\right)^2} = \sqrt{\frac{1}{N} \sum_{(i,j) \in I} \left(r_{i,j} - U_i^\top V_j\right)^2},
where I is the set of (i, j) pairs on which the rating predictions are evaluated. The predominance
of this metric was motivated by its adoption as the evaluation criterion in the Netflix competition.
This metric is convenient as it is convex in U_i and in V_j, but often impractical. The drawbacks to this
metric include its ignorance of the ordinal nature of the ratings. Assume for the following that ratings
are given on a 5-star ordinal scale. In practice, it is more important to have high accuracy in predicting
the ratings for items rated as 5 than items rated as 1. In addition, it can be argued from a marketing
viewpoint that it is more costly to predict a true 1 as a 5 than a true 5 as a 1. The latter error hides
one of many desired items from a user, while the former presents an unwanted item to the user. Since
the number of items recommended to the user is small (typically five or fewer), the presentation of an
undesired item can taint the user experience.
Other work has evaluated using mean absolute error (MAE),
\mathrm{MAE} = \frac{1}{N} \sum_{(i,j) \in I} \left|r_{i,j} - \hat{r}_{i,j}\right|.
This metric is closely related to RMSE, except that it does not penalize extreme deviations in
prediction error as heavily.
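Both metrics can be computed directly from a vector of held-out ratings and the corresponding predictions; the values below are illustrative:

```python
import numpy as np

r_true = np.array([5.0, 3.0, 1.0, 4.0])   # observed ratings r_ij for (i, j) in I
r_pred = np.array([4.0, 3.0, 2.0, 2.0])   # model predictions

err = r_true - r_pred
rmse = np.sqrt(np.mean(err ** 2))          # quadratic penalty on deviations
mae = np.mean(np.abs(err))                 # linear penalty on deviations
print(rmse, mae)  # 1.224744871391589 1.0
```

The single 2-point error dominates RMSE but contributes only linearly to MAE, illustrating the difference in how the two metrics weight extreme deviations.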
1.3 Connection to Thesis Work
The contributions of this thesis are extensions to Gaussian Matrix Factorization. Specifically,
extensions of a vanilla matrix factorization model are reviewed, modifications are proposed to address
existing issues, and the resulting models are evaluated for test performance.
At a high level, these extensions incorporate additional information into the prior by modifying the
mean or variance / covariance of the latent (user) features. Specifically:
• Chapter 2 defines and extends Constrained Probabilistic Matrix Factorization to the Bayesian
framework. This extension introduces a new set of latent features, Wk, k = 1, . . . ,M , for the
items. In contrast to the explicitly provided ratings, these features model implicit information
from the user (ex: viewed item, downloaded app, streamed video for more than x minutes, etc.).
These features are then used to shift the expected rating, E[ri,j ];
• Chapter 3 reviews existing work on heteroskedastic, or nonconstant, precision models. Overfitting
issues under variational inference are outlined, and a truncated precision model is proposed to
alleviate this. The performance of the truncated model is outlined, and the model is theoretically
connected to existing work through limiting cases;
• Chapter 4 outlines a model that includes user demographics, or meta features, into the PMF
framework. Similar to Chapter 2, this extension shifts the prior mean of a user feature by an
average of new latent features, each one tied to a demographic. An additional extension using
principal components is outlined and demonstrated to improve prediction;
• Chapter 5 introduces the concept of networks among users. These networks are observed in many
different cases (ex: friendship networks, trust networks), and have been shown to improve the
prediction. Limitations with existing approaches are outlined, and a new model is proposed that
overcomes these issues. The proposed model is demonstrated in the case of fully observed and
partially observed networks. Future work is outlined with respect to the performance of this
model under completely random networks.
The experiments in this thesis were frequently attempted with multiple choices of the rank d. The reported results use two specific choices of d: for Chapters 2-3, d = 20 dimensional features were used, and for Chapters 4-5, d = 10 dimensional features were used. Both of these values are commonly used in the literature [31, 32, 21, 40].
By leaving the extensions (mostly) independent of the specific distributional assumptions placed on
the latent features, the extensions can be adapted to other models. For instance, the user features in a
hierarchical Poisson Factorization model can be shifted in the same manner that user features for the
Gaussian models considered in this thesis are shifted.
1.4 Data Sets Used in this Thesis
This thesis makes use of multiple data sets for experimental results for various models.
The MovieLens 1M data set consists of 1,000,209 (user, item, rating) triplets for 6,040 users on 3,900 movies provided on the MovieLens service in 2000 [14]. The user set is heavily censored, as each user has at least 20 ratings in the data. Timestamps are present in the original data set, but are not used in this thesis. Auxiliary information on both the users and items is available. In particular, for each user, we are given: gender, age (discretized into six bins), occupation (coded into 21 categories), and zip code.
The Epinions data set was gathered over a five week crawl of the Epinions site, a product review site, in 2003. There are 49,290 users and 139,738 items in the data set. In addition, trust statements are given for 487,181 pairs of users [26, 25]. On this site, users can indicate other users whose reviews they trust. These form the trust statements, a directed network.
Flixster is a social networking site for movie fans. The Flixster data set [15] consists of 6,160,927 ratings for 109,218 users on 42,173 items. Users are able to invite other users onto the site and form "friendships". The data set also contains an undirected social network among users consisting of 1,347,222 links among pairs of users, as well as limited demographic information for a user (gender, location, and age).
Some of the demographic information is missing or nonsensical.
Ciao (ciao.co.uk) is another product review site where users can form similar trust statements. The
Ciao data set [36, 37] consists of 284,086 ratings from 7,375 users on 106,797 items. For each (user, item,
rating) triplet, there is additionally a category for the product, and a helpfulness score for the rating.
The ratings data is accompanied by a directed network among users with 111,781 edges between users.
FilmTrust is a movie rating / review service. Users in this service can add "friends" and must indicate trust values for their "friends". The FilmTrust data set [13, 9] was generated from a site crawl in June 2011. There are 35,497 ratings from 1,508 users for 2,071 items. This is the smallest of the data sets we consider. There is additionally a directed network among users with a total of 1,853 edges.
Chapter 2
Matrix Factorization
Factor based models have been used extensively in the collaborative filtering domain for preference
matching between two sets of objects. In the user recommendation framework, users are one set of
objects, and the other set is some generic collection of items. Common industry applications are videos
(ex: YouTube, Netflix), products (ex: Amazon), mobile apps (ex: Google Play), or other users (ex:
Facebook, LinkedIn, OkCupid). Factor based models are content-less: they require no content extraction
or feature generation from the users and items. The content-less nature of factor based models allows
these models to easily adapt from one item domain to another (videos to music), or even to domains
with multiple item contexts (ex: generic product recommendation). Instead, these models assume there
are a small number of unobserved latent features associated with each user and item that determine
preferences.
Matrix factorization models are a common factor based model. Given N users, M items, and a (user,
item) matrix of preferences R ∈ RN×M , these models approximate R as the product of two low rank
matrices such that R ≈ U>V , where U ∈ Rd×N , V ∈ Rd×M . Each column in U is the d-dimensional
latent feature of a user and each column in V the d-dimensional latent feature of an item. The entry r_{i,j} can be reconstructed as the inner product U_i^\top V_j, where U_i is the i-th column of U and V_j is the j-th column of V. The problem of estimating U and V can be approached as an incomplete SVD problem:
find the best approximation to the partially observed matrix R given some loss function.
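The low-rank reconstruction described above can be sketched numerically. The following is an illustrative NumPy snippet; the matrix sizes and the random features are assumptions for demonstration only:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, M = 3, 4, 5            # illustrative sizes; real systems use d around 10-20

U = rng.normal(size=(d, N))  # column i is the latent feature of user i
V = rng.normal(size=(d, M))  # column j is the latent feature of item j

R_hat = U.T @ V              # full N x M matrix of reconstructed ratings

# A single entry is the inner product of the corresponding columns.
i, j = 1, 2
assert np.isclose(R_hat[i, j], U[:, i] @ V[:, j])

# The reconstruction has rank at most d.
assert np.linalg.matrix_rank(R_hat) <= d
```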
A common probabilistic framework for matrix factorization models is to assume that each user feature
Ui and each item feature Vj are independent samples from some probability distribution, and that the
rating ri,j has a distribution given Ui, Vj , and possibly other parameters. The goal is then to make
inference of the user features Ui and item features Vj in order to make predictions of the ratings /
preferences ri,j . In the Bayesian framework, this inferential problem equates to modeling the posterior
of the user and item features given the observed ratings.
There are two common inferential approaches in the machine learning literature for this problem.
The first is grounded in Monte Carlo methods, and the second is grounded in variational inference.
Monte Carlo methods approximate the true posterior. They are often criticized for slow convergence
and computational complexity. Conversely, variational methods provide exact results to an approximate
problem. They often are favoured for computational simplicity, though the approximations employed
typically rely on the strong assumption of posterior independence among the user and item features. To
our knowledge, work in the literature commonly reports experimental results using only one of Gibbs
sampling or variational inference. Further, these results are often reported only for the overall training and test sets. There is rarely a discussion of the performance with respect to users with different numbers of ratings in the system. Such a distinction is important, as inference, and hence prediction, for a user depends on the number of ratings in the system for that user.
This chapter contains the first direct comparison of Gibbs sampling and variational inference in the
matrix factorization context. We report our comparative results using overall test error metrics, and
also broken down for subsets of users defined by different frequency in the training set. Our work has
the following contributions:
1. We provide a direct comparison of Gibbs sampling and variational inference for multiple matrix
factorization problems in the recommendation framework;
2. We extend the previous work on constrained probabilistic matrix factorization to the Gibbs framework, highlighting the gains made by Gibbs sampling over the MAP estimate.
2.1 Probabilistic Matrix Factorization
Probabilistic Matrix Factorization (PMF) is a probabilistic linear model that models r_{i,j} as a Gaussian with mean U_i^\top V_j. In the vanilla PMF model, the baseline for much of the work in this thesis, the precision (inverse variance) of this Gaussian is a constant \tau. Each column of U and V, corresponding to each user and item, has an independent Gaussian prior placed on it. The conditional distribution over the observed ratings R and the prior distributions over U and V are given by
(ri,j |Ui, Vj) ∼[N (ri,j | U>i Vj , τ)
]Ii,j(Ui|λU ) ∼N (Ui | 0, λUI)
(Vj |λV ) ∼N (Vj | 0, λV I),
(2.1)
where \mathcal{N}(x \mid \mu, \tau) denotes the univariate Gaussian distribution for x with mean \mu and precision \tau, and I_{i,j} \in \{0, 1\} is the indicator that user i provided a rating for item j. Further, \mathcal{N}(x \mid \mu, \Lambda) denotes the multivariate Gaussian distribution with mean vector \mu and precision matrix \Lambda.
In practice, it is important to model the bias of each user and each item. Let γi denote the bias for
user i, and ηj the bias for item j. The mean of the predicted rating in Equation (2.1) should be properly
modeled by
E[r_{i,j} \mid \gamma_i, \eta_j, U_i, V_j] = \gamma_i + \eta_j + U_i^\top V_j.    (2.2)
In the probabilistic framework, both γi and ηj can be modeled as univariate Gaussians
(\gamma_i \mid \lambda_\gamma) \sim \mathcal{N}(\gamma_i \mid 0, \lambda_\gamma)
(\eta_j \mid \lambda_\eta) \sim \mathcal{N}(\eta_j \mid 0, \lambda_\eta).    (2.3)
Inference for this model is performed by maximizing the log-posterior over the latent features and
biases with fixed hyperparameters,
\log p(U, V, \gamma, \eta \mid R, \tau, \lambda_U, \lambda_V, \lambda_\gamma, \lambda_\eta)
= \sum_{i=1}^{N} \sum_{j=1}^{M} I_{i,j} \log p(r_{i,j} \mid U_i, V_j, \tau) + \sum_{i=1}^{N} \log p(U_i \mid \lambda_U) + \sum_{j=1}^{M} \log p(V_j \mid \lambda_V)
+ \sum_{i=1}^{N} \log p(\gamma_i \mid \lambda_\gamma) + \sum_{j=1}^{M} \log p(\eta_j \mid \lambda_\eta).    (2.4)
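The objective in Equation (2.4) is straightforward to evaluate. Below is a hedged NumPy sketch of the unnormalized log-posterior; the function name, the dense-array representation of R and the indicator matrix, and the dropping of normalizing constants are assumptions for illustration (the \lambda's are treated as precisions, consistent with the precision parametrization of Equation (2.1)):

```python
import numpy as np

def log_posterior(R, I_obs, U, V, gamma, eta, tau, lam_U, lam_V, lam_g, lam_e):
    """Unnormalized log-posterior of Equation (2.4), constants dropped.

    R, I_obs are N x M arrays; U is d x N, V is d x M; gamma, eta are bias vectors.
    Each Gaussian term contributes -0.5 * precision * squared-error.
    """
    resid = R - (gamma[:, None] + eta[None, :] + U.T @ V)
    loglik = -0.5 * tau * np.sum(I_obs * resid ** 2)
    logprior = (-0.5 * lam_U * np.sum(U ** 2)
                - 0.5 * lam_V * np.sum(V ** 2)
                - 0.5 * lam_g * np.sum(gamma ** 2)
                - 0.5 * lam_e * np.sum(eta ** 2))
    return loglik + logprior

# Tiny sanity check: a perfect fit with zero-valued parameters gives 0.
val = log_posterior(np.zeros((2, 2)), np.ones((2, 2)),
                    np.zeros((1, 2)), np.zeros((1, 2)),
                    np.zeros(2), np.zeros(2),
                    tau=2.0, lam_U=1.0, lam_V=1.0, lam_g=1.0, lam_e=1.0)
```

MAP inference would maximize this quantity over U, V, \gamma, and \eta, for example by gradient ascent or alternating least squares.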
2.2 Bayesian Probabilistic Matrix Factorization
Bayesian PMF extends the PMF model of Section 2.1 to the Bayesian framework. The likelihood of the observed rating data is the same as in the PMF model, outlined in Equation 2.1. The Bayesian extension places independent Gaussian priors on the latent features U and V, each with unknown means \mu_U, \mu_V and precisions \Lambda_U, \Lambda_V,

(U \mid \mu_U, \Lambda_U) \sim \prod_{i=1}^{N} \mathcal{N}(U_i \mid \mu_U, \Lambda_U)
(V \mid \mu_V, \Lambda_V) \sim \prod_{j=1}^{M} \mathcal{N}(V_j \mid \mu_V, \Lambda_V).    (2.5)
This is in contrast to the PMF model, where the features are mean zero, with spherical precisions.
This eliminates the need for the tuning parameters λU , λV , which have been shown to need careful tuning
to avoid overfitting [32].
If the biases are to be modeled, independent Gaussian priors for the γi and ηj can be included,
(\gamma \mid \mu_\gamma, \lambda_\gamma) \sim \prod_{i=1}^{N} \mathcal{N}(\gamma_i \mid \mu_\gamma, \lambda_\gamma)
(\eta \mid \mu_\eta, \lambda_\eta) \sim \prod_{j=1}^{M} \mathcal{N}(\eta_j \mid \mu_\eta, \lambda_\eta).    (2.6)
This Bayesian extension further places Gaussian-Wishart priors on the latent feature hyperparameters \{\mu_U, \Lambda_U\} and \{\mu_V, \Lambda_V\},

(\mu_U, \Lambda_U) \sim \mathcal{N}(\mu_U \mid \mu_0, \beta_0 \Lambda_U) \cdot \mathcal{W}(\Lambda_U \mid W_0, \nu_0)
(\mu_V, \Lambda_V) \sim \mathcal{N}(\mu_V \mid \mu_0, \beta_0 \Lambda_V) \cdot \mathcal{W}(\Lambda_V \mid W_0, \nu_0),    (2.7)
where Λ ∼ W(Λ|W0, ν0) denotes a random variable Λ drawn from a Wishart distribution with ν0
degrees of freedom and scale matrix W0.
The graphical model for Bayesian PMF is in Figure 2.1.
2.3 Constrained Bayesian PMF
Learning the model of Section 2.2 improves the recommendation process over the baseline of predicting at random. However, the improvement is not uniform over users. The learned model has the most benefit over predicting at random for users with many ratings in the system. Users with few ratings
Figure 2.1: Bayesian Constrained Probabilistic Matrix Factorization with Gaussian-Wishart priors over the latent user, item, and side feature vectors.
in the system have little contribution to the features and biases from the likelihood, and so learning will
be dominated by the prior.
This notion can be formalized mathematically by considering the posterior expectation of a rating r_{i,j}, denoted E_{\text{posterior}}[r_{i,j}]. If there is little influence from the data, then the expected value of the features and biases under the posterior, E_{\text{posterior}}, will be close to the expected value under the prior, E_{\text{prior}}. Propagating this to the expected rating, we have
E_{\text{posterior}}[r_{i,j}] = E_{\text{posterior}}[\gamma_i + \eta_j + U_i^\top V_j]
= E_{\text{posterior}}[\gamma_i] + E_{\text{posterior}}[\eta_j] + E_{\text{posterior}}[U_i^\top V_j]
\approx E_{\text{prior}}[\gamma_i] + E_{\text{posterior}}[\eta_j] + E_{\text{prior}}[U_i]^\top E_{\text{posterior}}[V_j]
= \mu_\gamma + E_{\text{posterior}}[\eta_j] + \mu_U^\top E_{\text{posterior}}[V_j].    (2.8)
The first equality is by definition, the second follows from linearity, and the subsequent approximation reflects the weak effect of the likelihood. The final equality follows from the definition of the Bayesian model, and highlights a problem for rare and cold start users. The first term is pure regularization from the prior, and the third term is multiplicative in this prior regularization. Only the second term contains signal from the likelihood, and it is driven purely by the item. This means that predictions for cold start and rare users are not personalized: these recommendations are driven by global item information.
One way of constraining the predictions for these users is to model implicit feedback from the users. In contrast to explicitly provided feedback (i.e., users rating items on a numeric scale), implicit feedback can take many forms. For instance:
• Repeatedly viewing a product page
• Downloading an app
• Streaming a video for more than a given amount of time
In each of these cases, the action taken by the user is implicit feedback of interest [32]. To model
this feedback, let W ∈ Rd×M be a set of such constraint features, or side features. Denote each column
of the matrix as Wk ∈ Rd. Further, introduce the indicator matrix I
I = \begin{pmatrix} I_{1,1} & I_{1,2} & \cdots & I_{1,M} \\ I_{2,1} & I_{2,2} & \cdots & I_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ I_{N,1} & I_{N,2} & \cdots & I_{N,M} \end{pmatrix},    (2.9)

where I_{i,j} = 1 if user i has provided implicit feedback on item j, and 0 otherwise.
With this in place, define the shifted user feature as
S_i = \delta_U U_i + \delta_W \frac{\sum_{k=1}^{M} I_{i,k} W_k}{\sum_{\ell=1}^{M} I_{i,\ell}}.    (2.10)
We explicitly include indicator functions \delta_U, \delta_W \in \{0, 1\} to ease the interpretation of the derivations that follow. When \delta_W = 0, the model includes only the user specific latent features U_i and reduces to the Bayesian PMF of Section 2.2. The case \delta_U = 0 is an interesting case where there are no user features U_i, only the shared side features W_k.
The normalization of the second term in Equation (2.10) by \sum_{\ell=1}^{M} I_{i,\ell} ensures that the magnitude of the shift is independent of the number of items with feedback. For notational convenience, we let n_i = \sum_{\ell=1}^{M} I_{i,\ell} denote the number of items with feedback from user i.
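Equation (2.10) amounts to shifting the user feature by the average of the side features of the items with feedback. A minimal NumPy sketch follows; the function name and array layout are illustrative assumptions:

```python
import numpy as np

def shifted_user_feature(U_i, W, feedback_i, delta_U=1.0, delta_W=1.0):
    """Equation (2.10): shift user feature U_i by the normalized sum of side
    features W_k over items with implicit feedback.

    `W` is d x M; `feedback_i` is the length-M 0/1 indicator row for user i.
    """
    n_i = feedback_i.sum()
    if n_i == 0:                       # no implicit feedback: no shift
        return delta_U * U_i
    return delta_U * U_i + delta_W * (W @ feedback_i) / n_i

d, M = 2, 4
U_i = np.array([1.0, 0.0])
W = np.ones((d, M))
feedback = np.array([1.0, 1.0, 0.0, 0.0])  # feedback on two items

# The shift is the average of the two active (all-ones) side features,
# so S_i = U_i + [1, 1] = [2, 1].
S_i = shifted_user_feature(U_i, W, feedback)
```

Note that dividing by n_i keeps the shift bounded regardless of how many items receive feedback, matching the normalization argument above.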
The intuition behind Equation (2.10) is related to the concept of taste similarity [15]. Intuitively,
Wk captures the effect on the prior mean from a user having rated (or simply viewed) a particular
item. Therefore, users with similar rating (or viewing) habits will have similar prior distributions for
the feature vectors. This is formalized in the derivation of the posterior mean for Ui in Appendix B.
In the probabilistic framework, the Wk are regularized by the same spherical zero mean Gaussian
prior as the other latent features
(W_k \mid \lambda_W) \sim \mathcal{N}(W_k \mid 0, \lambda_W I).    (2.11)
In the Bayesian framework, we place a full Gaussian prior over each Wk in a manner analogous to
the other latent features
(W \mid \mu_W, \Lambda_W) \sim \prod_{k=1}^{M} \mathcal{N}(W_k \mid \mu_W, \Lambda_W).    (2.12)
The Bayesian extension hierarchically places a Gaussian-Wishart prior on the latent feature hyperparameters \{\mu_W, \Lambda_W\} in a manner analogous to the other latent features

(\mu_W, \Lambda_W) \sim \mathcal{N}(\mu_W \mid \mu_0, \beta_0 \Lambda_W) \cdot \mathcal{W}(\Lambda_W \mid W_0, \nu_0).    (2.13)
As previously presented in the literature, constrained PMF shifts the user features. The construction can be made abstract. The W_k introduce a new set of features W_k, k = 1, \ldots, M, for each column of
the rating matrix R. The prediction for the (i, j) entry of the rating matrix is
E[r_{i,j} \mid U_i, V_j, W_{1:M}] = E\left[ \gamma_i + \eta_j + \left( U_i + \frac{1}{n_i} \sum_{k=1}^{M} I_{i,k} W_k \right)^\top V_j \right]
= E[\gamma_i] + E[\eta_j] + E[U_i^\top V_j] + \frac{1}{n_i} \sum_{k=1}^{M} I_{i,k} E[W_k^\top V_j]
= E_{\text{vanilla}}[r_{i,j}] + \frac{1}{n_i} \sum_{k=1}^{M} I_{i,k} E[W_k^\top V_j],    (2.14)
where E_{\text{vanilla}}[r_{i,j}] is the expectation of r_{i,j} under the "vanilla" matrix factorization model with only user and item (abstractly, row and column) latent features. Notice that this shifts the prediction of the baseline model for a given user i by the average of an inner product between the column feature V_j and the set of side features associated with other columns also rated (viewed) by that user.
This abstraction not only highlights the symmetric nature of these latent side features, it also highlights how the expectations are shifted relative to the baseline. The most reliable inference on predicted ratings is obtained when the additional side features W_k are associated with the dimension of the matrix with less sparsity (row-wise or column-wise).
Therefore, a similar constraint can be placed on the items. This can prove advantageous if the
recommendation system tends to have more sparsity across columns than across rows. By symmetry
of the model, this is as simple as transposing the user / item matrix. The model without side features
is invariant to this transposition. For the model with side features, the transposition modifies the
expectation of the rating to be
E[r_{i,j} \mid U_i, V_j, W_{1:N}] = E\left[ \gamma_i + \eta_j + U_i^\top \left( \frac{1}{m_j} \sum_{\ell=1}^{N} I_{\ell,j} W_\ell + V_j \right) \right]
= E[\gamma_i] + E[\eta_j] + \frac{1}{m_j} \sum_{\ell=1}^{N} I_{\ell,j} E[U_i^\top W_\ell] + E[U_i^\top V_j]
= E_{\text{vanilla}}[r_{i,j}] + \frac{1}{m_j} \sum_{\ell=1}^{N} I_{\ell,j} E[U_i^\top W_\ell],    (2.15)

where we have defined m_j = \sum_{\ell=1}^{N} I_{\ell,j} as the number of users who provided feedback for item j.
When the side features Wk offset the user features Ui, the contribution from each user to the inner
product is affected. The practical impact of this is an improvement in prediction for users with few ratings
in the system. Similarly, when the side features Wk offset the item features Vj as in Equation (2.15),
the contribution from each item to the inner product is affected, improving prediction for items that
are rare in the system. Therefore, overall performance will be improved when the side features Wk are
associated with the dimensionality of the matrix that is typically more sparse.
2.4 Inference
The predictive distribution for the ratings ri,j is obtained by integrating out the features, the hyperpa-
rameters, and any other variables included in the model. This integral is computationally intractable,
requiring the use of approximate methods. There are two general approaches in the literature, corresponding to two different approximations.
The first is to use Monte Carlo inference to obtain an approximation to the true posterior distribution.
This approach is common in the statistical communities. This can require the choice of a proposal
distribution for sampling, and the computation of acceptance / rejection rates. Such an acceptance /
rejection scheme can be problematic, requiring tuning of the proposal distribution in order to obtain a
reasonable acceptance rate. However, it is often the case that the prior distributions can be chosen such
that the posterior sampling distributions are available in closed form, allowing for Gibbs sampling. This
is the case with the models discussed in this chapter, the vanilla PMF model [30, 31] and the constrained
PMF extension.
The second is to use variational inference to obtain exact inference on an approximation to the true
posterior distribution [39]. This requires the choice of a set of independence assumptions between the
variables. Once the independence assumptions are made, the exact distributional form of the variational
approximation is determined by minimizing the Kullback-Leibler (KL) divergence between the true
posterior and the variational approximation. These have been used for other model variations [17].
The two methods are similar in that they both make inference on a distribution. They are different
in that MCMC methods attempt to approximate the true posterior, while variational methods provide
exact results on the “best approximation” (measured by KL divergence) to the true posterior given
certain independence assumptions. In the models we consider, the set of independence assumptions
common in the literature lead to the same distributional form as the Gibbs updates, and so the only
practical difference is the prediction method. In MCMC (and specifically, in Gibbs), the prediction over
samples of the features is averaged and used for inference. In variational methods, the mean of the
distribution after each update is used for prediction. This is explained further in Section 2.5.
For our experimental results, we analyze the performance of both inference methods for the models
considered.
2.4.1 Gibbs Sampling
The prior distributions for the latent features are conjugate to the likelihood, yielding tractable Gibbs sampling distributions that are easy to sample from. In particular, the conditional posteriors for all latent features discussed are multivariate Gaussian distributions.
The posterior for the user feature U_i is a multivariate Gaussian,

(U_i \mid R, V, W, \mu_U, \Lambda_U) \sim \mathcal{N}(U_i \mid \mu_{U_i}, \Lambda_{U_i}),
where \mu_{U_i} = \Lambda_{U_i}^{-1} \left[ \Lambda_U \mu_U + \delta_U \tau \sum_{j=1}^{M} I_{i,j} V_j \left( r_{i,j} - \frac{\delta_W}{n_i} V_j^\top \left( \sum_{k=1}^{M} I_{i,k} W_k \right) \right) \right]
\Lambda_{U_i} = \Lambda_U + \delta_U \tau \sum_{j=1}^{M} I_{i,j} V_j V_j^\top.    (2.16)
The equivalent form for the model without side features is obtained by excluding the term with the
side features Wk, and has been published previously in the literature [31].
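One full conditional draw of Equation (2.16) can be sketched as follows. This is an illustrative NumPy implementation (the function signature, the dictionary of a user's observed ratings, and the Cholesky-based sampling step are assumptions; \delta_U = \delta_W = 1, and biases are omitted, as in the equation above):

```python
import numpy as np

def sample_user_feature(rng, ratings_i, V, mu_U, Lam_U, tau, shift_i=None):
    """Draw U_i from its conditional posterior, Equation (2.16).

    `ratings_i` maps item index j -> r_{i,j}; `shift_i` is the optional
    side-feature shift (1/n_i) * sum_k I_{i,k} W_k.
    """
    d = V.shape[0]
    Lam = Lam_U.copy()                 # posterior precision accumulator
    b = Lam_U @ mu_U                   # unscaled posterior mean accumulator
    for j, r in ratings_i.items():
        v = V[:, j]
        resid = r if shift_i is None else r - v @ shift_i
        Lam += tau * np.outer(v, v)
        b += tau * resid * v
    mu = np.linalg.solve(Lam, b)
    # Sample from N(mu, Lam^{-1}) via the Cholesky factor of the precision:
    # if Lam = L L^T, then mu + L^{-T} z has covariance Lam^{-1}.
    L = np.linalg.cholesky(Lam)
    return mu + np.linalg.solve(L.T, rng.standard_normal(d))

rng = np.random.default_rng(0)
d = 2
V = rng.standard_normal((d, 5))
sample = sample_user_feature(rng, {0: 4.0, 3: 2.0}, V, np.zeros(d), np.eye(d), tau=2.0)
```

With no observed ratings, the draw collapses to a sample from the prior N(\mu_U, \Lambda_U^{-1}), which is the behaviour Equation (2.8) describes for cold start users.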
The posterior for the item feature Vj is also a multivariate Gaussian. It is obtained by setting Ui in
the posterior of the item features in the vanilla BPMF model equal to Si in Equation 2.10 [31]
(V_j \mid R, U, W, \mu_V, \Lambda_V) \sim \mathcal{N}(V_j \mid \mu_{V_j}, \Lambda_{V_j}),
where \mu_{V_j} = \Lambda_{V_j}^{-1} \left[ \Lambda_V \mu_V + \tau \sum_{i=1}^{N} I_{i,j} S_i r_{i,j} \right]
\Lambda_{V_j} = \Lambda_V + \tau \sum_{i=1}^{N} I_{i,j} S_i S_i^\top.    (2.17)
To our knowledge, the posterior for the side features Wk in the Bayesian extension has not previously
been presented in the literature. It is also a multivariate Gaussian. For compactness, first define the
prediction made without Wm as
\hat{r}_{i,j,-W_m} = \left( \delta_U U_i + \frac{\delta_W}{n_i} \sum_{k \neq m} I_{i,k} W_k \right)^\top V_j.    (2.18)
With this definition, the posterior sampling distribution is a multivariate Gaussian, parametrized by

W_m \sim \mathcal{N}(W_m \mid \mu_{W_m}, \Lambda_{W_m}),
where \Lambda_{W_m} = \Lambda_W + \delta_W^2 \tau \sum_{i=1}^{N} \sum_{j=1}^{M} \frac{I_{i,j} I_{i,m}}{n_i^2} V_j V_j^\top
\mu_{W_m} = \Lambda_{W_m}^{-1} \left[ \Lambda_W \mu_W + \delta_W \tau \sum_{(i,j):\, I_{i,j} I_{i,m} = 1} \frac{1}{n_i} V_j \left( r_{i,j} - \hat{r}_{i,j,-W_m} \right) \right].    (2.19)
Similar expressions are obtained if the side features Wk are associated with rows of the rating matrix
by exchanging the notation Ui and Vj in the above expressions.
2.4.2 Variational Inference
As discussed in Section 2.4, variational algorithms start with a set of independence assumptions on the
parameters of interest. This set of independence assumptions defines a variational approximation Q that
we perform inference on. Optimizing Kullback-Leibler divergence between the true posterior and the
variational approximation defines the form of the distribution Q.
Given data D and a set of parameters θ, Kullback-Leibler (KL) divergence between the variational
approximation Q and the true distribution p is defined by,
KL(Q \,\|\, p) = \int Q(\theta) \log \frac{Q(\theta)}{p(\theta \mid D)} \, d\theta.    (2.20)
This can be re-expressed as
KL(Q \,\|\, p) = \int Q(\theta) \log \frac{Q(\theta)}{p(\theta \mid D)} \, d\theta
= \int Q(\theta) \log \frac{Q(\theta)}{p(\theta, D)} \, d\theta + \log p(D).    (2.21)
The second term on the right-hand side, \log p(D), is independent of Q and of \theta. Rearranging, it can be expressed as
\log p(D) = KL(Q \,\|\, p) - \int Q(\theta) \log \frac{Q(\theta)}{p(\theta, D)} \, d\theta
= KL(Q \,\|\, p) + \int Q(\theta) \log \frac{p(\theta, D)}{Q(\theta)} \, d\theta
= KL(Q \,\|\, p) + \int Q(\theta) \log p(\theta, D) \, d\theta - \int Q(\theta) \log Q(\theta) \, d\theta
= KL(Q \,\|\, p) + E_Q[\log p(\theta, D)] + H(Q),    (2.22)
where E_Q denotes the expectation under the variational approximation Q, and H(Q) = -\int Q(\theta) \log Q(\theta) \, d\theta denotes the entropy of the distribution.
Equation (2.22) expresses the fixed quantity log p(D) as the sum of the KL divergence, the expected
complete log-likelihood, and the entropy. Minimizing KL is therefore equivalent to jointly maximizing
the second and third terms. The sum of these two terms is known as the variational lower bound.
We use a structured mean field approximation for the distribution Q,
Q(\gamma_{1:N}, U_{1:N}, \eta_{1:M}, V_{1:M}, W_{1:M}, \tau, \mu_U, \Lambda_U, \mu_V, \Lambda_V, \mu_W, \Lambda_W \mid D)
= Q(\tau) \prod_{i=1}^{N} Q(\gamma_i) Q(U_i) \prod_{j=1}^{M} Q(\eta_j) Q(V_j) \prod_{k=1}^{M} Q(W_k)
\times Q(\mu_U, \Lambda_U) Q(\mu_V, \Lambda_V) Q(\mu_W, \Lambda_W).    (2.23)
This approximation assumes pairwise independence between all the latent features Ui, Vj ,Wk, while
allowing for structure in the latent feature hyperparameters. Such mean field approximations have been
used in the literature in comparable research [17].
Under this mean field approximation, the entropy term of the variational lower bound factorizes,

E_Q[\log p(\theta, D)] + H(Q) = E_Q[\log p(\theta, D)] + \sum_{\ell} H(Q_\ell),    (2.24)

where \ell indexes each factor in the product of Equation (2.23).
Optimizing with respect to Q yields the distributional form of the variational approximation Q. We
refer to Appendix C for the full derivation, and summarize the results here.
The distributional form for Q(\tau) is Gamma,

Q(\tau) = \mathcal{G}(\tau \mid \hat{a}_\tau, \hat{b}_\tau),
where \hat{a}_\tau = a_\tau + \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} I_{i,j}
\hat{b}_\tau = b_\tau + \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} I_{i,j} \left( r_{i,j} - \hat{r}_{i,j} \right)^2.    (2.25)
The distributional form for Q(Ui) is multivariate Gaussian,
Q(U_i) = \mathcal{N}(U_i \mid \mu_{U_i}, \Lambda_{U_i}),
where \mu_{U_i} = \Lambda_{U_i}^{-1} \left[ \Lambda_U \mu_U + \delta_U \tau \sum_{j=1}^{M} I_{i,j} V_j \left( r_{i,j} - \frac{\delta_W}{n_i} V_j^\top \left( \sum_{k=1}^{M} I_{i,k} W_k \right) \right) \right]
\Lambda_{U_i} = \Lambda_U + \delta_U \tau \sum_{j=1}^{M} I_{i,j} V_j V_j^\top.    (2.26)
The distributional form for Q(Vj) is multivariate Gaussian,
Q(Vj) =N (Vj | µVj ,ΛVj )
where µVj =Λ−1Vj
[ΛV µV + τ
N∑i=1
Ii,j(ri,j − (δUUi +δWni
M∑k=1
Ii,kWk)>Vj)
]
ΛVj =ΛV + τ
N∑i=1
Ii,j
(δUUi +
δWni
M∑k=1
Ii,kWk
)(δUUi +
δWni
M∑k=1
Ii,kWk
)>.
(2.27)
The distributional form for Q(Wm) is multivariate Gaussian,
Q(W_m) = \mathcal{N}(W_m \mid \mu_{W_m}, \Lambda_{W_m}),
where \mu_{W_m} = \Lambda_{W_m}^{-1} \left[ \Lambda_W \mu_W + \delta_W \tau \sum_{(i,j):\, I_{i,j} I_{i,m} = 1} \frac{1}{n_i} V_j \left( r_{i,j} - \hat{r}_{i,j,-W_m} \right) \right]
\Lambda_{W_m} = \Lambda_W + \delta_W \tau \sum_{(i,j):\, I_{i,j} I_{i,m} = 1} \frac{1}{n_i^2} V_j V_j^\top,    (2.28)

where \hat{r}_{i,j,-W_m} was defined in Equation (2.18).
Optimization is symmetric with respect to the three sets of feature hyperparameters. Given this, we
only give the explicit derivation for the user hyperparameters. The distributional form for Q(µU ,ΛU ) is
a Gaussian-Wishart
Q(\mu_U, \Lambda_U) = \mathcal{N}(\mu_U \mid \hat{\mu}_U, \beta_N \Lambda_U) \cdot \mathcal{W}(\Lambda_U \mid \hat{\nu}_U, \hat{W}_U),
where \bar{U} = \frac{1}{N} \sum_{i=1}^{N} U_i
\hat{\mu}_U = \frac{N \bar{U} + \beta_0 \mu_0}{N + \beta_0}
\beta_N = \beta_0 + N
\hat{\nu}_U = N + \nu_0
\hat{W}_U^{-1} = W_0^{-1} + \frac{N \beta_0}{N + \beta_0} (\bar{U} - \mu_0)(\bar{U} - \mu_0)^\top + \sum_{i=1}^{N} (U_i - \bar{U})(U_i - \bar{U})^\top.    (2.29)
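The updates in Equation (2.29) are simple moment computations on the current features. A hedged NumPy sketch follows; the function name and the dictionary return type are illustrative assumptions:

```python
import numpy as np

def gaussian_wishart_update(U, mu_0, beta_0, nu_0, W_0):
    """Posterior Gaussian-Wishart parameters for (mu_U, Lambda_U), Equation (2.29).

    `U` is the d x N matrix of user features; the remaining arguments are the
    prior hyperparameters (mu_0, beta_0, nu_0, W_0).
    """
    d, N = U.shape
    U_bar = U.mean(axis=1)
    centered = U - U_bar[:, None]
    S = centered @ centered.T                       # scatter about the mean
    diff = U_bar - mu_0
    W_inv = np.linalg.inv(W_0) + S + (N * beta_0 / (N + beta_0)) * np.outer(diff, diff)
    return {
        "mu": (N * U_bar + beta_0 * mu_0) / (N + beta_0),
        "beta": beta_0 + N,
        "nu": nu_0 + N,
        "W": np.linalg.inv(W_inv),
    }

# One-dimensional example with two user features, 1 and 3.
stats = gaussian_wishart_update(np.array([[1.0, 3.0]]), np.array([0.0]),
                                beta_0=1.0, nu_0=2.0, W_0=np.eye(1))
```

The same routine applies unchanged to the item and side features, reflecting the symmetry noted above.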
2.5 Predictions
2.5.1 Gibbs Sampling
Prediction for Monte Carlo inference is performed by generating, over sampling runs t, samples of the features U_i^{(t)}, V_j^{(t)}, W_k^{(t)} and biases \gamma_i^{(t)}, \eta_j^{(t)} for i = 1:N, j = 1:M, k = 1:M from a Markov chain whose stationary distribution is the posterior distribution over the model parameters and the hyperparameters. The Monte Carlo prediction is taken as the mean of the distribution of the rating, conditional on the samples. In other words, the average over the samples,

\hat{E}[r_{i,j}] = \frac{1}{T} \sum_{t=1}^{T} \left[ \gamma_i^{(t)} + \eta_j^{(t)} + U_i^{(t)\top} V_j^{(t)} \right].    (2.30)
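The averaging in Equation (2.30) can be sketched as follows; the list-of-dicts representation of the Gibbs samples and the function name are illustrative assumptions:

```python
import numpy as np

def gibbs_predict(samples, i, j):
    """Average the rating prediction over posterior samples, Equation (2.30).

    `samples` is a list of dicts, each holding one Gibbs draw of the parameters
    (bias vectors "gamma" and "eta", feature matrices "U" and "V").
    """
    preds = [s["gamma"][i] + s["eta"][j] + s["U"][:, i] @ s["V"][:, j]
             for s in samples]
    return np.mean(preds)

# Two toy draws for a single user and item: predictions 3.0 and 4.0.
samples = [
    {"gamma": np.array([1.0]), "eta": np.array([1.0]),
     "U": np.array([[1.0]]), "V": np.array([[1.0]])},
    {"gamma": np.array([0.0]), "eta": np.array([0.0]),
     "U": np.array([[2.0]]), "V": np.array([[2.0]])},
]
pred = gibbs_predict(samples, 0, 0)
```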
2.5.2 Variational Inference
The prediction for the variational algorithm is the expected rating under the variational approximation with the currently inferred distribution, E_Q[r_{i,j}],

E_Q[r_{i,j}] = E_Q[\gamma_i + \eta_j + U_i^\top V_j]
= E_Q[\gamma_i] + E_Q[\eta_j] + E_Q[U_i]^\top E_Q[V_j].    (2.31)
The final equality follows from the linearity of the expectation operator and from the independence
assumptions made by the mean field approximation.
2.6 Experimental Setup
We experiment on two sets of data: the MovieLens 1M data set and the Epinions data set. Both of these data sets have been used frequently in the literature in the collaborative filtering context [15, 17]. Descriptive statistics for the two data sets are given in Table 2.1.
MovieLens 1M consists of 1,000,209 ordinal ratings on the scale \{1, 2, 3, 4, 5\} by N = 6,040 users on M = 3,952 items. To make a direct comparison to previously reported variational results [17], we removed any movies rated fewer than three times, and ensured that each user and movie appeared in the training set at least once. The data was split into a 70% training / 30% testing set for evaluation. We report root-mean-square error (RMSE) on the test set.
Epinions consists of 664,824 ordinal ratings on the scale \{1, 2, 3, 4, 5\} by N = 49,290 users who rated M = 139,738 items. We ensured that each user and each item appeared at least once in the training set. No other conditions were imposed on the train / test split. The data was split into a 70% training / 30% testing set for evaluation. We report root-mean-square error (RMSE) on the test set for the models considered.
Section 2.3 abstracted the notion of constrained PMF as an additional set of latent features associated
with either rows or columns. When associated with columns, these features shifted the prediction for
each row, and vice versa when associated with rows. The optimal choice is to associate the additional
set of latent features with the dimension of the matrix with less sparsity. For MovieLens, we associate
the side features with columns, while for Epinions, we associated the side features with rows. Given
Table 2.1: Summary data on the MovieLens 1M and the Epinions data sets.

                        MovieLens    Epinions
Number of Ratings       1,000,209    664,824
Number of Users         6,040        49,290
Number of Items         3,952        139,738
Ratings per Item
    Min                 0            0
    25th                23           1
    50th                104          1
    Mean                166          3
    75th                323          2
    Max                 3,428        1,408
Ratings per User
    Min                 20           0
    25th                44           1
    50th                96           3
    Mean                253          9
    75th                208          9
    Max                 2,314        724
Sparsity                4.19%        0.01%
the number of users and items in each system, this choice introduces the smallest number of additional
feature vectors.
To be clear, we consider the following three sets of models. The first includes only the user and item biases. The second is the vanilla PMF model with user and item features and biases. The third is the constrained PMF model that adds the side features to the existing user and item features. A MAP estimate was generated for each model on each data set, and Gibbs sampling was initialized from these MAP estimates.
For Gibbs sampling, experimentation with different numbers of samples was used to determine a point at which convergence occurred. Burn-in was ignored: an exploratory analysis of traceplots of the feature vectors suggested quick mixing, and the initial decline in the overall test error was rapid. Combined, these suggest that allowing for burn-in would yield minimal improvement. Convergence of the variational algorithm was assessed using the variational lower bound.
Unless otherwise noted, all simulations that follow used the following choices for the tuning parameters. For the Gaussian-Wishart priors on the feature vectors, (\mu_0, W_0, \beta_0, \nu_0) = (0_{d \times 1}, I_d, 1, d + 1). The mean value was chosen to reflect that the features are mean zero after accounting for the biases, while the values for the scale matrix and degrees of freedom were selected to give a vague prior that is still proper.
2.7 Experimental Results
Table 2.2 summarizes the test RMSE values obtained on the two data sets under the different models
and inference algorithms considered. The subsections that follow describe these results in detail. To
summarize our results, we find:
• The variational algorithm tends to overfit;
• The degree to which the variational algorithm overfits depends on the choice of hyperparameters,
Table 2.2: Overall test error rates on the MovieLens 1M and Epinions data sets under the models and inference algorithms considered. MAP estimate values are reported in the final column.

Data        Parameters    Inference    RMSE       MAP
MovieLens   Biases        Either       0.9101     0.9210
MovieLens   BPMF          Gibbs        0.8452     0.8880
MovieLens   BPMF          VI           0.8546†    0.8880
MovieLens   BCPMF         Gibbs        0.8407     0.8805
Epinions    Biases        Either       1.0460     1.1298
Epinions    BPMF          Gibbs        1.0455     1.1211
Epinions    BPMF          VI           1.0550†    1.1211
Epinions    BCPMF         Gibbs        1.0457*    1.1134

† The variational results are reported with an alternate choice of hyperparameters, as discussed in the analysis below.
* The sparsity of the Epinions data set limits the incremental benefit of side features for this data set.
specifically the Wishart scale matrix W0;
• The most significant gain in performance results from including side features to model correlational
influence.
2.7.1 Variational Inference
We first consider the performance of the variational algorithm on the model with no side features and
with the default choice of hyperparameters. Under this setup, the variational lower bound monotonically
increases and converges within the first 10 full updates of the parameter set, as illustrated in Figure 2.2
(a). However, the model clearly overfits on the test set and is unable to improve upon the MAP
estimate; see Figure 2.2 (b).
Further investigation shows that the MAP estimate obtained is probabilistically unlikely under the
prior selected. Specifically, the hyperparameters of the Gaussian-Wishart priors do not favour the set of
features obtained under MAP estimation. The MAP values do not suggest a Wishart scale matrix set
to the identity. We re-run variational inference using a modified set of hyperparameters. Letting U_i^{(0)} and V_j^{(0)} denote the features obtained in the MAP estimate, we set

W_0^{−1} = (1/2) diag( Σ_{i=1}^{N} U_i^{(0)} U_i^{(0)⊤} ) + (1/2) diag( Σ_{j=1}^{M} V_j^{(0)} V_j^{(0)⊤} ).    (2.32)
For our MAP estimate, this creates a scale matrix W0 with diagonal elements ranging in numerical
value from 38 to 82. This is an order of magnitude larger than the identity scale matrix that we see
will provide good performance for Gibbs sampling. We refer to this modified choice of prior as the
“MAP-driven” prior.
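Equation (2.32) is cheap to compute from the MAP features. A sketch (ours; it assumes `U0` and `V0` are N × d and M × d arrays holding the MAP feature vectors as rows):

```python
import numpy as np

def map_driven_W0(U0, V0):
    """Wishart scale matrix from MAP features, Eq. (2.32):
    W0^{-1} = (1/2) diag(sum_i U_i U_i^T) + (1/2) diag(sum_j V_j V_j^T)."""
    # The diagonal of sum_i U_i U_i^T is just the column-wise sum of squares.
    W0_inv = 0.5 * np.diag((U0 ** 2).sum(axis=0)) + 0.5 * np.diag((V0 ** 2).sum(axis=0))
    return np.linalg.inv(W0_inv)
```

Because the right-hand side of (2.32) is diagonal, W0 is simply the elementwise reciprocal of those diagonal entries.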
This choice of hyperparameters allows the variational algorithm to increase the lower bound for more
updates, tending to converge closer to 50 updates, see Figure 2.2 (c). In addition, the RMSE values
obtained on the test set are more favourable, see Figure 2.2 (d). However, the algorithm still overfits
slightly on the test set. The training lower bound starts to converge after approximately 50 full updates
of the variables, while the test error reaches a minimum of 0.8538 after 26 updates. By the time the
Figure 2.2: (left) Variational lower bound and (right) RMSE for the training and test sets for the PMF model with (top) default hyperparameters and (bottom) an alternative choice of hyperparameters. [Panels (a)-(d): lower bound and RMSE plotted against update number, with separate curves for the training and test sets; plot data omitted.]
lower bound has converged, this has marginally increased to 0.8546. The marginal increase suggests
stability in the test error, but the increase in test error is not desirable.
2.7.2 Gibbs Sampling
Holding the set of latent features fixed, inference via Gibbs sampling outperforms the variational mean
field approximation. This comparison is trivial when we consider the default choice for the Wishart scale
matrix of W0 = Id×d, since the variational methods overfit quickly.
For a more interesting comparison, we consider the performance of the Gibbs sampler with the
default prior and the variational algorithm with the “MAP-driven” prior. For simplicity, we focus the
discussion on the baseline model without side features. We select the Gibbs iteration and variational
update for which the overall test errors are near equal. From Figure 2.3 (a), this is the 29th update of
the parameters under variational inference, and the 29th iteration of the Gibbs sampler.
With W0 driven by the MAP estimate, the variational algorithm outperforms the Gibbs sampler in
overall test error for approximately the first 30 iterations. This performance gain is motivated by drops
Table 2.3: Test RMSE broken down by user frequency under Gibbs sampling for the models with and without side features.

Number of Ratings   BPMF     BCPMF    Relative Change (%)
≤ 25                0.9120   0.9035   +0.92
26−71               0.8674   0.8608   +0.76
72−146              0.8561   0.8494   +0.78
147−171             0.8508   0.8459   +0.58
172−301             0.8193   0.8155   +0.46
302−484             0.8275   0.8243   +0.38
485−829             0.8109   0.8107   +0.03
830−2,313           0.7475   0.7474   +0.01
in the first five iterations. After the first five iterations, the variational algorithm experiences diminishing
returns. However, the Gibbs sampler continues to drop at a similar rate beyond this point.
The two algorithms have approximately the same overall error rate on the test set, to within 0.0001,
after the 29th iteration / update. Figure 2.3 (b) illustrates the error of the two inference methods at
this point with respect to user frequency. The difference in the performance of the two algorithms is
on the order of 0.001 or less, except for the most frequent bin. This bin corresponds to the 10% most
frequent users, and the Gibbs sampler outperforms the variational algorithm.
The performance gap between the Gibbs sampler and the variational approximation for the most
frequent users suggests that the sampling distribution of the user feature vectors has noticeable variabil-
ity. Figure 2.3 (c) illustrates this by plotting the maximum variance of the d-dimensional user features
against the number of ratings the user has in the training set. These are representative values for both
inference algorithms after convergence. The vertical gap between the points for the Gibbs sampler and
the variational approximation indicates that the variational approximation tends to produce smaller
estimates of the variance than the Gibbs sampler. The vertical line represents users with 829 ratings in
the training set, which is the value above which users are included in the last bin in Figure 2.3 (b). The
persistent significant difference between these two beyond this point means that there is still variability
in the distribution of the user features that the Gibbs sampler is exploiting.
2.7.3 Side Features
We consider the predictive gain when including the side features in the model. Table 2.3 tabulates the
converged prediction error for the sampler in the model with only user and item features (i.e., BPMF)
and the model with user, item, and side features (i.e., BCPMF) with respect to user frequency. For
extremely common users (over 800 ratings), there is essentially no predictive gain. As expected, there
is a predictive gain for the least common users (with number of ratings on the order of 20− 30).
It is interesting to note that there is still a noticeable gain in test performance for moderately frequent
users, those with several hundred ratings. This gain highlights the importance of correlational influence
[15], which is being captured by the side features.
Figure 2.3: (a) Overall test error for the Gibbs sampler with default hyperparameters and variational algorithm with the MAP-driven hyperparameters for the PMF model without side features. (b) The test error with respect to user frequency. The two are nearly identical for users of all frequencies, with the exception of the most frequent users. In this bin, the Gibbs sampler outperforms the variational approximation. (c) Maximum variance for the user features plotted against the user frequency under both Gibbs sampling and the variational algorithm after convergence. [Panels: (a) test RMSE vs. iteration/update for Gibbs and VI; (b) test RMSE vs. number of ratings (upper bin edge); (c) maximum diagonal variance vs. number of ratings on a log-log scale; plot data omitted.]
Figure 2.4: Test RMSE for the Gibbs sampler with and without side features.
[Panel (a): test RMSE vs. sample number, with curves for the models with and without side features; plot data omitted.]
2.8 Conclusion
In this chapter, we reviewed the baseline matrix factorization model for collaborative filtering. We
reviewed one constrained PMF model that was introduced to improve recommendation for cold start
users, and extended this to the Bayesian framework. We provided a comparison between Gibbs sampling
and variational inference, noting that variational inference requires precise tuning of the Gaussian-
Wishart priors for optimal performance and is also prone to overfitting. Based on this, we advocate the
further use of Monte Carlo methods for prediction in these models.
An analysis of the performance of the Gibbs sampler with respect to user frequency demonstrated
that the inclusion of side features offers predictive gains for even moderately common users, those with
several hundred ratings. This highlights the importance of modeling correlational influence in the rating
patterns.
It is worth noting, however, that the computation time required for sampling the side features is
substantial. Sampling a single side feature Wk requires considering the subset of the rating matrix
consisting of all users who rated a given item. That is, one must consider all users ui for which Iui,k = 1,
and all items each of these users rated. For globally popular items, this can be a substantial proportion
of the original data set. Future work can look at probabilistic ways to select items for which these side
features will be included. Subsequent chapters will look at alternative models for constraining features
that have less computational complexity.
Chapter 3

Precision Models for Matrix Factorization
The previous chapter dealt with constrained PMF models that shift the expected rating based on taste
similarity between users or rating similarity between items. In the probabilistic framework, these meth-
ods are shifting the first moment of the expectation for each entry in the rating matrix. Other work
has looked at shifting the precision for each entry in the rating matrix [17]. These extensions allow for
heteroskedastic extensions of Bayesian PMF. In general, these extensions modify the likelihood for the
rating to
(r_{i,j} | U_i, V_j) ∼ N(r_{i,j} | U_i^⊤ V_j, τ_{i,j}),    (3.1)

where τ_{i,j} : N × N → (0, ∞) is now a function of the users and items.
Previous work has explored different approaches for introducing this heteroskedastic variance. In
particular, one extension grounded in multiplicative user / item precision factors was proposed and
found to produce lower test RMSE than the vanilla constant variance model. In this chapter, we review
the previously proposed approaches, revisit the results, and show the gains are disproportionately coming
from the most common users in the recommendation system. That is, the performance gain for including
these precision factors is improving performance for only a few very frequent users.
The predicted ratings for the most frequent users are the most accurate. In addition, the most
frequent users are actively demonstrating they are engaged in the system. In light of these, they are
practically not the users that we need to be focusing on improving recommendations for. In addition,
we show that the least frequent users do not gain better test error performance by the introduction
of the heteroskedastic variance model. We identify an overfitting issue with the estimated precisions
that causes this increase in test RMSE, and highlight that this issue is present when using variational
methods for inference. This observation mirrors the overfitting problem for the variational algorithm in
Chapter 2.
To alleviate the overfitting issue, we propose a truncated precision model. Specifically, we

• outline the probabilistic framework for truncated precisions;
• derive the variational distribution under a mean field approximation, and the proper estimates of the truncated precisions;
• develop a Gibbs sampler for the truncated precisions;
• provide experimental results using data sets common in the literature;
• connect the limiting behaviour of the truncated precision model to the constant variance and non-constant variance models.

Figure 3.1: Bayesian Heteroskedastic Probabilistic Matrix Factorization with Gaussian-Wishart priors over the latent user, item, and side feature vectors. The user precision factors αi and item precision factors βj allow for non-constant variance in the observed preference ri,j. [Plate diagram: a user plate i = 1:N containing γi, Ui, αi with hyperparameters (µγ, λγ), (µU, ΛU), (aU, bU); an item plate j = 1:M containing ηj, Vj, βj with hyperparameters (µη, λη), (µV, ΛV), (aV, bV); a side-feature plate k = 1:N containing Wk with hyperparameters (µW, ΛW); shared priors (µ0, β0), (W0, ν0); and the global precision τ with prior (aτ, bτ); all feeding the observed rating ri,j.]
3.1 Existing Noise Models for BPMF
Work in the literature has looked at exploring non-Gaussianity through combinations of two model
variations. The first is by replacing the Gaussian priors for the features Ui and Vj with Student-t priors.
Equivalently, this approach places joint distributions on the latent features and a precision factor

(U_i, α_i) ∼ N(U_i | µ_U, α_i Λ_U) G(α_i | a_U/2, b_U/2)
(V_j, β_j) ∼ N(V_j | µ_V, β_j Λ_V) G(β_j | a_V/2, b_V/2).    (3.2)

Here, X ∼ G(x | a, b) denotes a random variable with the Gamma distribution parametrized with shape a and rate b. This corresponds to the density

p(x | a, b) ∝ x^{a−1} e^{−bx}.    (3.3)
Analytically integrating out the αi and βj produces Student-t distributions for Ui and Vj:

(U_i | µ_U, Λ_U) ∼ ∫ N(U_i | µ_U, α_i Λ_U) G(α_i | a_U/2, b_U/2) dα_i
(V_j | µ_V, Λ_V) ∼ ∫ N(V_j | µ_V, β_j Λ_V) G(β_j | a_V/2, b_V/2) dβ_j.    (3.4)
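As a quick one-dimensional sanity check (our sketch, not part of the thesis experiments), mixing a zero-mean Gaussian over a Gamma-distributed precision reproduces the Student-t marginal of Equation (3.4): with shape a/2 and rate b/2, the marginal has ν = a degrees of freedom and variance b/(λ(a − 2)) for a > 2.

```python
import numpy as np

# x | alpha ~ N(0, (alpha * lam)^{-1}),  alpha ~ Gamma(a/2, rate b/2)
# => marginally x is Student-t with nu = a degrees of freedom and
#    variance b / (lam * (a - 2)) when a > 2.
rng = np.random.default_rng(1)
a, b, lam = 6.0, 6.0, 1.0
alpha = rng.gamma(shape=a / 2, scale=2.0 / b, size=200_000)  # numpy takes scale = 1/rate
x = rng.normal(0.0, (alpha * lam) ** -0.5)
# The empirical variance of x should be close to b / (lam * (a - 2)) = 1.5.
```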
Alternatively, multiplicative models for the precision on the likelihood term have been considered,

(r_{i,j} | U_i, V_j, α_i, β_j, τ) ∼ N(r_{i,j} | U_i^⊤ V_j, α_i β_j τ)
(α_i | a_U, b_U) ∼ G(α_i | a_U/2, b_U/2)
(β_j | a_V, b_V) ∼ G(β_j | a_V/2, b_V/2).    (3.5)
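Generating a rating under the multiplicative model of Equation (3.5) is a single heteroskedastic Gaussian draw. A sketch (ours; the names are illustrative):

```python
import numpy as np

def sample_rating(U_i, V_j, alpha_i, beta_j, tau, rng):
    """Draw r_ij ~ N(U_i^T V_j, (alpha_i * beta_j * tau)^{-1}), Eq. (3.5).
    The precision is multiplicative in the user factor, item factor, and global tau."""
    mean = float(U_i @ V_j)
    sd = (alpha_i * beta_j * tau) ** -0.5  # standard deviation = precision^{-1/2}
    return rng.normal(mean, sd)
```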
The literature previously tested the suitability of these extensions with a variational mean field
approximation. One variational approximation considered the independence assumptions

q(U_{1:N}, α_{1:N}) = ∏_{i=1}^{N} q(U_i, α_i),    q(V_{1:M}, β_{1:M}) = ∏_{j=1}^{M} q(V_j, β_j).    (3.6)
This choice of independence assumptions leads to Student-t distributions for the features Ui, Vj. Similarly, the form of the variational approximation for the multiplicative model was chosen as

q(U_{1:N}, V_{1:M}, α_{1:N}, β_{1:M}) = ∏_{i=1}^{N} q(U_i) q(α_i) ∏_{j=1}^{M} q(V_j) q(β_j).    (3.7)
This choice of independence assumptions leads to Gamma distributions for αi, βj. Experimental
results reported with these mean field approximations suggested that the user / item multiplicative
precision model improved overall performance, but that Student-t priors for the latent features did
not. Our work reviews the multiplicative model, outlines an overfitting issue, and proposes a truncated
precision model as an alternative. We demonstrate how the truncated precision model does not suffer
from the same overfitting issues.
For brevity, in the work that follows, we adopt the following shorthands:
1. The “constant”, or homoskedastic, model, meaning the vanilla PMF model where E[r_{i,j}] = U_i^⊤ V_j and Var[r_{i,j}] = τ^{−1};

2. The “robust”, or heteroskedastic, model, referring to the proposed multiplicative model referenced above and in Equation 3.5. Here, E[r_{i,j}] = U_i^⊤ V_j and Var[r_{i,j}] = (α_i β_j τ)^{−1};
3. The “truncated” model, to be outlined in this chapter.
3.2 Truncated Precisions
The choice of the Gamma distribution for the precision factors is computationally convenient. This
choice leads to relatively simple distributional forms for a variational mean field approximation, and is
conjugate in the case of Gibbs sampling. In practice, this choice is inappropriate. It is limited in that
the Gamma distribution is unbounded towards zero and infinity. In the context of the recommendation
system, a user (equiv. item) precision that is close to zero results in complete vagueness for the ratings
for that user (equiv. item), allowing the posterior distribution for the rating to be arbitrarily broad.
Similarly, a user (equiv. item) precision that is arbitrarily large results in delta functions for the posterior
distribution of the rating.
In the experimental results that follow in this chapter, we noticed this Gamma model for the precisions
posed an inference problem when direct minimization of the variational objective functions was required.
For instance, the variational lower bound under the mean field approximation contains a scaled sum of
squared errors from the likelihood
−(τ/2) Σ_{i=1}^{N} Σ_{j=1}^{M} I_{i,j} α_i β_j (r_{i,j} − E[r_{i,j}])^2.    (3.8)
The deterministic nature of variational inference may optimize this by arbitrarily shrinking some
user and item precisions to zero, while driving others arbitrarily large. The result is a decrease in the
overall error by optimizing for a subset of the user-item matrix.
A solution to this pathological behaviour is to bound the precisions to values suggested by the actual
data. Such truncation is commonly applied to different distributions, such as the Gaussian distribution.
Truncation of random variables appears often enough in practice that standard software packages contain
support for it [28].
In general, an unbounded distribution with density g_X(x) is truncated to (ℓ, u) by defining f_X(x) ∝ g_X(x) 1(ℓ < x < u). For the case of the Gamma precisions in Equation (3.5), the density becomes

f_X(x) ∝ x^{α−1} e^{−βx} 1(ℓ < x < u).    (3.9)
The introduction of the truncated Gamma distribution introduces two additional tuning parameters:
` and u. We show that these tuning parameters for the lower and upper bound can be chosen sensibly
from the data.
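For reference, the unnormalized log-density of Equation (3.9) with an explicit indicator can be sketched as follows (ours; `lo` and `hi` play the roles of ℓ and u, and x > 0 is assumed):

```python
import numpy as np

def trunc_gamma_logpdf_unnorm(x, alpha, beta, lo, hi):
    """Unnormalized log-density of Eq. (3.9):
    x^{alpha - 1} e^{-beta x} on (lo, hi), and -inf (zero density) outside.
    Assumes x > 0."""
    x = np.asarray(x, dtype=float)
    core = (alpha - 1.0) * np.log(x) - beta * x
    return np.where((x > lo) & (x < hi), core, -np.inf)
```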
3.3 Inference
The multiplicative models for Gamma noise and truncated Gamma noise are independent of the side
feature extension in Chapter 2. Consequently, the multiplicative models can be considered extensions
to the side feature model, or the vanilla PMF model. This leads to four possible models by selectively
including side features and selectively including precisions.
With respect to inference, all four models readily permit inference using either Monte Carlo methods
or variational inference. All four models are still conjugate in the Gibbs framework, allowing for efficient
Gibbs sampling. It can be shown that under an appropriately chosen mean field approximation, the
form of the variational distribution is still tractable. We outline in this section the necessary extensions
to inference from Chapter 2.
3.3.1 Gibbs Sampling
The choice of Gamma distribution as a prior for the multiplicative user and item precisions is conjugate
to the likelihood. In particular, the posterior for the user precisions is a Gamma,
(α_i | D) ∼ G(α_i | a_{U_i}, b_{U_i}), where

a_{U_i} = a_U + (1/2) Σ_{j=1}^{M} I_{i,j}
b_{U_i} = b_U + (τ/2) Σ_{j=1}^{M} I_{i,j} β_j (r_{i,j} − r̂_{i,j})^2,    (3.10)

with r̂_{i,j} denoting the model's predicted rating.
A similar expression for the item precisions holds by symmetry:

(β_j | D) ∼ G(β_j | a_{V_j}, b_{V_j}), where

a_{V_j} = a_V + (1/2) Σ_{i=1}^{N} I_{i,j}
b_{V_j} = b_V + (τ/2) Σ_{i=1}^{N} I_{i,j} α_i (r_{i,j} − r̂_{i,j})^2.    (3.11)
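The conjugate updates in Equations (3.10)-(3.11) are sums over the observed entries of one row (or column). A sketch for the user side (ours; the array names, and `R_hat` for the current predicted ratings, are our conventions):

```python
import numpy as np

def user_precision_posterior(i, a_U, b_U, tau, I, beta_item, R, R_hat):
    """Posterior Gamma parameters (a_Ui, b_Ui) for user precision alpha_i, Eq. (3.10).
    I is the N x M observation indicator; beta_item holds the item precisions."""
    obs = I[i] == 1
    a_post = a_U + 0.5 * obs.sum()
    b_post = b_U + 0.5 * tau * np.sum(beta_item[obs] * (R[i, obs] - R_hat[i, obs]) ** 2)
    return a_post, b_post
```

Sampling `alpha_i` is then a single Gamma draw, e.g. `rng.gamma(a_post, 1.0 / b_post)` (NumPy parametrizes by scale, the reciprocal of the rate); the item update is symmetric.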
This parametrization continues to hold if the priors for the precisions are truncated Gamma distri-
butions. Let X be an arbitrary precision random variable with truncated Gamma density. Then X has
density
f_X(x) ∝ x^{α−1} e^{−βx} 1(x ∈ (a, b)).    (3.12)
However, the sampling algorithm requires the introduction of an auxiliary variable [3]. Introduce the auxiliary variable Y ∼ Unif(0, e^{−βx}). With this auxiliary variable, a conditional density for the precision can be defined. Continuing to let X denote an arbitrary precision, we have

f_{X|Y}(x | y) ∝ x^{α−1} 1(x ∈ (a, min{b, −log(y)/β})).    (3.13)
In the context of Monte Carlo inference, only −log(y)/β is needed for sampling. Its distribution satisfies, for u ≥ x,

P(−(1/β) log y > u) = P(log y < −βu) = P(y < e^{−βu}) = e^{−βu} / e^{−βx} = e^{−β(u−x)}.    (3.14)

This expression determines the CDF of the auxiliary variable −log(y)/β, so the inverse CDF method can be used for sampling. Explicitly, let p ∈ (0, 1); then

1 − p = 1 − F(u) = e^{−β(u−x)}  ⟹  u = x − (1/β) log(1 − p).    (3.15)
With a sample of −log(y)/β, the inverse CDF method can subsequently be used to generate a sample of the precision random variable X. Define M = min{b, −log(y)/β} as the minimum of the upper bound b and the sampled auxiliary variable. The required normalizing constant is defined by

Z^{−1} ≡ (β^α / Γ(α)) ∫_a^M x^{α−1} dx = (β^α / Γ(α)) (M^α − a^α)/α.    (3.16)
Therefore, the inverse CDF technique will yield the needed precision sample y by satisfying the equation

p = ∫_a^y Z (β^α / Γ(α)) x^{α−1} dx = (1/(M^α − a^α)) [x^α]_{x=a}^{y} = (y^α − a^α)/(M^α − a^α).    (3.17)

The solution to this equation gives our sample, y, as

y = (p(M^α − a^α) + a^α)^{1/α} = (pM^α + (1 − p)a^α)^{1/α}.    (3.18)
This solution poses a numerical issue. From Equations (3.10)-(3.11), the shape parameter α equals a_{U_i} or a_{V_j}, which grows with the number of observed values in the row or column of the rating matrix. These can easily be large, leading to numerical overflow. The solution is to compute Equation (3.18) on the log scale and rewrite:

log y = (1/α) log(pM^α + (1 − p)a^α)
      = (1/α) log( e^{log p + α log M} + e^{log(1−p) + α log a} ).    (3.19)
The log-sum-exp trick can now be used to sample log y, and in turn y.
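Putting Equations (3.13)-(3.19) together, one auxiliary-variable update can be sketched as follows (our illustration, assuming a lower truncation bound a > 0; the names are not from any released code):

```python
import numpy as np

def trunc_gamma_slice_step(x_prev, alpha, beta, a, b, rng):
    """One auxiliary-variable update for X ~ Gamma(alpha, beta) truncated to (a, b).

    Step 1: y ~ Unif(0, exp(-beta * x_prev)); equivalently
            -log(y)/beta = x_prev - log(u)/beta with u ~ Unif(0, 1).
    Step 2: x | y has density proportional to x^{alpha - 1} on
            (a, min(b, -log(y)/beta)), sampled by the inverse CDF
            computed on the log scale (Eqs. 3.18-3.19)."""
    t = x_prev - np.log(rng.uniform()) / beta
    M = min(b, t)
    p = rng.uniform()
    # log x = (1/alpha) * log(p * M^alpha + (1 - p) * a^alpha), via log-sum-exp
    terms = np.array([np.log(p) + alpha * np.log(M), np.log1p(-p) + alpha * np.log(a)])
    m = terms.max()
    log_x = (m + np.log(np.exp(terms - m).sum())) / alpha
    return float(np.exp(log_x))
```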
The heteroskedastic precision model re-weights the terms in the previous model, leading to different
posterior sampling distributions. These are outlined in Appendix C and Appendix B.
3.3.2 Variational Inference
We extend the variational mean field approximation used in Section 2.4.2. Specifically, we modify
Equation 2.23 to include the αi and βj , retaining pairwise independence,
Q(γ_{1:N}, U_{1:N}, α_{1:N}, η_{1:M}, V_{1:M}, β_{1:M}, W_{1:M}, τ, µ_U, Λ_U, µ_V, Λ_V, µ_W, Λ_W | D)
= Q(τ) ∏_{i=1}^{N} Q(γ_i) Q(U_i) Q(α_i) ∏_{j=1}^{M} Q(η_j) Q(V_j) Q(β_j) ∏_{k=1}^{M} Q(W_k)
  × Q(µ_U, Λ_U) Q(µ_V, Λ_V) Q(µ_W, Λ_W).    (3.20)
Under this mean field approximation, it can be shown that Q(αi) and Q(βj) are also Gamma distributions if αi and βj are Gamma distributions according to p(θ, D). Equivalently, Q(αi) and Q(βj) are truncated Gamma distributions if αi and βj are truncated Gamma distributions according to p(θ, D).
MAP estimation for the truncated Gamma distribution is still available in closed form. Consider the case where a precision random variable X has a Gamma distribution with shape α and rate β, truncated to the interval (a, b). The MAP estimate is the intuitive estimate

max( a, min( (α − 1)/β, b ) ).    (3.21)
This corresponds to the usual MAP estimate, truncated to the boundaries of the truncated Gamma
distribution.
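In code, Equation (3.21) is a one-line clip (a sketch; here alpha, beta are the Gamma shape and rate, and (a, b) the truncation interval):

```python
def truncated_gamma_map(alpha, beta, a, b):
    """MAP estimate of a Gamma(alpha, beta) truncated to [a, b], Eq. (3.21):
    the unconstrained mode (alpha - 1)/beta, clipped to the interval."""
    return max(a, min((alpha - 1.0) / beta, b))
```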
The proof is as follows. The derivative of the log-density, (α − 1)/x − β, has its unique zero at the critical point x∗ = (α − 1)/β. If x∗ ∈ [a, b], we are done. If not, then we need to check the boundaries.

Consider the case x∗ > b (the case x∗ < a is similar). We need to show that f(b) > f(x) for every x ∈ [a, b) for b to be the MAP estimate. We have

f(b)/f(x) = (b^{α−1} e^{−βb}) / (x^{α−1} e^{−βx}) = (b/x)^{α−1} e^{−β(b−x)}.    (3.22)

Taking logarithms and applying the mean value theorem to log(b/x) = log b − log x gives, for some ξ ∈ (x, b),

log[ f(b)/f(x) ] = (α − 1)(b − x)/ξ − β(b − x) = (b − x) [ (α − 1)/ξ − β ].    (3.23)

Since ξ < b < x∗ = (α − 1)/β, we have (α − 1)/ξ > (α − 1)/x∗ = β, so the right-hand side is strictly positive. Hence f(b) > f(x) for all x ∈ [a, b); that is, f is increasing on [a, b] and the MAP estimate is the boundary value b.
3.4 Prediction
Prediction for these models with truncated precisions is performed identically to the latent feature models
in Chapter 2. For brevity, the reader is referred to Section 2.5 for the details.
3.5 Experimental Setup
We extend our experimental results from Chapter 2, and so we experiment on the same data sets: the
MovieLens 1M data set and the Epinions data set. We make use of the same tuning parameters and the
same MAP estimates discussed in Section 2.6. All precisions were initialized to 1, corresponding to the
vanilla PMF model.
3.6 Experimental Results
Table 3.1 summarizes the test RMSE values obtained on the two data sets under the different models
and inference algorithms considered. The subsections that follow describe these results in detail. To
summarize our results, we find:
• The variational algorithm tends to overfit;
• Modeling precisions can improve performance, though it may be necessary to bound the precisions
for the deterministic approximations given by variational inference to be sensible;
• The most significant gain in performance results from including side features to model correlational
influence. When these are included, there is no clear benefit of including precisions.
3.6.1 Variational inference
Relative to the homoskedastic PMF models of Chapter 2, assessing the convergence of the robust pre-
cision model is more difficult. The inclusion of user and item multiplicative precision factors allows the
variational algorithm to arbitrarily weight the contribution to the complete log-likelihood from different
rows and columns in accordance with the predictive accuracy of the model. To be concrete, rows and
columns with low accuracy can be down-weighted with small precisions. This is discussed further in
Section 3.6.2 when we compare the predictive performance of the variational algorithm to the predictive
performance of the Gibbs sampler.
To outline, the variational lower bound computed on the training set increases almost linearly, with the training error smoothly dropping. The test error reaches a minimum of 0.8570 after nine updates. The variational lower bound computed on the test set, however, continues to increase for 13 updates before it begins to decrease; Figure 3.2 illustrates this behaviour. The lower bound on the test set is able to keep increasing between the ninth and 13th updates even while the test error increases, because the squared error on the test set is weighted in the lower bound by the user / item precisions.
Table 3.1: Overall test error rates on the (a) MovieLens 1M and (b) Epinions data sets under the precision and inference models considered. MAP estimate values are given in the MAP column.

(a) MovieLens 1M

Model        Inference   Constant  Robust  Truncated (n = 2)  MAP
No Features  Gibbs / VI  0.9101                               0.9210
BPMF         Gibbs       0.8452    0.8448  0.8475             0.8888
BPMF         VI†         0.8546    0.8570  0.8521
BCPMF        Gibbs       0.8407    0.8407                     0.8805

(b) Epinions

Model  Inference   Constant  MAP
BPMF   Gibbs / VI  1.0460    1.1298
BCPMF  Gibbs       1.0455    1.1211
BCPMF  VI          1.0550
Side   Gibbs∗      1.0457    1.1134

† These results are reported with an alternate choice of hyperparameters, as discussed in the analysis below.
∗ The sparsity of the Epinions data set limits the incremental benefit of side features for this data set.
Figure 3.2: (left) Variational lower bound and (right) RMSE for the training and test sets for the robust precision model with an alternative choice of hyperparameters. [Panels: (a) lower bound vs. update; (b) RMSE vs. update; curves for the training and test sets; plot data omitted.]
3.6.2 Gibbs Sampling
The Gibbs sampler avoids the pathological precision issue by virtue of sampling values for the precisions
from the posterior distribution for the precisions. This permits us to assess if modeling precisions
improves performance of users for a given frequency (cold start, rare, frequent, etc). To determine
this, we examine the final converged error rate on a user frequency basis for a model with and without
precisions, holding all else equal. For this comparison, we select the model with side features, and look
at the error under Gibbs sampling. The numerical results are in Table 3.2.
There are only minor departures from equality in the test error for moderately frequent users (the
third and fourth bin), and these departures from equality are in the fourth decimal place of the error.
This represents less than a 1% relative change in predictive performance. The largest difference occurs
for the most frequent users. This is a relative gain of 0.42%. However, these users have predictions
that are already well calibrated relative to the rest of the user set. In addition, the change relative
to the model without precisions is not related to the frequency of the user. While one bin of users
shows an improvement over the vanilla PMF model, the next may in fact show a decrease in predictive
performance. Based on these results, we conclude that modeling precisions does not tend to significantly
favour users of any given frequency in the test set.
The near equality of the constant and robust precision models could be a result of a near-constant
posterior for the precisions. To check this, we examine traceplots for the user and item precisions, as
well as the histogram of the precisions for a sample at convergence. The distribution of both the user
and item precisions was clearly non-constant and right skewed. Figure 3.3 (a) displays histograms of the
user and item precisions from a Gibbs sampler after convergence, indicating this skewness. Figure 3.3
(b) displays traceplots for a sample of the user and item precisions, showing the sampler mixed well over
a range of values. Both of these indicate that the sampler was exploiting the robust precision model.
Therefore, the near equality in predictive performance is not a result of model degeneracy.
It has been noted that the introduction of precisions prompts the variational algorithm to drive some
precisions arbitrarily small and arbitrarily large. The histograms in Figure 3.3 (a) and traceplots in
Figure 3.3 (b) indicate that the Gibbs sampler does not suffer from this limitation. For comparison,
empirical CDF curves are plotted in Figure 3.3 (c) for the user and item precisions under the variational
algorithm (left panel) and the Gibbs sampler (right panel) after convergence. The two panels are similar,
though the CDFs for the variational algorithm have been plotted on a log scale. In other words, the
converged values of the precisions under the variational algorithm are exponentially larger.
3.6.3 Truncated Precisions
It was noted that the introduction of the precisions leads to pathological results with variational inference.
We explored if bounding the precisions would alleviate this issue. Using the truncated approach discussed
in Section 3.2, we ran experiments bounding the precisions to values sensible for a scale constrained to
the interval [1, 5].
We found that bounds of (1/2, 2) produced results for MovieLens that outperformed both the constant
and robust precision model. These values also delayed the overfitting in the variational approximation
for several updates. Overfitting for this model starts after 20 full updates of the parameters.
Figure 3.4 (a) shows the overall test error of the variational algorithm under the constant, robust,
and truncated precision model with the bounds of (1/2, 2). These three models have similar behaviour
Figure 3.3: (a) Histograms of the user and item precisions from a Gibbs sample after convergence. (b) Traceplots of the user and item precisions. (c) CDF curves of the converged user / item precisions for the model under (left) Variational and (right) Gibbs. Note that the curves have similar shape, but that the variational CDF is plotted on a log scale for comparison. [Panels: (a) precision histograms (precision vs. frequency) for users and items; (b) traceplots of user and item precision values over 500 samples; (c) empirical CDF curves under the variational algorithm (log scale) and Gibbs; plot data omitted.]
in the initial set of parameter updates. Differences appear after the 8th update. At this point, the test error under the robust precision model begins to rise as the variational algorithm overfits, while under the truncated precision model it continues to drop for 3−4 additional iterations. The rate of increase for the two is approximately the same
until nearly the 40th parameter update, at which point the test RMSE under the truncated precision
model tends to increase at a faster rate.
Figure 3.4 (b) shows the test error by user frequency for the variational algorithm under the two
models after 50 full parameter updates. This is the point that the algorithm has begun to overfit in
the truncated and the robust model, and has stabilized for the constant precision model. This graph
shows the most significant difference in test error is in the most frequent users. The constant precision
model outperforms either heteroskedastic model (robust, truncated) by a difference of at least 0.1 in test
RMSE, a relative improvement in RMSE of 12%. The error rates are approximately the same in other
user bins, with the constant precision model performing slightly worse for moderately frequent users.
This demonstrates that the overfitting issue for the heteroskedastic models stems from some of the most
frequent users in the system.
A sequence of bounds can be formed as (ℓ, u) = (1/n, n) for integer n. With respect to this sequence, the truncated model has the constant model as a limiting case as n → 1 and the robust model as n → ∞. This raises the question of how inference for the truncated precision model changes with n.
Figure 3.4 (c) plots the test error of the variational algorithm for several values of n along with the
constant and robust precision model. As expected, larger values of n are similar to the robust precision
curve, while smaller values are similar to the constant precision curve.
We draw attention to the curve for n = 2, corresponding to the precision bounds (1/2, 2). For these
bounds, the variational algorithm obtains a significantly lower error rate than the other choices of n, as
well as for the constant precision model. The consistent tendency for the truncated model to overfit early
in learning for larger values of n suggests that the truncation value has little influence on performance
after a certain point. However, the improvement for the n = 2 case over the constant precision model
does suggest there is value in modeling heteroskedastic precision among different users and different
items.
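As a rough sketch (not code from the thesis), the truncation can be viewed as clamping a precision value into the interval (1/n, n); `truncated_update` below is a hypothetical helper illustrating the two limiting cases.

```python
import numpy as np

def truncated_update(alpha, n):
    """Clamp a precision value into the truncation bounds (1/n, n).

    As n -> 1, every precision collapses to 1 (the constant model);
    as n -> infinity, alpha passes through unchanged (the robust model).
    """
    return float(np.clip(alpha, 1.0 / n, n))
```

For example, a runaway precision of 51 under the bounds (1/2, 2) would be clamped to 2.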
3.6.4 Overfitting in the Robust Model
It was noted that the robust precision model overfits on the test set. It is important to ask why
this overfitting occurs. To investigate this, recall that the updates for the user (respectively, item)
precision parameters are inversely proportional to the error for that user (respectively item), scaled by
the precisions. In particular, the update rule for the user precision is
\[
\alpha_i^{-1} \propto \sum_{j=1}^{M} I_{i,j}\, \tau \beta_j (r_{i,j} - \hat{r}_{i,j})^2. \tag{3.27}
\]
Equation 3.27 suggests that the inverse precision (i.e., the variance) should have a 1/x^2 relationship with the (scaled) user error. A scatterplot of user precisions versus user error will show if this relationship holds in both training and testing, and a similar plot will do the same for items. If the model is generalizing well, the 1/x^2 relationship should be clear in both training and testing.
Figure 3.5 (a) plots the user training error and test error against the user precision. The desired 1/x^2 relationship is visually clear in the training set, with points tightly scattered along a 1/x^2 curve.
Figure 3.4: (a) Overall test error and (b) test error binned by user frequency for the variational algorithmunder the robust and truncated precision model (with precision bounds (1/2, 2)). (c) Overall test errorfor the variational algorithm under the robust, constant, and truncated precision models for differentchoices of bounds.
This is not visually clear in the test set. That is, the relationship dictated by the model is clear in the
training set, but is weak to non-existent in the test set.
This overfitting can be quantified through Pearson correlation. A log-transformation of Equa-
tion (3.27) yields:
\[
-\log(\alpha_i) \propto \log\left( \sum_{j=1}^{M} I_{i,j}\, \tau \beta_j (r_{i,j} - \hat{r}_{i,j})^2 \right). \tag{3.28}
\]
Equation (3.28) suggests there should be a linear correlation between the set of (negative) log preci-
sions and log error rates on a user / item level. Proper inference will have this linear correlation strong
in the training set, and generalization will have this linear correlation strong in the test set. Conversely,
overfitting will be indicated by high correlations in the training set, with much lower correlations in the test set.
We compute these linear correlations for the set of users and items in both the training and test sets,
and plot these correlations over iterations in Figure 3.5 (b). The correlations computed on the training
set are typically large and stable over iterations. The user correlation is consistently above 0.94, while
the item correlation is consistently above 0.70. They are not exactly 1 since the updates are sequential,
while the correlations are computed after a full parameter update. These values indicate that training
is proceeding as expected under the model.
When the same values are computed for the user and item errors in the test set, significantly smaller
values are obtained initially – less than 0.7 for the users and less than 0.4 for the items. This reflects an
initial drop of over 20% between the test and training sets. In addition, these values are not consistent
over updates, unlike the values in the training set. Indeed, the correlations under the test set decrease
monotonically over parameter updates. When the model overfits in the 9th parameter update, the
correlation in the test set for the items has dropped from 0.3440 to 0.2422, while the correlation in the
test set for the users has dropped from 0.6391 to 0.5965. The large difference between the training and
test set, both in initial values and in the magnitude of the drop over iterations, is further evidence that
the robust model is overfitting and not generalizing to the test set.
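As an illustration, this diagnostic can be computed with a few lines of NumPy; this is a hypothetical sketch, with `alphas` and `scaled_sq_errs` standing in for the fitted precisions and scaled squared errors of the users (or items).

```python
import numpy as np

def precision_error_correlation(alphas, scaled_sq_errs):
    """Pearson correlation between -log(precision) and log(scaled squared
    error), following Equation (3.28). Values near 1 indicate the expected
    relationship holds; a large train/test gap in this value signals
    overfitting."""
    return float(np.corrcoef(-np.log(alphas), np.log(scaled_sq_errs))[0, 1])
```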
Again, this overfitting with the fully robust model was observed only with variational inference. We
ran similar truncated precision experiments with the Gibbs sampler. We did not find that inference with the Gibbs sampler was consistently improved, or indeed consistently impacted at all, by using truncated precisions relative to the robust precision model. This is not unexpected
given that the histograms and traceplots of the precisions in Figure 3.3 (a)-(b) indicate the precisions
remain at O(1) values under the Gibbs sampler.
The distributional forms of the precisions are identical for the Gibbs sampler and the variational
algorithm. The only difference is in the update. Under Gibbs sampling, the scaled training error is the
conditional mean, from which a sample is drawn. Under the variational algorithm, the scaled training
error is the conditional mean, and is imputed as the update. With respect to the overfitting problem,
the extra noise introduced by the Gibbs samples appears to be preventing overfitting with the precisions.
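The contrast can be sketched as follows; this is illustrative Python rather than the thesis implementation, and the Gamma shape/rate hyperparameters `a0`, `b0` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def precision_update(scaled_sq_err, n_obs, method, a0=1.0, b0=1.0):
    """Both algorithms share the same Gamma conditional for a precision.

    Gibbs draws a sample from the conditional, injecting noise; the
    variational update deterministically imputes the conditional mean.
    """
    shape = a0 + 0.5 * n_obs
    rate = b0 + 0.5 * scaled_sq_err
    if method == "gibbs":
        return rng.gamma(shape, 1.0 / rate)  # stochastic draw
    return shape / rate                      # conditional mean
```

The extra noise from the Gibbs draws is what keeps the precisions from locking onto extreme values.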
Figure 3.5: (a) Scatterplot of user/item level errors and precisions after the third full parameter update in the variational algorithm. (b) Correlation between (log transformed) user/item level errors and precisions over parameter updates.
3.6.5 Side Features with Precisions
When side features are included in the model, the incremental gain from using a robust precision model
is lost; see Table 3.1. The Gibbs sampler converges to the same RMSE value for both the constant and
robust precision model.
A closer examination of the test error over iterations shows subtle differences in how the common
converged value is obtained. In Figure 3.6, we observe the constant model sees a more substantial drop
in the first 50 iterations, after which the incremental gain is minor. The robust model takes longer to
converge, outperforming the constant model after approximately 80 iterations.
We wish to compare the result of including side features in the matrix factorization model, Table 2.3
in Chapter 2, to the result of including precisions in the model, Table 3.2. We see that the largest
relative improvement of 0.42% by the precisions for the top 10% of users is comparable to the gains
made by the inclusion of side features for the first five bins, corresponding to half of the MovieLens test
set. This highlights the importance of a model that makes accurate predictions for rare users. Significant
gains overall may be the result of gains for a small selection of users, as is the case for the inclusion of
precisions in the model. Alternatively, significant gains overall may be the result of gains for a larger
selection of users, as is the case for the inclusion of side features in the model.
Further, the relative changes in Table 3.2 are not uniformly an improvement for the model with
precisions over the model without. For some sets of users, such as the most frequent, the model with
precisions improves over the baseline. For other sets of users, such as those with approximately 200
ratings, the model with precisions performs worse than the baseline. This highlights that the inclusion
of precisions is not necessarily benefiting frequent or infrequent users. Indeed, it is not clear how
precisions are influencing predictive power in any systematic way.
Table 3.2: Test RMSE broken down by user frequency under Gibbs sampling for the models with andwithout user / item precisions when side features are included.
Number of Ratings   Constant   Robust   Relative Change (%)
≤ 25                0.9035     0.9031   +0.05
26 − 71             0.8608     0.8615   −0.08
72 − 146            0.8494     0.8478   +0.19
147 − 171           0.8459     0.8449   +0.11
172 − 301           0.8155     0.8165   −0.12
302 − 484           0.8243     0.8250   −0.08
485 − 829           0.8107     0.8098   +0.10
830 − 2,313         0.7474     0.7443   +0.42
Figure 3.6: Test RMSE for the Gibbs sampler for the models with side features under the constant androbust precision model.
3.7 Conclusion
The variational algorithm exhibited pathological behaviour with respect to user and item precisions. In
optimizing the variational lower bound, the algorithm drove a subset of precisions to arbitrarily small
values, and another subset to arbitrarily large values. Based on this, we investigated if bounding the
precisions had any influence on predictive performance. We replaced the Gamma priors by truncated
Gamma priors, and compared the performance of the variational algorithm for different bounds. In
changing the precision bounds monotonically, a non-monotonic change in the performance of the trun-
cated models was observed over the constant model. It was noted that some bounds do outperform both
the constant and the robust precision models. Further work could investigate automated ways to select
the precision bounds.
We noted that the gain from modeling user and item level precision is not significant when we move
to a model class that includes side features. The same predictive performance is obtained overall. In
addition, there is near equality in predictive performance within sets of users of different frequency.
Chapter 4
Meta-Constrained Latent User Features
4.1 Introduction
The low rank matrix assumption for collaborative filtering starts with a (user, item) matrix of preferences R ∈ R^{N×M} and factorizes it as the product of two low rank matrices R = U^⊤V, where U ∈ R^{d×N} and V ∈ R^{d×M}. Each column of U is the latent feature of a user, each column of V the latent feature of an item, and the r_{i,j} entry can be reconstructed as the inner product U_i^⊤ V_j, where U_i is the ith column of U and V_j is the jth column of V. The problem of estimating U and V can be approached as an incomplete SVD problem.
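The factorization can be sketched in a few lines of NumPy; the sizes here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, M = 2, 4, 5                 # latent dimension, users, items
U = rng.normal(size=(d, N))       # user features as columns
V = rng.normal(size=(d, M))       # item features as columns
R_hat = U.T @ V                   # full reconstructed rating matrix

# Every entry is the inner product of the matching user and item columns,
# and the reconstruction has rank at most d.
assert np.isclose(R_hat[1, 3], U[:, 1] @ V[:, 3])
assert np.linalg.matrix_rank(R_hat) <= d
```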
In the probabilistic framework, ri,j is modeled as a Gaussian with mean U>i Vj , and each column of
U, V has an independent Gaussian prior placed on it. The mean is therefore a linear combination of
the inner product of user and item feature components. The strength of this method therefore depends
on the ability of the latent feature model to capture concrete features of the items. Prior work has
shown that explicitly incorporating auxiliary information about individuals can be very predictive of
personality traits [16]. As personal taste in movies, items, and social networks are aspects of personality,
this auxiliary information may be predictive in the collaborative filtering context.
In this chapter, we consider using auxiliary information on users to introduce constraints between
the user features. The constraint differs from the vanilla constrained probabilistic matrix factorization
model reviewed and extended in Chapter 2 [32]. The previous extension was based on additional latent
features, one for each item, while our method is based on incorporating select auxiliary information
on users. The difference is illustrated by considering the relationship of the side features to the user
features and to the rating. In the vanilla constrained PMF model, the user features are independent
of the side features in the prior, and the rating is dependent on the side features in the prior. In our
proposed model, the user features are dependent on the side / meta features in the prior, and the rating
is conditionally independent of the side features given the user features. A graphical model comparing
the proposed model to the vanilla constrained PMF is in Figure 4.1. Solid lines and nodes are present in
both models, dashed lines / nodes are present in the vanilla constrained PMF model, and dotted lines
/ nodes are included in the meta constrained model.
Relative to the vanilla constrained PMF model, our proposed model has lower time and space complexity.
Figure 4.1: Graphical model comparing the proposed meta-constrained model to the vanilla constrained PMF model.
For a low rank matrix factorization model in d dimensions with N users, M items, and dm
auxiliary features, the vanilla constrained PMF model has (N + 2M)d parameters to sample, while the proposed model has (N + M + dm)d parameters, where dm is the number of user meta features included in modeling. We would expect that M ≫ dm, since there are typically many more items in the system
than auxiliary user features that would be relevant for modeling purposes. In addition, estimation of the
auxiliary user features in our proposed model admits a simpler form than the side features in the vanilla
constrained PMF model. In particular, the posterior sampling distribution for Gibbs does not require
pooling ratings over both users and items, and matrix calculus admits a simple gradient to update all
user meta features in parallel.
4.2 Exploratory Analysis
Prior to developing a probabilistic recommendation model to incorporate personal user attributes, it is
prudent to determine if such personal attributes may actually be informative in the recommendation
framework. To answer this, we looked at the MovieLens 1M data set. This data set has frequently been
used as an experimental data set for recommendation systems, including matrix factorization models.
This data set includes several personal attributes on users, including age, gender, occupation, and zip
code. In the provided data set, both age and occupation are categorized. Age is binned into intervals,
and occupation is binned into groups. See Section 4.5 for more details on the auxiliary user information.
Much work in the literature has ignored these labels on users, focusing on purely latent-based matrix factorization models.
To investigate if user attributes may be beneficial to incorporate in the probabilistic recommendation
framework, we learned a baseline matrix factorization model r̂_{i,j} = U_i^⊤ V_j consisting of user and item
features only. Learning was achieved through batch gradient descent on γi, ηj , Ui, Vj on a sum-of-squared
error term with quadratic regularizers on the features and offsets:
\[
\ell = \sum_{i=1}^{N} \sum_{j=1}^{M} I_{i,j}(r_{i,j} - \hat{r}_{i,j})^2
+ \lambda_U \sum_{i=1}^{N} U_i^\top U_i
+ \lambda_\gamma \sum_{i=1}^{N} \gamma_i^2
+ \lambda_V \sum_{j=1}^{M} V_j^\top V_j
+ \lambda_\eta \sum_{j=1}^{M} \eta_j^2. \tag{4.1}
\]
Two dimensional latent features Ui, Vj were used so that the resulting MAP estimate could be easily
visualized.
Given the two-dimensional MAP estimate for the latent features, we extracted meta information on
the users corresponding to age, gender, and occupation. This yielded for each user i a binary label vector fi ∈ {0, 1}^{dm}, where dm is the amount of meta information extracted. Under the hypothesis that these
labels were informative, we would expect the user features Ui to have different distributions for different
labels fi. In the probabilistic framework, this can be formalized as follows. Let Zi ∈ N be a cluster label
for user i. Each assignment Zi = zi uniquely corresponds to a configuration for a binary label vector fi.
The distribution of Ui is then multivariate Gaussian, conditional on Zi,
(Ui | Zi) ∼ N (Ui | µU (Zi),ΛU (Zi)). (4.2)
Equivalently, since Zi is uniquely determined by fi
(Ui | Zi) ∼ N (Ui | µU (fi),ΛU (fi)). (4.3)
To qualitatively explore this, we group the latent user features from the MAP estimate based on
different configurations of the label vectors fi. For each group, we obtained maximum likelihood estimates of the mean and precision matrix. In the MovieLens data set, there are approximately 250 different configurations of fi among the 6,040 users. This yields estimates of the distribution we define
in Equation (4.3). We plot two typical subsets of these estimates in Figure 4.2. From these plots, we
can see there is qualitative evidence to suggest that both the mean and the precision for the user’s latent
feature can vary depending on the configuration fi. In the context of recommendation systems, different
users can have clusters of preferences, and hence should have clusters of latent features, based on user
meta information.
For d-dimensional latent feature vectors, the mean requires O(d) parameters and the precision matrix
requires O(d2) parameters. Many configurations fi can be rare, as they uniquely determine a specific
combination of user traits. For instance, “males aged 20-29 who are currently students” is represented
by a different fi than “males aged 20-29 who are currently baristas”. Given this, we will consider a
simplified probabilistic model where only the mean is influenced by the user meta data fi.
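The grouping step of this exploratory analysis can be sketched as follows, with `U` a hypothetical (N, d) array of MAP user features and `labels` the per-user configurations of f_i.

```python
import numpy as np

def groupwise_gaussian_mle(U, labels):
    """Fit a Gaussian to the latent features of each meta-label group.

    U is (N, d); labels is a length-N sequence of hashable configurations
    (e.g. tuples of the binary f_i). Returns {label: (mean, precision)},
    using the maximum likelihood (biased) covariance estimate.
    """
    fits = {}
    for lab in set(labels):
        G = U[[i for i, l in enumerate(labels) if l == lab]]
        cov = np.cov(G, rowvar=False, bias=True)
        fits[lab] = (G.mean(axis=0), np.linalg.inv(cov))
    return fits
```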
4.3 Model
We discuss the model in the context of a recommendation system for items to users, as that is a
common application and also the context of the data sets used in this thesis. However, the model can
be generalized to the setting of preference matching between two arbitrary sets.
Figure 4.2: MovieLens data set: Two prototypical examples of the distribution of latent user features obtained from a two-dimensional MAP estimate, when grouped by certain user meta information (age, gender, occupation). In both subsets, we can qualitatively see differences in both the mean and precision for the latent user features in each group.
Figure 4.3: Bayesian Meta Constrained Probabilistic Matrix Factorization with Gaussian-Wishart Priorsover the latent user, item, and side feature vectors. The user precision factors αi and item precisionfactors βj allows for non-constant variance in the observed preference ri,j . The extension to scaled
Suppose we have a system with N users and M items, where each user provides a rating on a subset
of items. If user i provides a rating for item j, we denote the rating by ri,j . These ratings form a sparse
rating matrix R ∈ R^{N×M}. Users index rows and items index columns in this rating matrix. The matrix is sparse since each user i typically rates a subset Ni ⊂ {1, . . . , M} of items, where |Ni| ≪ M.
Standard matrix factorization methods find a low rank approximation to the rating matrix. Each
row / user i is assigned a latent feature vector Ui, each item j is assigned a latent feature vector Vj ,
and the rating by user i of item j is reconstructed by the inner product U>i Vj . In the probabilistic
framework, the rating is modeled as Gaussian conditional on the features, and the features are also
modeled as Gaussian,
\[
\begin{aligned}
(r_{i,j} \mid U_i, V_j) &\sim \mathcal{N}(r_{i,j} \mid U_i^\top V_j, \tau) \\
(U_i \mid \mu_U, \Lambda_U) &\sim \mathcal{N}(U_i \mid \mu_U, \Lambda_U) \\
(V_j \mid \mu_V, \Lambda_V) &\sim \mathcal{N}(V_j \mid \mu_V, \Lambda_V).
\end{aligned} \tag{4.4}
\]
The Bayesian extension places Gaussian-Wishart priors on the feature means and precision matrices.
\[
\begin{aligned}
(\mu_U, \Lambda_U) &\sim \mathcal{N}(\mu_U \mid \mu_0, \beta_0\Lambda_U) \cdot \mathcal{W}(\Lambda_U \mid \nu_0, \Lambda_0) \\
(\mu_V, \Lambda_V) &\sim \mathcal{N}(\mu_V \mid \mu_0, \beta_0\Lambda_V) \cdot \mathcal{W}(\Lambda_V \mid \nu_0, \Lambda_0).
\end{aligned} \tag{4.5}
\]
Suppose we have auxiliary information for the users. We assume that this information can be
categorized and encoded in a binary vector fi ∈ {0, 1}^{dm×1} for each user i. While this may appear
restrictive, it is the case for several popular examples (ex: gender, Facebook Likes, post-secondary
education, Twitter followers, etc). For continuous information (ex: age), one can discretize it sensibly
and obtain a binary vector. We will use this auxiliary information to constrain the prior means for
the user feature vectors. For convenience, we can stack the fi in a matrix as columns to obtain an
dm × N matrix f. This matrix notation will be used in deriving the gradient descent updates. In f,
each column provides all the information for a single user, and each row provides the information for a
single attribute for all users. From this point, we will refer to the auxiliary features, or user attributes, as meta features.
For each meta feature k ∈ 1, . . . , dm, let Wk ∈ Rd be an associated latent feature for the presence
/ absence of the meta feature. We place the same prior distribution on Wk as on the user and item
features:
\[
\begin{aligned}
(W_k \mid \mu_W, \Lambda_W) &\sim \mathcal{N}(W_k \mid \mu_W, \Lambda_W) \\
(\mu_W, \Lambda_W) &\sim \mathcal{N}(\mu_W \mid \mu_0, \beta_0\Lambda_W) \cdot \mathcal{W}(\Lambda_W \mid \nu_0, \Lambda_0).
\end{aligned} \tag{4.6}
\]
We place the same Gaussian-Wishart prior on these side features as placed on the user and item features in Equation (4.5).
With respect to the PMF model, the Wk are used to shift the prior mean for the user features Ui.
The net effect is that users with the same meta information a priori should have features with the same
mean. In the probabilistic framework, we encode this with the distribution for Ui as
\[
(U_i \mid \mu_U, \Lambda_U, W_{1:d_m}, f_{1:N}) \sim \mathcal{N}\left( U_i \;\Big|\; \mu_U + \|f_i\|^{-1} \sum_{k=1}^{d_m} W_k f_{k,i},\; \Lambda_U \right). \tag{4.7}
\]
Here fk,i is the (k, i) element of f . Note that if a user has no meta information, then the sum is
empty, we define the prior mean to be µU , and the model reverts to the vanilla PMF model.
With this, the model is defined as
\[
\begin{aligned}
(r_{i,j} \mid \cdots) &\sim \mathcal{N}(r_{i,j} \mid U_i^\top V_j, \tau) \\
(U_i \mid \cdots) &\sim \mathcal{N}\Big(U_i \;\Big|\; \mu_U + \|f_i\|^{-1} \sum_k W_k f_{k,i},\; \Lambda_U\Big) \\
(V_j \mid \cdots) &\sim \mathcal{N}(V_j \mid \mu_V, \Lambda_V) \\
(W_k \mid \cdots) &\sim \mathcal{N}(W_k \mid \mu_W, \Lambda_W) \\
(\mu_U, \Lambda_U) &\sim \mathcal{N}(\mu_U \mid \mu_0, \beta_0\Lambda_U) \cdot \mathcal{W}(\Lambda_U \mid \nu_0, \Lambda_0) \\
(\mu_V, \Lambda_V) &\sim \mathcal{N}(\mu_V \mid \mu_0, \beta_0\Lambda_V) \cdot \mathcal{W}(\Lambda_V \mid \nu_0, \Lambda_0) \\
(\mu_W, \Lambda_W) &\sim \mathcal{N}(\mu_W \mid \mu_0, \beta_0\Lambda_W) \cdot \mathcal{W}(\Lambda_W \mid \nu_0, \Lambda_0).
\end{aligned} \tag{4.8}
\]
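The shifted prior mean for a user can be sketched as below; we assume here that ‖f_i‖ counts the user's active attributes (an averaging normalization; the exact norm used may differ), and all array shapes are illustrative.

```python
import numpy as np

def constrained_prior_mean(mu_U, W, f_i):
    """Prior mean mu_U + ||f_i||^{-1} sum_k W_k f_{k,i} for a user's features.

    W is (d, d_m) with the meta-feature vectors W_k as columns and f_i a
    binary attribute vector. A user with no meta information falls back to
    mu_U, recovering the vanilla PMF prior.
    """
    n_active = f_i.sum()
    if n_active == 0:
        return mu_U.copy()
    return mu_U + (W @ f_i) / n_active
```

Users with identical meta information thus share a prior mean for their latent features.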
In Chapter 2, the importance of user and item offsets was discussed. These user and item offsets
were used in the modeling there and in Chapter 3. For simplicity, user and item offsets are not used in
the model for these experiments. However, we briefly outline a way in which the user offset can be tied
through this additional user information.
Let ω ∈ Rdm×1 be a vector with distribution
\[
(\omega \mid \mu_{\omega_0}, \Lambda_{\omega_0}) \sim \mathcal{N}(\omega \mid \mu_{\omega_0}, \Lambda_{\omega_0}). \tag{4.9}
\]
Using this vector, we define the user offset as
\[
(\gamma_i \mid \mu_\gamma, \lambda_\gamma, f_i, \omega) \sim \mathcal{N}(\gamma_i \mid \mu_\gamma + \omega^\top f_i, \lambda_\gamma). \tag{4.10}
\]
With these offsets included, the rating ri,j would be modeled as
\[
\mathbb{E}[r_{i,j}] = \gamma_i + \eta_j + U_i^\top V_j. \tag{4.11}
\]
4.4 Inference
This extension of the vanilla PMF model is conjugate, which permits the use of Gibbs sampling for
inference. Many Gibbs sampling distributions are identical to the vanilla PMF model, by conditional
independence. The derivations that follow are provided in detail in Appendix E.
The user features have a new mean, to reflect the shift from the side features:
\[
\begin{aligned}
(U_i \mid r_{i,j}, \cdots) &\sim \mathcal{N}(U_i \mid \tilde{\mu}_{U_i}, \tilde{\Lambda}_{U_i}) \\
\tilde{\Lambda}_{U_i} &= \Lambda_U + \tau \sum_{j=1}^{M} I_{i,j} V_j V_j^\top \\
\tilde{\mu}_{U_i} &= \tilde{\Lambda}_{U_i}^{-1} \left[ \tau \sum_{j=1}^{M} I_{i,j} r_{i,j} V_j + \Lambda_U \Big( \mu_U + \|f_i\|^{-1} \sum_{k=1}^{d_m} W_k f_{k,i} \Big) \right]. \tag{4.12}
\end{aligned}
\]
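A sketch of one such draw, using standard conjugate Gaussian algebra (array shapes and the helper name are ours, not the thesis's):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_user_feature(V_rated, r_rated, tau, Lambda_U, prior_mean):
    """One Gibbs draw of a user feature vector U_i.

    V_rated is (n_i, d), the features of the items this user rated, and
    prior_mean is the meta-shifted prior mean of U_i. The posterior
    precision and mean follow the usual conjugate Gaussian update.
    """
    post_prec = Lambda_U + tau * V_rated.T @ V_rated
    rhs = tau * V_rated.T @ r_rated + Lambda_U @ prior_mean
    post_mean = np.linalg.solve(post_prec, rhs)
    # Sample N(post_mean, post_prec^{-1}) via a Cholesky factor of the precision.
    L = np.linalg.cholesky(post_prec)
    return post_mean + np.linalg.solve(L.T, rng.standard_normal(len(post_mean)))
```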
The sampling distribution for the meta features Wk in this model is also Gaussian,
Chapter 4. Meta-Constrained Latent User Features 47
Table 4.1: Number of ratings, users, and items for the MovieLens and Flixster data sets used for modeltesting.
                    MovieLens   Flixster
Number of Ratings   1,000,209   8,196,077
Number of Users     6,040       147,612
Number of Items     3,952       48,794
\[
\begin{aligned}
(W_k \mid r_{i,j}, \cdots) &\sim \mathcal{N}(W_k \mid \tilde{\mu}_{W_k}, \tilde{\Lambda}_{W_k}) \\
\tilde{\Lambda}_{W_k} &= \Lambda_W + \Lambda_U \sum_{i=1}^{N} \|f_i\|^{-2} f_{k,i}^2 \\
\tilde{\mu}_{W_k} &= \tilde{\Lambda}_{W_k}^{-1} \left[ \Lambda_W \mu_W + \Lambda_U \sum_{i=1}^{N} \|f_i\|^{-1} f_{k,i} \Big( U_i - \mu_U - \|f_i\|^{-1} \sum_{j \neq k} W_j f_{j,i} \Big) \right]. \tag{4.13}
\end{aligned}
\]
The sampling distribution for (µ_U, Λ_U) is identical to the baseline Bayesian PMF model, with U_i replaced by the “shifted” equivalent U_i − ‖f_i‖^{-1} Σ_{k=1}^{dm} W_k f_{k,i}.
If considering the model with user and item offsets, the sampling distribution for γi is Gaussian,
parametrized as
\[
\begin{aligned}
(\gamma_i \mid \cdots) &\sim \mathcal{N}(\gamma_i \mid \tilde{\mu}_{\gamma_i}, \tilde{\lambda}_{\gamma_i}) \\
\text{where } \tilde{\lambda}_{\gamma_i} &= \lambda_\gamma + \tau \sum_{j=1}^{M} I_{i,j} \\
\text{and } \tilde{\mu}_{\gamma_i} &= \tilde{\lambda}_{\gamma_i}^{-1} \left[ \tau \sum_{j=1}^{M} I_{i,j}(r_{i,j} - \eta_j - U_i^\top V_j) + \lambda_\gamma(\mu_\gamma + \omega^\top f_i) \right]. \tag{4.14}
\end{aligned}
\]
The sampling distribution for the offset vector ω is
\[
\begin{aligned}
(\omega \mid \cdots) &\sim \mathcal{N}(\omega \mid \tilde{\mu}_\omega, \tilde{\Lambda}_\omega) \\
\text{where } \tilde{\Lambda}_\omega &= \Lambda_{\omega_0} + \sum_{i=1}^{N} \lambda_{\gamma_i} f_i f_i^\top \\
\text{and } \tilde{\mu}_\omega &= \tilde{\Lambda}_\omega^{-1} \left[ \Lambda_{\omega_0} \mu_{\omega_0} + \sum_{i=1}^{N} f_i (\gamma_i - \gamma_0) \lambda_{\gamma_i} \right]. \tag{4.15}
\end{aligned}
\]
Again, the experimental results reported here do not include user or item offsets. These derivations
are included for completeness.
4.5 Experimental Setup
We make use of the MovieLens 1M data set1 and the Flixster data set2 for experimentation. Statistics
on these data sets are given in Table 4.1.
1 http://grouplens.org/datasets/movielens/
2 http://www.cs.ubc.ca/~jamalim/datasets/
Additional sparsity was created by using a small portion of data for training. Training used 10%
of the MovieLens data and 15% of the Flixster data set. Gibbs sampling was initiated using MAP
estimates obtained as described in Section 4.7. An appropriate set of tuning parameters was used to
obtain suitable MAP estimates, but no exhaustive grid search was performed.
The small amount of training data highlights the impact of the more complicated prior in the absence
of rating behaviour of the users. Experiments were additionally performed with larger amounts of
training data. The results indicated that the proposed model and the baseline model produced nearly
identical test RMSE values with larger amounts of training data.
We used the following choices of hyper-parameters. For the Gaussian-Wishart priors on the user feature vectors, (µ0, Λ0, β0, ν0) = (0_{d×1}, I_d, 1, d + 1). The mean value was chosen to reflect that the
features are mean zero after accounting for the biases, while the values for the scale matrix and degrees
of freedom were selected to give a vague prior that was still proper. This was chosen for consistency
with previous experimental results. Identical choices were made for the item and meta features. User
offsets and item offsets were not included in these experiments.
4.6 User Meta Information
The MovieLens 1M data set contains gender, age, occupation, and geographical information (zip code)
on the users. Age was ordinal in seven mutually exclusive ranges. Occupation was categorical in 21
categories. For each user i, we defined fi as a binary bit vector for gender, age, and occupation. We
excluded zip code for these results. This produces a binary (user, demographic) matrix of auxiliary meta
information for the 6,040 users across 30 demographics.
The Flixster data set provides gender, location, the date the user joined, the last login date, the
number of profile views, and age. Some of this data is missing for some users (ex: age is not always
available), and some is meaningless. For instance, the data format suggests that location was a free-form
response, as some values include “my house”, “on the earth”, “Texas”, “Independence Avenue”, and “its
[sic] a secret :)”. For our experiments, we first removed any user with missing age, and then used gender
and age for auxiliary information. We binned age into intervals of size 10 starting from the youngest to
oldest age in the data set. This yielded categorical bins for age of 10− 19, 20− 29, . . . , 110− 120. This
produces a binary (user, demographic) matrix of auxiliary meta information for the 147,612 users across
13 demographics.
4.6.1 PCA for User Meta Information
The categories of meta information are highly dependent as several pairs are mutually exclusive. For
instance, a user cannot have multiple declared ages. This means there is structure in the meta demo-
graphic matrix to consider. Following work in the literature for similar information [16], an experiment
was performed for each data set making use of the output of principal components. Principal compo-
nent analysis was used as an additional optional feature engineering step on the demographic matrix for
each data set. The original user demographics fi were projected onto the first k principal components,
where k was selected based on a screeplot of the eigenvalues of the demographic correlation matrix. The
screeplots for these two data sets are in Figure 4.4 for reference. The projected fi were then used as
input into the model.
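The projection step can be sketched with plain NumPy (the function name is ours, and we assume no demographic column is constant):

```python
import numpy as np

def project_demographics(F, k):
    """Project the binary (user, demographic) matrix F of shape (N, d_m)
    onto the first k principal components of the demographic correlation
    matrix, returning the (N, k) features used in place of the raw f_i."""
    Z = (F - F.mean(axis=0)) / F.std(axis=0)   # standardize each column
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    top = np.argsort(eigvals)[::-1][:k]        # indices of the k largest
    return Z @ eigvecs[:, top]
```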
Figure 4.4: Screeplots of the MovieLens 1M and Flixster meta information
As in other applications, the use of PCA can significantly reduce the number of explanatory variables
needed by the model. In our particular case, we report results using the first six principal components
for each data set. This is a reduction of 80% from the original 30 variables in MovieLens, and a reduction
of 54% from the original 13 variables in Flixster.
4.7 MAP Estimate
We start from a MAP estimate for the offset and the latent features. This MAP estimate is obtained by
batch gradient descent on the following objective function
\[
\ell(\gamma_{1:N}, \eta_{1:M}, \omega, U_{1:N}, V_{1:M}, W_{1:d_m}) \equiv \ell
= \sum_{i=1}^{N} \sum_{j=1}^{M} I_{i,j}(r_{i,j} - U_i^\top V_j)^2
+ \lambda_U \sum_{i=1}^{N} (U_i - \|f_i\|^{-1} W f_i)^\top (U_i - \|f_i\|^{-1} W f_i)
+ \lambda_V \sum_{j=1}^{M} V_j^\top V_j
+ \lambda_W \sum_{k=1}^{d_m} W_k^\top W_k. \tag{4.16}
\]
By letting W ∈ R^{d×dm}, where the kth column of W is Wk, Equation (4.16) can be equivalently expressed as
\[
\ell = \sum_{i=1}^{N} \sum_{j=1}^{M} I_{i,j}(r_{i,j} - U_i^\top V_j)^2
+ \lambda_U \sum_{i=1}^{N} (U_i - \|f_i\|^{-1} W f_i)^\top (U_i - \|f_i\|^{-1} W f_i)
+ \lambda_V \sum_{j=1}^{M} V_j^\top V_j
+ \lambda_W \,\mathrm{Trace}(W^\top W). \tag{4.17}
\]
The gradients with respect to Ui, Vj , and the matrix W are:
\[
\begin{aligned}
\frac{\partial \ell}{\partial U_i} &= -\sum_{j=1}^{M} I_{i,j}(r_{i,j} - U_i^\top V_j)\, V_j + \lambda_U (U_i - \|f_i\|^{-1} W f_i) \\
\frac{\partial \ell}{\partial V_j} &= -\sum_{i=1}^{N} I_{i,j}(r_{i,j} - U_i^\top V_j)\, U_i + \lambda_V V_j \\
\frac{\partial \ell}{\partial W} &= -\lambda_U \sum_{i=1}^{N} (U_i - \|f_i\|^{-1} W f_i)\, \|f_i\|^{-1} f_i^\top + \lambda_W W. \tag{4.18}
\end{aligned}
\]
Note that this formulation allows us to update all the Wk simultaneously.
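One full batch update can then be written in vectorized form; this is our sketch of the updates in Equation (4.18) (names and the learning rate `lr` are illustrative), with the meta vectors pre-scaled by their norms so that all Wk move in a single matrix product.

```python
import numpy as np

def map_gradient_step(R, I, U, V, W, Fn, lam_U, lam_V, lam_W, lr):
    """One batch gradient descent step for the MAP objective.

    R, I: (N, M) ratings and observed-entry indicators. U: (d, N) user
    features, V: (d, M) item features, W: (d, d_m) meta features. Fn is
    (d_m, N) holding the pre-scaled columns f_i / ||f_i||.
    """
    E = I * (R - U.T @ V)          # residuals on observed entries only
    shift = U - W @ Fn             # U_i - ||f_i||^{-1} W f_i for all users
    dU = -V @ E.T + lam_U * shift
    dV = -U @ E + lam_V * V
    dW = -lam_U * shift @ Fn.T + lam_W * W
    return U - lr * dU, V - lr * dV, W - lr * dW
```

Updating W through one matrix product is what allows all meta features to be updated simultaneously.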
4.8 Experimental Results
Table 4.2: MovieLens. PCA used the first six principal components.
Frequency    Baseline   Meta     Relative Change   PCA Meta   Relative Change
Overall      0.9292     0.9221   −0.76%            0.9243     −0.53%
[0 − 1)      1.0006     1.0005   −0.01%            1.0039     +0.33%
[1 − 2)      1.0201     1.0094   −1.05%            1.0164     −0.36%
[2 − 3)      1.0185     1.0143   −0.41%            1.0174     −0.11%
[3 − 4)      0.9872     0.9857   −0.15%            0.9857     −0.15%
[4 − 5)      0.9585     0.9536   −0.51%            0.9577     −0.08%
[5 − 10)     0.9665     0.9604   −0.63%            0.9627     −0.39%
[10 − 25)    0.9402     0.9343   −0.63%            0.9359     −0.46%
[25 − 50)    0.9151     0.9077   −0.81%            0.9057     −0.59%
[50 − 100)   0.9053     0.8973   −0.88%            0.8998     −0.61%
100+         0.9446     0.9441   −0.05%            0.9440     −0.06%
Table 4.2 lists the test RMSE values obtained using the MovieLens 1M data, and Table 4.3 lists the
test RMSE values obtained using the Flixster data.
For both data sets, the model with meta constraints (‘Meta’) for the users outperforms the baseline
PMF model (‘Baseline’), both overall in the test set and for sets of users with different frequencies of
ratings in the training set. The relative changes are all negative, reflecting a lower test RMSE for the
model with meta constraints relative to the baseline PMF model. These relative changes are small,
but on the same order of magnitude as the relative changes for the original constrained PMF model in
Table 4.3: Flixster. PCA used the first six principal components.

| Frequency | Baseline | Meta   | Relative Change | PCA Meta | Relative Change |
|-----------|----------|--------|-----------------|----------|-----------------|
| Overall   | 0.9066   | 0.8952 | -1.26%          | 0.8953   | -1.25%          |
| [0-1)     | 1.1187   | 1.1038 | -1.33%          | 1.1093   | -0.84%          |
| [1-2)     | 1.0435   | 1.0362 | -0.70%          | 1.0389   | -0.44%          |
| [2-3)     | 0.9990   | 0.9916 | -0.74%          | 0.9940   | -0.50%          |
| [3-4)     | 0.9820   | 0.9737 | -0.85%          | 0.9763   | -0.58%          |
| [4-5)     | 0.9536   | 0.9452 | -0.88%          | 0.9467   | -0.72%          |
| [5-10)    | 0.9546   | 0.9447 | -1.04%          | 0.9455   | -0.95%          |
| [10-25)   | 0.9429   | 0.9315 | -1.21%          | 0.9324   | -1.11%          |
| [25-50)   | 0.9180   | 0.9065 | -1.25%          | 0.9062   | -1.29%          |
| [50-100)  | 0.8844   | 0.8737 | -1.21%          | 0.8736   | -1.22%          |
| 100+      | 0.7998   | 0.7840 | -1.98%          | 0.7827   | -2.14%          |
Chapter 2.
Larger gains are seen for the Flixster data set relative to the MovieLens data set. This suggests that
the auxiliary information might be more relevant for predicting ratings in the Flixster data set.
The results of the third experiment with the demographics projected onto principal components are
reported in Tables 4.2 - 4.3 under the ‘PCA Meta’ column. There are some important observations.
First, the PCA approach produces test RMSE values almost always better than the baseline model.
The sole exception is users with no ratings in the MovieLens data set. For this group of users, the
PCA approach produces a test RMSE that is 0.33% larger than the baseline.
Second, the relative changes are generally smaller in magnitude when compared to the relative
changes of the ‘Meta’ model. This is to be expected, as the PCA approach uses only the first k = 6
principal components for each data set. This dimensionality reduction from the original 30 variables for
MovieLens, and 13 variables for Flixster, produces information loss. Still, it is encouraging that a reduced set of features generated through PCA can produce favourable test results.
The overall test error for both data sets is in the left column of Figure 4.5. For MovieLens, the test
error under the baseline, meta, and PCA models starts at approximately the same numerical value. The
overall test error under the ‘Meta’ model drops at a clearly larger rate than either the baseline or the
PCA model. This rapid drop reflects the better mixing afforded by the meta constraints.
For Flixster, the effect on the overall test error is even more pronounced. The bottom-left panel in Figure 4.5
illustrates that both the ‘Meta’ and ‘PCA’ models initially have a higher overall test RMSE. However,
the test RMSE under both these models drops much more quickly than under the baseline model, and
converges to a clearly lower value. This further illustrates the benefit of the proposed model.
In the MovieLens data set, the test RMSE for cold start users drops for all three models. As
mentioned, the cold start error under MovieLens is lowest under the baseline model. Figure 4.5 (top-
right) illustrates this. However, this figure also illustrates that the ‘Meta’ model drops in test error sooner
and more rapidly than the baseline model. As the relative increase for the ‘Meta’ model relative to the
baseline model is 0.01%, the significance (both practical and statistical) of this increase is questionable.
In the Flixster data set, the baseline model is unable to improve over the MAP estimate for cold
start users. However, the ‘Meta’ model is able to improve over the MAP estimate, and over the baseline
model. This is again despite the MAP estimate for the ‘Meta’ model producing worse test error than
the baseline model. The cold start error for the PCA approach initially produces an error rate similar
to the baseline model. However, the PCA approach is able to improve on this MAP estimate error rate,
Figure 4.5: (Left) Overall test RMSE and (right) RMSE on the cold start user subset for (top) MovieLens1M and (bottom) Flixster data sets.
[Figure: panels (a) MovieLens, overall; (b) MovieLens, no ratings; (c) Flixster, overall; (d) Flixster, no ratings; each plotting test RMSE against iteration for the baseline, meta, and SVD runs.]
in contrast to the baseline model error rate which does not improve.
Similar observations hold for users with few ratings in the system. Figure 4.6 illustrates the change
in test error for users with one rating (left panel) and two ratings (right panel) for both data sets. The
relative performance and rates of decline are similar to the case of users with no ratings and the entire
test set. The plots for Flixster indicate that even one rating in the system is enough to produce a
smooth decline in test error for all three models. In particular, the test error for the baseline model now
decreases smoothly from over 1.15 to 1.1187. However, the Meta model and PCA model still outperform
the baseline, achieving a test error of 1.1038 and 1.1093, respectively.
4.9 Conclusion
In this chapter, we started with an exploratory analysis of the two-dimensional user feature vectors
obtained from MAP estimation for a data set. We performed supervised clustering of these user feature
vectors using labels generated from user demographic information. Based on the distributional differences observed, we proposed a model that uses this demographic information to capture those differences.
Our proposed model outperforms the baseline PMF model with respect to test RMSE overall, and
for users with different frequency in the system at training time. In particular, our model improves
predictive accuracy for users with no ratings in the system.
Figure 4.6: Test errors for users with few ratings in the system.
[Figure: panels (a) MovieLens, 1 rating; (b) MovieLens, 2 ratings; (c) Flixster, 1 rating; (d) Flixster, 2 ratings; each plotting test RMSE against iteration for the baseline, meta, and SVD runs.]
Mirroring existing work in the literature for similar problems, we proposed a modified model that
uses the demographic information projected onto the first k principal components. The number of
principal components was selected heuristically based on bends in scree plots. This approach showed
mixed results. For the MovieLens data set, the results were similar to the baseline model. For the
Flixster data set, the results were similar to the original ‘Meta’ model proposed. This difference in
performance may be explained by the dimensionality reduction being much larger for MovieLens than
for Flixster, in turn meaning a higher reconstruction error for MovieLens. The bend in the scree plots
suggested k = 6 principal components for both data sets, which is a reduction in variables by 80% for
MovieLens, compared to 54% for Flixster.
Future work could explore the sensitivity of the PCA approach to the number of components. In
addition, the initial EDA of the two-dimensional user feature MAP estimates suggested there may be
distributional differences in the covariance matrix. In the proposed model, only differences in the mean
were modeled using the demographics. Future work can look at similar constraints for the covariance
matrix.
Chapter 5

A Generative Model for User Network Constraints in Matrix Factorization
5.1 Introduction
At the core of recommender systems are two sets of objects, with the goal of recommending items in one
set to members of the other set. In some applications, these two sets are distinct (recommending
videos to users on YouTube, recommending apps to users on Google Play), while in other applications
the two sets coincide (recommending people to people on social networking sites such as
Facebook, LinkedIn, OkCupid, etc.). Collaborative filtering systems solve this problem by
considering the preferences of other, similar users when making recommendations.
The low-rank matrix assumption for collaborative filtering starts with a (user, item) matrix of
preferences R \in \mathbb{R}^{N \times M} and factorizes it as the product of two low-rank matrices, R = U^\top V, where
U \in \mathbb{R}^{d \times N} and V \in \mathbb{R}^{d \times M}. Each column of U is the latent feature of a user, each column of V the latent
feature of an item, and the entry r_{i,j} can be reconstructed as the inner product U_i^\top V_j, where U_i is the
ith column of U, and V_j is the jth column of V.
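As a minimal sketch of this reconstruction (the dimensions below are arbitrary):

```python
import numpy as np

d, N, M = 2, 3, 4                  # latent dimension, number of users, items
rng = np.random.default_rng(0)
U = rng.standard_normal((d, N))    # column i: latent feature of user i
V = rng.standard_normal((d, M))    # column j: latent feature of item j

R = U.T @ V                        # rank-d reconstruction of the rating matrix
# Entry r_{i,j} is the inner product of the corresponding columns.
assert np.allclose(R[1, 2], U[:, 1] @ V[:, 2])
assert np.linalg.matrix_rank(R) <= d
```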
In the probabilistic framework, r_{i,j} is modeled as a Gaussian with mean U_i^\top V_j, and each column of
U and V has an independent Gaussian prior placed on it. The Bayesian extension creates a hierarchical
model by adding a Gaussian-Wishart prior for the Gaussian mean and precision. While computationally
convenient, this model assumes prior independence between the users and the items. In many practical
applications of collaborative filtering, there exists an underlying network defined between users. These
networks have been discussed in the literature as both social networks and trust networks. In the case of
social networks, an edge between users suggests social similarity. In the case of trust networks, an edge
between users suggests a trust relation. While these are both networks between users, it is important to
highlight the differences between the two.
Mathematically, a network between N users can be defined by a graph
\mathcal{G} = (\mathcal{V}, \mathcal{E}),
\quad \text{where } \mathcal{V} = \{ v_i \mid i \in \{1, \dots, N\} \}
\text{ and } \mathcal{E} = \{ a_{i,j} \mid i, j \in \{1, \dots, N\} \}.    (5.1)
In this notation, V is a set of N vertices / nodes, and each corresponds to a user in the system. E is
an indexed collection of edges between the N users. We further define the adjacency matrix
A = (a_{i,j})_{i,j \in \{1, \dots, N\}},    (5.2)
The (i, j) entry of this matrix, a_{i,j}, indicates the relationship between users i and j in the network,
corresponding to nodes i and j in the graph.
A user network may be either directed or undirected. In the case of Facebook,
where the social tie between two users (ui, uj) is symmetric, the network is undirected. With respect
to the adjacency matrix A, this means that ai,j = aj,i. In the case of Google Plus, where a user ui
can add another user uj to a “circle” without the need for a reciprocal action, the network is directed.
Again, with respect to the adjacency matrix A, this means that ai,j and aj,i may be different. To use
colloquial terms, to “friend” someone is an undirected edge, while to “follow” someone is a directed
edge. The literature commonly assumes that trust networks are directed. For undirected networks, the
adjacency matrix is necessarily symmetric. The adjacency matrix for directed networks is not necessarily
symmetric.
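In code, this distinction is simply a symmetry check on the adjacency matrix; the small networks below are toy examples of ours:

```python
import numpy as np

# Undirected ("friend") network: the adjacency matrix is symmetric.
A_friend = np.array([[0, 1, 0],
                     [1, 0, 1],
                     [0, 1, 0]])

# Directed ("follow") network: a_{i,j} may differ from a_{j,i}.
A_follow = np.array([[0, 1, 0],
                     [0, 0, 1],
                     [0, 0, 0]])

def is_undirected(A):
    """A network is undirected iff its adjacency matrix equals its transpose."""
    return np.array_equal(A, A.T)

assert is_undirected(A_friend)
assert not is_undirected(A_follow)
```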
In addition to trust, user networks can contain distrust statements. One example would be a “block
list” on social networking sites, or marketing sites. There has been recent work on incorporating distrust
statement in the matrix factorization model [5], but this work is not in the Bayesian context. The high
level idea is that users with positive relations should have more similar features than users with negative
relations. Specifically, consider three users u_i, u_j, u_k such that u_i and u_j have a favourable relationship
in the network (ex: a trust relationship or friendship), but u_i and u_k have an unfavourable relationship
in the network (ex: u_i blocked u_k). Then for some general loss function \ell and distance metric d, the
optimization problem includes the penalty

\ell \left( d(U_i, U_j) - d(U_i, U_k) \right)    (5.3)
in addition to the log-likelihood for the ratings and the standard regularization terms for the features.
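A sketch of the penalty in Equation (5.3), taking d to be Euclidean distance and \ell a hinge loss; both choices are our illustration, since the text leaves them general:

```python
import numpy as np

def distrust_penalty(U_i, U_j, U_k, loss=lambda x: max(x, 0.0)):
    """Penalize u_i being closer to a distrusted user u_k than to a trusted u_j.

    Computes loss(d(U_i, U_j) - d(U_i, U_k)) with Euclidean distance d;
    the penalty is positive when the trusted user is the more distant one.
    """
    d_trust = np.linalg.norm(U_i - U_j)
    d_distrust = np.linalg.norm(U_i - U_k)
    return loss(d_trust - d_distrust)

# Trusted neighbour is closer than the blocked user: no penalty.
assert distrust_penalty(np.zeros(2), np.array([0.1, 0.0]), np.array([5.0, 0.0])) == 0.0
```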
Previous work has considered matrix factorization with social regularization [15]. In this framework,
the L2 norm of each user feature Ui was constrained to be simultaneously close to zero and also close
to the average of the features for the users that user i was connected to in the social network. Gradient
descent on an objective function with dual user regularization showed that this improved predictive
performance for users with few or no ratings in the system.
Inference for these gradient-based methods performed well. However, these models will not extend
easily (if at all) to a fully generative framework. To illustrate this, we will outline in this chapter a
pseudo-extension, the most natural one given the existing framework. We will demonstrate how this
pseudo-extension leads to pathological issues with inference, which results in poor test performance. As
an alternative, we propose a fully generative alternative that shares the same social network dependency.
Our experimental results demonstrate the proposed model achieves comparable test performance under
MAP estimates obtained from gradient descent. We further illustrate that the fully Bayesian extension
outperforms the MAP estimates, reinforcing the results of previous chapters that fully generative models
achieve lower test error relative to the estimates achieved from gradient descent.
To outline, our contributions in this chapter are:
• We review the existing social network based matrix factorization models which are not properly
generative. We conjecture a pseudo-extension fully generative model, and illustrate pathological
inferential issues with this pseudo-extension;
• We propose a fully generative model that mimics the same social network dependency as the
existing non-generative counterparts;
• We demonstrate that our proposed generative model achieves comparable test error with MAP
estimates obtained from gradient descent as the existing non-generative counterpart;
• We demonstrate that our proposed generative model does not suffer from pathological inferential
issues under Gibbs sampling, and also achieves lower test error relative to the MAP estimate.
5.2 Previous Work
Matrix factorization with social regularization has previously been considered in the literature.
Some work considered a joint factorization of a user-item rating matrix R and a user-user social graph
G. The low-rank user features were shared between the two factorizations, imposing the social constraint
in the model [21]. The drawback to this model is that the rating matrix and the social graph are on
different scales. Typically, the entries in the rating matrix are ordinal in some subset of the integers.
Conversely, the social graph matrix is typically binary. This forces the shared user features to reconcile
two matrices that may be on significantly different scales.
Follow up work focused on a model for the user-item rating matrix R, and no probabilistic model
for the social graph G [20]. Social regularization was introduced by modifying the predicted user-item
rating ri,j to be a convex combination of local user influence and social network influence. Using the
notation previously defined, the mean of the Gaussian for the rating is now

\mathbb{E}[r_{i,j}] = \alpha\, U_i^\top V_j + (1 - \alpha) \sum_k a_{i,k}\, U_k^\top V_j,    (5.4)
where ai,k is the entry of the user adjacency matrix, as defined in the introduction to this chapter,
Equation (5.2).
This is a convex combination between a given user’s tastes in the first term and the user’s first-
degree neighbours in the second term. The parameter α was a tuning parameter that governed the
trade-off between individual taste and social taste in the prediction. While intuitively simple, there is
a single tuning parameter α shared across all users. In practice, some users may rely more on social
recommendations than others.
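A direct transcription of the predicted mean in Equation (5.4); the dimensions and value of α below are illustrative:

```python
import numpy as np

def social_prediction(i, j, U, V, A, alpha):
    """Mean rating from Equation (5.4): a convex combination of the user's
    own taste and the adjacency-weighted tastes of other users."""
    own = U[:, i] @ V[:, j]
    social = sum(A[i, k] * (U[:, k] @ V[:, j]) for k in range(U.shape[1]))
    return alpha * own + (1.0 - alpha) * social

# With alpha = 1 the prediction reduces to the plain inner product U_i^T V_j.
rng = np.random.default_rng(0)
U, V = rng.standard_normal((2, 3)), rng.standard_normal((2, 4))
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
assert np.isclose(social_prediction(0, 1, U, V, A, 1.0), U[:, 0] @ V[:, 1])
```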
More recent work also considered using the social network G as a source of regularization for the user
features [15]. As in previous work, a product of Gaussians was used to model the distribution of
the user features U_i,
p(U \mid \mathcal{G}, \tau_U, \tau_N) \propto p(U \mid \tau_U) \times p(U \mid \mathcal{G}, \tau_N)
  = \prod_{i=1}^{N} \mathcal{N}(U_i \mid 0, \tau_U I) \prod_{i=1}^{N} \mathcal{N}\Big( U_i \,\Big|\, \|a_{i,\cdot}\|^{-1} \sum_j U_j a_{i,j}, \, \tau_N I \Big).    (5.5)
Here, we define \|a_{i,\cdot}\| = \sum_j a_{i,j}. The first Gaussian penalizes the user features towards zero, while
the second penalizes towards the average of the user features of the first-degree neighbours.
While intuitively simple, this probabilistic formulation does not correspond to a proper generative
model. The model is circular. In particular, for any given user, the other user features must already exist.
One option would be to first generate the user features in the absence of the user network. Subsequent
generation would then be from the product of the two Gaussians in Equation 5.5. For iterative inferential
methods (including gradient descent, MCMC methods, variational inference), this can be achieved by
randomly generating the user features in the first iteration, and then updating according to the product
of Gaussians.
For this model, the equivalent energy function is
E = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} I_{i,j} \left( r_{i,j} - U_i^\top V_j \right)^2
  + \frac{\lambda_U}{2} \sum_{i=1}^{N} U_i^\top U_i
  + \frac{\lambda_N}{2} \sum_{i=1}^{N} \Big( U_i - \|a_i\|^{-1} \sum_{j=1}^{N} U_j a_{i,j} \Big)^\top \Big( U_i - \|a_i\|^{-1} \sum_{j=1}^{N} U_j a_{i,j} \Big)
  + \frac{\lambda_V}{2} \sum_{j=1}^{M} V_j^\top V_j.
The gradient with respect to Ui is given by,
\frac{\partial E}{\partial U_i} = -\sum_{j=1}^{M} I_{i,j} \left( r_{i,j} - U_i^\top V_j \right) V_j + \lambda_U U_i
  + \lambda_N \Big( U_i - \|a_i\|^{-1} \sum_{j=1}^{N} U_j a_{i,j} \Big)
  - \lambda_N \sum_{j : a_{i,j} = 1} a_{i,j} \Big( U_i - \|a_i\|^{-1} \sum_{j'=1}^{N} U_{j'} a_{i,j'} \Big).    (5.6)
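For concreteness, this gradient can be transcribed as follows; the implementation is an illustrative sketch with our own variable names, not the thesis code.

```python
import numpy as np

def net_user_gradient(i, R, I, U, V, A, lam_U, lam_N):
    """Gradient of the Net-model energy w.r.t. U_i, as in Equation (5.6):
    data term, ridge term, and the neighbour-mean network penalty."""
    resid = I[i] * (R[i] - U[:, i] @ V)        # residuals on observed ratings
    grad = -V @ resid + lam_U * U[:, i]
    deg = A[i].sum()
    if deg > 0:
        m_i = (U @ A[i]) / deg                 # average of first-degree neighbours
        grad += lam_N * (U[:, i] - m_i)        # pull towards the neighbour mean
        grad -= lam_N * deg * (U[:, i] - m_i)  # cross terms, summed over links
    return grad
```

For a user whose only link is one other user (deg = 1), the two network terms cancel and the gradient reduces to the plain PMF gradient, which is the special case examined in Section 5.2.2.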
5.2.1 Pseudo-Generative Extension
As mentioned, the existing model presented is not properly generative as the generation of the user
features is circular. To generate the user feature Ui for any given user, the user features for all first-
degree neighbours must first be generated. These features in turn depend on other features to be
pre-generated, including Ui in the case of a symmetric social network G.
Abusing the generative process, a pseudo-generative model can attempt to be defined according to
the following procedure:
1. First, generate the user features U1, U2, . . . , UN for N users independently;
2. Next, generate each user feature U_i conditional on the current set of user features, initially those
from the independent generation.
Given this pseudo-generative model, we can attempt inference under Gibbs sampling, albeit misguided,
to see whether it yields any desirable performance.
When this model is placed in the Bayesian probabilistic framework, it can be shown that the Gibbs
sampling distribution for Ui is given by
(U_i \mid \cdot) \sim \mathcal{N}(U_i \mid \mu_{U_i}, \Lambda_{U_i}),
\text{where } \Lambda_{U_i} = \tau \sum_j I_{i,j} V_j V_j^\top + \Lambda_U + \Lambda_U \sum_{k \neq i} \frac{a_{k,i}^2}{\|a_k\|^2}
\mu_{U_i} = \Lambda_{U_i}^{-1} \Big[ \tau \sum_j I_{i,j} V_j r_{i,j} + \Lambda_U \mu_U + \Lambda_U \|a_i\|^{-1} \sum_j U_j a_{i,j}
  + \Lambda_U \sum_{k \neq i} \frac{a_{k,i}}{\|a_k\|} \Big( U_k - \|a_k\|^{-1} \sum_{j \neq i} U_j a_{k,j} \Big) \Big].    (5.7)
5.2.2 A Special Case
Consider the case of two users, U1, and U2, who are connected to each other but not connected to any
other user. With respect to the notation introduced, a1,2 = 1, a2,1 = 1, and a1,j = a2,j = 0 for all j > 2.
The gradient of the energy function of Section 5.2 with respect to U_1 reduces to
\frac{\partial E}{\partial U_1} = -\sum_{j=1}^{M} I_{1,j} \left( r_{1,j} - U_1^\top V_j \right) V_j + \lambda_U U_1 + \lambda_N (U_1 - U_2) - \lambda_N (U_1 - U_2)
  = -\sum_{j=1}^{M} I_{1,j} \left( r_{1,j} - U_1^\top V_j \right) V_j + \lambda_U U_1.    (5.8)
This is equivalent to the gradient for the standard PMF energy function with respect to U1. In effect,
the two users are not connected. There is no additional penalty imposed for these users.
This reduction to the standard PMF framework does not occur for the sampling distributions defined
for Gibbs sampling. The mean of the sampling distribution in Equation (5.7) for this special two-user
case reduces to:
\mu_{U_1} = \Lambda_{U_1}^{-1} \Big[ \tau \sum_j I_{1,j} V_j r_{1,j} + \Lambda_U \mu_U + \Lambda_U U_2 + \Lambda_U U_2 \Big].    (5.9)
So the mean for U_1 is determined by the likelihood term from the ratings, a penalization towards
zero, and the feature U_2. The prior is now dominated by the likelihood of the ratings and the feature
U_2. For rare users, the Gibbs updates for U_1 and U_2 will continue to cycle around each other, and the
penalization towards zero may be ignored. This can lead to overfitting, as we show in the experimental
results.
5.3 Proposed Model
The existing model is not properly generative, and we have shown that the naive generative extension of
this model leads to pathological inferential behaviour. Therefore, the existing model is limited to gradient
descent methods. Inference relying on proper generative models is not possible for these models. This
is a limitation, as there are many inferential methods more powerful than simple gradient descent that rely
on proper generative models. Indeed, this was the case for extending the constrained PMF model to the
fully Bayesian framework in Chapter 2.
To resolve this pathological behaviour, we extend the hierarchical model for the user features by an
additional layer. In the first layer, each user generates a Gaussian latent feature. Given the generated
set of individual user features, each user generates a second Gaussian feature with a mean that accounts
for the network among users. The probabilistic framework for this proposed model is given by
(r_{i,j} \mid S_i, V_j, \tau) \sim \mathcal{N}(r_{i,j} \mid S_i^\top V_j, \tau)
(U_i \mid \mu_U, \Lambda_U) \sim \mathcal{N}(U_i \mid \mu_U, \Lambda_U)
(S_i \mid \mathcal{G}, \mu_S, \Lambda_S) \sim \mathcal{N}\Big( S_i \,\Big|\, \mu_S + U_i + \|a_{i,\cdot}\|^{-1} \sum_j U_j a_{i,j}, \, \Lambda_S \Big)
(V_j \mid \mu_V, \Lambda_V) \sim \mathcal{N}(V_j \mid \mu_V, \Lambda_V).    (5.10)
Here, U_i is an individual user feature, and S_i is an additional feature, which we will refer to as a
“shifted” user feature. The shifted user feature is used for prediction of the user ratings r_{i,j} for a given
user i. The additional layer allows the model to be flexible in accounting for the network information.
By adjusting the magnitude of an individual’s feature Ui relative to the first degree neighbours, and
by adjusting the noise term ΛS , the model can give preference to individual tastes over those of the
first degree neighbours. Relating back to a model with tuning parameters [20], this flexibility allows the
model to adjust the trade-off α in a probabilistic manner between user tastes and social tastes at a user
level.
If \|a_i\| = 0, then the sum is empty, and we define the prior mean for S_i to be \mu_S + U_i.
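Ancestral sampling from Equation (5.10) is straightforward. The sketch below assumes zero prior means and scaled-identity precisions for simplicity; these are simplifying assumptions of the sketch, not part of the model.

```python
import numpy as np

def sample_shift_model(A, d, M, tau=4.0, lam_U=1.0, lam_S=1.0, lam_V=1.0, seed=0):
    """Ancestral sampling from the proposed Shift model (Equation 5.10),
    with each precision matrix taken to be a scaled identity."""
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    U = rng.normal(0.0, lam_U ** -0.5, size=(d, N))   # individual user features
    V = rng.normal(0.0, lam_V ** -0.5, size=(d, M))   # item features
    S = np.empty((d, N))
    for i in range(N):
        deg = A[i].sum()
        nbr_mean = (U @ A[i]) / deg if deg > 0 else 0.0  # empty sum if no links
        S[:, i] = rng.normal(U[:, i] + nbr_mean, lam_S ** -0.5)
    R = rng.normal(S.T @ V, tau ** -0.5)              # ratings from shifted features
    return U, S, V, R
```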
In the Bayesian context, we place standard Gaussian-Wishart priors on the feature-vector hyperparameters
(\mu_U, \Lambda_U) \sim \mathcal{N}(\mu_U \mid \mu_0, \beta_0 \Lambda_U) \cdot \mathcal{W}(\Lambda_U \mid \nu_0, \Lambda_0)
(\mu_S, \Lambda_S) \sim \mathcal{N}(\mu_S \mid \mu_0, \beta_0 \Lambda_S) \cdot \mathcal{W}(\Lambda_S \mid \nu_0, \Lambda_0)
(\mu_V, \Lambda_V) \sim \mathcal{N}(\mu_V \mid \mu_0, \beta_0 \Lambda_V) \cdot \mathcal{W}(\Lambda_V \mid \nu_0, \Lambda_0).    (5.11)
5.4 Inference
The choice of a conjugate prior leads to analytically tractable Gibbs updates. The sampling distribution for the item features V_j is identical to the PMF case. In this section, we summarize the sampling
distributions for the user features and the shift features.
For the shift features Si,
(S_i \mid \cdot) \sim \mathcal{N}(S_i \mid \mu_{S_i}, \Lambda_{S_i}),
\text{where } \Lambda_{S_i} = \Lambda_S + \tau \sum_j I_{i,j} V_j V_j^\top
\mu_{S_i} = \Lambda_{S_i}^{-1} \Big[ \Lambda_S \Big( \mu_S + U_i + \|a_i\|^{-1} \sum_j U_j a_{i,j} \Big) + \tau \sum_j I_{i,j} V_j r_{i,j} \Big].    (5.12)
For the user features U_i,

(U_i \mid \cdot) \sim \mathcal{N}(U_i \mid \mu_{U_i}, \Lambda_{U_i}),
\text{where } \Lambda_{U_i} = \Lambda_U + \Lambda_S + \Lambda_S \sum_{k \neq i} \frac{a_{k,i}^2}{\|a_k\|^2}
\mu_{U_i} = \Lambda_{U_i}^{-1} \Big[ \Lambda_U \mu_U + \Lambda_S \Big( S_i - \mu_S - \|a_i\|^{-1} \sum_j U_j a_{i,j} \Big)
  + \Lambda_S \sum_{k \neq i} \frac{a_{k,i}}{\|a_k\|} \Big( S_k - \mu_S - U_k - \|a_k\|^{-1} \sum_{j \neq i} U_j a_{k,j} \Big) \Big].    (5.13)
Unlike the previous work, our model does not suffer from the pathological behaviour outlined in
Section 5.2.2. For the special two-user case, the above simplifies to
\mu_{S_1} = \Lambda_{S_1}^{-1} \Big[ \Lambda_S (\mu_S + U_1 + U_2) + \tau \sum_j I_{1,j} V_j r_{1,j} \Big]
\mu_{S_2} = \Lambda_{S_2}^{-1} \Big[ \Lambda_S (\mu_S + U_1 + U_2) + \tau \sum_j I_{2,j} V_j r_{2,j} \Big]
\mu_{U_1} = \Lambda_{U_1}^{-1} \Big[ \Lambda_U \mu_U + \Lambda_S (S_1 + S_2 - 2\mu_S - 2U_2) \Big]
\mu_{U_2} = \Lambda_{U_2}^{-1} \Big[ \Lambda_U \mu_U + \Lambda_S (S_1 + S_2 - 2\mu_S - 2U_1) \Big].    (5.14)
So for any given user, the user feature U1 is constrained by the sum of the shift features S1 +S2 and
the other user’s feature U2. This prevents the user feature U1 from being directly dependent on the user
feature U2, avoiding the pathological behaviour.
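As one concrete example of these conjugate updates, the conditional for a shift feature in Equation (5.12) is a standard Gaussian posterior and can be sampled directly. The function below is an illustrative sketch with our own variable names.

```python
import numpy as np

def gibbs_sample_S_i(i, R, I, U, V, A, mu_S, Lam_S, tau, rng):
    """Draw S_i from its conditional in Equation (5.12)."""
    d = V.shape[0]
    obs = I[i].astype(bool)
    V_obs = V[:, obs]                              # items rated by user i
    Lam_post = Lam_S + tau * V_obs @ V_obs.T       # posterior precision
    deg = A[i].sum()
    nbr_mean = (U @ A[i]) / deg if deg > 0 else np.zeros(d)
    prior_mean = mu_S + U[:, i] + nbr_mean
    b = Lam_S @ prior_mean + tau * V_obs @ R[i, obs]
    mu_post = np.linalg.solve(Lam_post, b)
    cov = np.linalg.inv(Lam_post)                  # covariance = precision^{-1}
    return rng.multivariate_normal(mu_post, cov)
```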
5.5 Experimental Setup
The following experiments were conducted to compare the performance of our proposed model against the vanilla
PMF model and the network model with dual L2 penalization of the user features [15]. Moving forward,
we refer to the model with dual penalization as the “Net model”, and our proposed model as the “Shift
model”.
To obtain MAP estimates, batch gradient descent on the equivalent energy functions for the vanilla
PMF, existing Net model, and the proposed Shift model was performed. These MAP estimates were used
to confirm the importance of using the user network, and to validate the proposed Shift model. Different
learning rates, momentum, and penalties on the features were explored in order to obtain suitable MAP
estimates. However, no grid search over these tuning parameters was completed. The end result is that
some MAP estimates may be sub-optimal.
Appropriate MAP estimates for the proposed Shift model can be obtained naively by initializing
the user, item, and shift features at random and performing gradient descent. Alternatively, the MAP
estimate for the user and item features from the baseline PMF model can be used as a starting point
for gradient descent of the Shift model.
These MAP estimates obtained were then used to initialize Gibbs samplers for all models. Inference
for all three models was performed on one set of data to highlight the pathological behaviour for the
special case outlined previously.
Four data sets were used for experimentation: Epinions, Flixster, Ciao, and Filmtrust. Summary
statistics of these four data sets are given in Table 5.1. Epinions is a data set of users rating generic
products. The other three are data sets of users rating movies. The four data sets range in orders of
magnitude for both user base and item base. On average, users have more ratings in each data set than
outgoing links, though this is not true when looking at the median per user. Flixster is an exception:
half the users have fewer than four ratings, but fewer than five outgoing links.
The notion of using user networks in the collaborative filtering framework is built on the idea that
incorporating the user network into the prior will help predictive performance. Typically, more active
users will tend to have more ratings in the system and more outgoing links. A high number of outgoing
links for users with few or no ratings may suggest a noisy network. Table 5.1 also reports the Spearman
correlation between the number of ratings and the outdegree of the users in each data set. Note that
this is highest at nearly 0.6 for Ciao, dropping to a low of 0.1 for Filmtrust. Note also that the Ciao user
network is an order of magnitude denser than the others. In the context of the data set, this means
that users tend to trust other product raters, but do not necessarily review many products themselves. This
could be an indication that the Ciao user network may be noisy.
The training / test split of the ratings was varied across data sets to ensure sparsity, in particular,
to ensure there was a sufficient number of users with no ratings and O(1) ratings in the training set.
We report the results using the train / test splits indicated in Table 5.1. Other train / test splits were
explored, with the general observation that the performance difference between the two models decreased
as the amount of data to train on increased. In any case, using the true network did not result in the
proposed model producing statistically worse results than the baseline PMF model.
For gradient descent, the 80% training set was subsequently split into a 70% training, 30% validation
set.
Similar to the results from previous chapters, we report the overall test RMSE, as well as test RMSE
for subsets of users with different frequency in the training set.
5.6 Experimental Results
Figure 5.1 illustrates the test RMSE for users of different frequency in the training set for the four data
sets. For Epinions, Flixster, and Filmtrust, the trend is for the test RMSE to decrease as the number
of ratings increases. This trend is not as strong in the Ciao test set, where there is little change as the
number of ratings increases. The trend for the RMSE to decrease as the number of ratings increases
is present in the training set, where the binned RMSE values range from 3% to 18% lower than they do
on the test set. These two observations suggest overfitting, despite the use of a
validation set to terminate gradient descent.
For all data sets, both the Net model and the proposed Shift model consistently result in lower test
Table 5.1: Summary statistics on the four data sets used for experimentation.

|                             | Epinions | Flixster | Ciao    | Filmtrust |
|-----------------------------|----------|----------|---------|-----------|
| Users                       | 49,289   | 109,218  | 7,375   | 1,508     |
| Items                       | 139,738  | 42,173   | 106,797 | 2,071     |
| Rating Matrix Sparsity      | 0.0097%  | 0.1331%  | 0.0361% | 1.1366%   |
| Mean Ratings / User         | 13       | 56       | 39      | 24        |
| Median Ratings / User       | 4        | 4        | 18      | 16        |
| User Network Sparsity       | 0.0201%  | 0.0113%  | 0.2055% | 0.0718%   |
| Mean Outdegree / User       | 10       | 12       | 15      | 1         |
| Median Outdegree / User     | 1        | 5        | 3       | 0         |
| Rating / Degree Correlation | 0.4607   | 0.2056   | 0.5910  | 0.0996    |
| Training                    | 80%      | 15%      | 50%     | 70%       |
RMSE than the baseline PMF model. The comparison between the Net model and the Shift model is
less clear. For some subsets of users in some data sets (users with two ratings in Epinions), the Net
model outperforms our proposed Shift model, while the converse is true for other subsets of users in
other data sets (ex: users with no ratings in Filmtrust). These results confirm the benefit of including
user networks as constraints in the model.
5.6.1 Pathological Network Behaviour
We initialize a Gibbs sampler using the MAP estimates for each of the three models on the Epinions data
set in order to highlight the inferential difficulties for users with low outdegree and low rating frequency.
With respect to the overall test RMSE, the Net model performs better both in mixing and final error
rate than the standard PMF model. This is illustrated in Figure 5.2 (a), where the overall test RMSE
over iterations is plotted for all three models. However, our results show that the Net model actually
performs worse for rare users, on average, than the standard PMF model after running a Gibbs sampler.
This is illustrated in Figure 5.2 (b), where the final test RMSE for users of different frequency is plotted
for the three models. In particular, the Net model now performs worse than the model with no user
network for users with O(1) ratings.
The shift features used for predicting the ratings incorporate the network dependence, and do not
suffer from this performance loss for rare users. Figure 5.2 (b) illustrates that the proposed model
outperforms the PMF model with no network for users of all frequency, even users with no ratings.
Figure 5.2 (a) shows that the added complexity of the hierarchical model has nearly the same convergence
rate as the Net model in the initial samples. In other words, the initial mixing rate of the sampler does
not appear to be affected by the extra layer of features for the users.
This illustrates that the increase in test RMSE in the Net model is driven by a subset of users: those with
a low number of ratings in the training set. In addition, there is unusual behaviour in the estimates of
the user features for users with a low number of ratings in the training set and low outdegree. This
is the special case we previously highlighted in Section 5.2.2. Let \|r_i\| = \sum_j I_{i,j} denote the frequency
of user i in the training set, and recall that \|a_i\| was defined as the outdegree of user i. Consider a
modified geometric mean of these two terms:
\sqrt{(\|r_i\| + 1) \cdot (\|a_i\| + 1)}.    (5.15)
This is the geometric mean of user outdegree and user frequency, with each incremented by 1.

Figure 5.1: Test RMSE for users of different frequency for the features learned from MAP estimation.
[Figure: panels (a) Epinions, (b) Flixster, (c) Filmtrust, (d) Ciao; each plotting test RMSE against training-set frequency bins [0-1) through [10-25) for the PMF, Net, and Shift models.]

This modified geometric mean is plotted against the user feature norms sampled in the final run for the Net
model in Figure 5.3 (a) and against the shift feature norms sampled in the final run for the proposed
Shift model in Figure 5.3 (b). With the Net model, L2 norms in excess of 20 are found for users where
this modified geometric mean is small. These L2 norms correspond to users with low outdegree and
low frequency in the training set. These L2 norms decrease by an order of magnitude for more frequent
users (as defined by this geometric mean). Such a relationship is unexpected, and is counter to the
prior. Indeed, the prior states that users with no ratings and no connections should have a Gaussian
distribution centered around zero. With few ratings, the model prior should dominate and these features
should have distributions centered close to zero. These large L2 norms for these relatively inactive users
are not probable under the model prior. Tied with the increase in test RMSE in Figure 5.2, this is
suggestive of poor generalization.
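For concreteness, the metric in (5.15) is a one-liner; the function name below is ours, not from the text:

```python
import math

def modified_geometric_mean(num_ratings, outdegree):
    # sqrt((||r_i|| + 1) * (||a_i|| + 1)): each count is incremented by 1 so
    # that users with no ratings or no links still get a finite positive score.
    return math.sqrt((num_ratings + 1) * (outdegree + 1))

# A user with no ratings and no outward links sits at the minimum value of 1.
print(modified_geometric_mean(0, 0))  # 1.0
print(modified_geometric_mean(3, 7))  # sqrt(32), about 5.66
```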
This issue of large norms is not the case for the proposed Shift model. Figure 5.3 (b) illustrates
that the norms do not have excessive magnitude for the same set of users. The vertical scale for the
Shift feature norm is approximately 0− 3, and is approximately 0− 30 for the norm of the user features
sampled under the Net model. Figure 5.3 (c) plots the absolute change in user test RMSE between the
proposed Shift model and the Net model against this same metric. This plot is symmetric about 0 when
the modified geometric mean is greater than 5, suggesting neither model provides consistently better
RMSE for these users. When it is less than five, there is a tendency for this quantity to be negative.
Figure 5.2: Test RMSE under Gibbs sampling for the Epinions data set under the PMF, Net, and proposed Shift model. (a) Overall test RMSE against iteration; (b) final test RMSE against number of ratings at training time, from [0-1) through 100+. [Plots; legends: PMF/Net/Shift in (a), Baseline/Net/Shift in (b).]
Figure 5.3: (a) User feature L2 norm under the Net model, (b) shift feature L2 norm under the Shift model, and (c) absolute change in user test RMSE in the Epinions data set, each plotted against the modified geometric mean of rating frequency and outdegree. Note that (a) and (b) are not on the same vertical scale. [Scatter plots against the modified geometric mean.]
This reflects the decrease in test RMSE for these rare users under the proposed Shift model relative to
the existing Net model.
5.6.2 Shift Model Performance
Using the MAP estimates obtained for Epinions, Flixster, and Filmtrust, Gibbs samplers were run for
500 iterations for the baseline PMF model and the proposed Shift model. The Gibbs samplers for
Ciao, when initialized from the MAP estimates obtained, were found to perform worse than starting
from random. In particular, the baseline PMF model was trapped in a local optimum, and showed no
decrease in test RMSE for several hundred iterations. This was not found for the proposed Shift model,
whose test RMSE started to decrease immediately. To present a less extreme comparison of the models for this data
set, we present results for Ciao with the Gibbs samplers started with random latent features.
In overall test RMSE, our proposed Shift model tends to have a faster convergence rate than the
vanilla PMF model. Figure 5.4 illustrates the overall error rate of our proposed Shift model and the
PMF model for the four data sets considered.
Similar gains are seen for users of different rating frequency in the training set. Figure 5.5 plots
Figure 5.4: Overall test RMSE under Gibbs sampling for the four data sets considered. The proposed Shift model consistently outperforms the baseline PMF model. (a) Epinions, (b) Flixster, (c) Ciao, (d) Filmtrust. [Plots of test RMSE against iteration for the PMF and Shift models.]
the relative change over samples in the test RMSE in the Shift model from the PMF model for users of
different frequency in the four different data sets. Here, a negative relative change means the Shift model
has a lower test RMSE for that set of users in that data set for that sample when compared to the PMF
model. The strongest results are seen in Epinions and Ciao, with consistent but smaller improvements
for Flixster and Filmtrust. The final relative gains achieved by the Shift model relative to the baseline
PMF model for users of different frequency are listed in Table 5.2. The relative changes are generally in
favour of the Shift model. The few exceptions (e.g., users with one or two ratings in Flixster) are small
(less than 0.10% for these users) and not practically significant. The largest counterexample is for
users with at least 100 ratings in the Ciao data set (a relative increase in test RMSE of 0.88%), but this is minor in
comparison to the 2.3% relative drop in test RMSE seen for much less frequent users.
5.6.3 Fake Networks
Most user networks are noisy, with user links not necessarily conveying taste similarity. To investigate
how our proposed model performs when the user network contains less than perfect information, we
look at the performance of the MAP estimate under the following cases. First, a partially observed user
network where each link is included with probability 0.5. Second, a user network with links generated at
Table 5.2: Test Error for the four data sets by user frequency

(a) Epinions

Frequency   Baseline  Shift   Relative Change
[0-1)       1.1701    1.1634  -0.57%
[1-2)       1.1313    1.1223  -0.80%
[2-3)       1.0879    1.0750  -1.19%
[3-4)       1.0739    1.0607  -1.23%
[4-5)       1.1047    1.0958  -0.81%
[5-10)      1.0763    1.0635  -1.19%
[10-25)     1.0738    1.0558  -1.68%
[25-50)     1.0715    1.0424  -2.72%
[50-100)    1.0652    1.0293  -3.37%
100+        1.0562    1.0056  -4.79%

(b) Flixster

Frequency   Baseline  Shift   Relative Change
[0-1)       1.1106    1.1075  -0.28%
[1-2)       1.0345    1.0353   0.08%
[2-3)       0.9908    0.9910   0.02%
[3-4)       0.9747    0.9740  -0.07%
[4-5)       0.9468    0.9462  -0.06%
[5-10)      0.9466    0.9454  -0.13%
[10-25)     0.9346    0.9331  -0.16%
[25-50)     0.9079    0.9077  -0.02%
[50-100)    0.8743    0.8753   0.11%
100+        0.8406    0.8395  -0.13%

(c) Ciao

Frequency   Baseline  Shift   Relative Change
[0-1)       0.8734    0.8727  -0.08%
[1-2)       0.9752    0.9399  -3.62%
[2-3)       0.9941    0.9698  -2.44%
[3-4)       1.0060    0.9783  -2.75%
[4-5)       1.0094    0.9807  -2.84%
[5-10)      1.0041    0.9719  -3.21%
[10-25)     1.0353    0.9955  -3.84%
[25-50)     1.0201    0.9785  -4.08%
[50-100)    1.0423    0.9934  -4.69%
100+        0.8043    0.8114   0.88%

(d) Filmtrust

Frequency   Baseline  Shift   Relative Change
[0-1)       0.8686    0.8657  -0.33%
[1-2)       0.8094    0.8097   0.04%
[2-3)       0.9946    0.9839  -1.08%
[3-4)       0.7002    0.6911  -1.30%
[4-5)       0.7122    0.7127   0.07%
[5-10)      0.8040    0.8039  -0.01%
[10-25)     0.7420    0.7409  -0.15%
[25-50)     0.7727    0.7711  -0.21%
[50-100)    0.8736    0.8718  -0.21%
Figure 5.5: Change in test RMSE over Gibbs samples for the proposed Shift model relative to the baseline PMF model. Negative values correspond to the proposed Shift model performing better than the baseline PMF model. (a) Epinions, (b) Flixster, (c) Ciao, (d) Filmtrust. [Plots of relative test RMSE against iteration, one curve per frequency bin: [0-1), [1-2), [2-3), [3-4), [4-5), [5-10).]
random. Many random networks were attempted. The results presented here use random networks
generated according to Algorithm 5.1.
Gradient descent on the energy functions with these networks was performed, and the resulting error
rates for users of different frequency are considered. Figure 5.6 plots the test RMSE relative to the
standard PMF model for the Shift model with random network, partially observed network, and fully
observed network. We note that the random network offers no overall advantage relative to the standard
PMF model, and performs worse for rare users.
With these MAP estimates, we run a Gibbs sampler and compare the resulting test performance.
Figure 5.8 illustrates the resulting test error for the models with no network, a partial network, the full
network, and a random network for the four data sets. There are several observations to make.
1. In three cases (Epinions, Ciao, and Filmtrust), the partially observed network performs better
than the model with no network, but worse than the full network;
2. For the fourth data set (Flixster), the partially observed and random networks are both worse than
the model with no network;
Algorithm 5.1: Generation of Random Networks

Compute P(∑j ai,j = 0), the probability of a user having no links;
Compute E[‖ai‖], the average outdegree per user;
Compute ∑i ai,j for j = 1, . . . , N, the indegree of each user;
for user i = 1, . . . , N do
    Sample p ∼ Unif(0, 1);
    if p > P(∑j ai,j = 0) then
        Sample links for the current user with probability proportional to the indegree of all other users;
    else
        Generate no links for user i;
    end
end
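A runnable sketch of Algorithm 5.1 might look as follows. The number of links drawn for each linked user (the rounded average outdegree) and the +1 smoothing of zero indegrees are our assumptions; the algorithm as stated leaves these details open.

```python
import random

def random_network(adj, seed=0):
    # Sketch of Algorithm 5.1 for a directed network given as a dict
    # mapping each user to the set of users they follow.
    rng = random.Random(seed)
    users = list(adj)
    n = len(users)
    p_zero = sum(1 for u in users if not adj[u]) / n         # P(no links)
    avg_out = max(1, round(sum(len(adj[u]) for u in users) / n))
    indegree = {u: 0 for u in users}
    for u in users:
        for v in adj[u]:
            indegree[v] += 1
    fake = {}
    for u in users:
        if rng.random() > p_zero:
            # sample links with probability proportional to indegree
            others = [v for v in users if v != u]
            weights = [indegree[v] + 1 for v in others]      # +1 avoids zero weights
            fake[u] = set(rng.choices(others, weights=weights,
                                      k=min(avg_out, len(others))))
        else:
            fake[u] = set()
    return fake

real = {1: {2, 3}, 2: set(), 3: {1}, 4: {1}}
fake = random_network(real)
print(sorted(fake))  # [1, 2, 3, 4]
```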
Figure 5.6: Test RMSE relative to the PMF model for the random, partially observed, and fully observed network. [Plot against rating-frequency bins [0-1) through 100+; legend: Partial Net, Full Net, Fake Net.]
3. In three cases (again: Epinions, Ciao, and Filmtrust), the random network performs better than
no network, but worse than any real network.
The first is expected. Partial new information is being introduced. It is expected that this would
improve over the model with no network, but carry less information than the full network.
The second is suggestive of overfitting. The relative performance of the four models has changed
order between the training set and the test set. The partial and random networks, achieving the best error
rates in the training set, are achieving the worst error rates in the test set (worse than no network).

The third is not expected. There is no reason a priori to believe that random noisy links will
improve over a model with no links. There are a few possible explanations for this, which we explore in
the following section.
Random Links
The experimental results indicated that completely random networks perform better than no network
in several data sets. There are some possible explanations for this.
First, there is a positivity bias in the ratings for all of these data sets, as is common with recommender
systems [24, 23]. In other words, users tend to provide explicit (positive) feedback for those items that
they are interested in and enjoy. Given this, there is a weak correlation over ratings in the entire data
set. This weak correlation may be encoded in the random network. Although not as predictive as the
real network, the random network is providing information to the Gibbs sampler.
Second, the model proposed constrains any two pairs of users. To illustrate, suppose we have four
users, and the adjacency graph is cyclic. That is, the adjacency graph is given by
\begin{align}
A = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}. \tag{5.16}
\end{align}
In this case, the features are pairwise constrained. That is, U1 is constrained to U2, which is con-
strained to U3, which is in turn constrained to U4, and back to U1. Random networks can include, for
instance,
\begin{align}
A_1 = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix},\quad
A_2 = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 \end{pmatrix},\quad
A_3 = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}. \tag{5.17}
\end{align}
Each of these three random networks is probable under Algorithm 5.1. However, each of these
networks is also formed from second, third, or fourth degree connections in the real network. The
transitivity of the constraints under the real model means, for instance, that the constraint between U1
and U4 imposed in A1 is weakly present in the real network A. This may be another possible explanation
why the random networks are performing better than no network.
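This transitivity can be checked mechanically: treating the adjacency matrix as boolean, the k-th power of A marks k-th degree connections. For the cyclic network of (5.16) (0-indexed below), the link U1 → U4 imposed by A1 is a third-degree connection in A:

```python
def bool_matmul(A, B):
    # boolean matrix product: result[i][j] is True iff A[i][k] and B[k][j] for some k
    n = len(A)
    return [[any(A[i][k] and B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Cyclic network of (5.16): user 1 -> 2 -> 3 -> 4 -> 1 (0-indexed here)
A = [[0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1],
     [1, 0, 0, 0]]
A_pow2 = bool_matmul(A, A)       # second-degree connections
A_pow3 = bool_matmul(A_pow2, A)  # third-degree connections

print(A_pow3[0][3])  # True: U1 reaches U4 in exactly three hops
```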
Third, a much more ambitious explanation is that the sampler is able to “learn” what the right
constraints should be from the data. Recall that the user shift features used for rating prediction are
modeled as Gaussians, with mean

E[Si] = Ui + ‖ai‖−1 ∑j Uj ai,j.
Following the same notation, suppose we have two networks:

1. A real network A with links ai,j;

2. A random network Ã with links ãi,j.

The difference between the shift under the true network and the random network is

‖ai‖−1 ∑j Uj ai,j − ‖ãi‖−1 ∑j Uj ãi,j. (5.18)
The first term is the shift that should occur under the true network, while the second term is the
shift that will occur with the random network being observed. For the Gibbs run with the random
network, we compute this term for each user in each Gibbs sample. The L2 norms are computed for
each user i at each iteration t, and normalized by the norm of the shift feature Si(t). For each iteration t, quantiles
of these differences are computed, and we plot the value of these quantiles against iteration in Figure
5.9 for the four data sets. For two data sets where the random network model outperforms the vanilla
PMF model with no network (Epinions and Ciao), these norms drop quickly over samples. For a third
(Filmtrust), these remain large, but the difference between the random network and the vanilla PMF
model in test error is small (Figure 5.7). For the data set where the random network performs worse than
the vanilla PMF model, these quantiles are larger than the corresponding quantiles under the Epinions and Ciao data sets.
This suggests that the model may be able to infer the correct shift needed based on the observed data.
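The per-user quantity just described can be sketched as follows; the data structures and names are illustrative:

```python
import math

def shift_difference_norms(U, real_net, fake_net, shifts):
    # For each user, the L2 norm of the difference (5.18) between the
    # network-induced shift under the real and the random network,
    # normalized by the norm of the sampled shift feature S_i.
    # U and shifts map user -> feature vector; the nets map user -> set
    # of followed users.
    def net_shift(u, net):
        dim = len(U[u])
        if not net[u]:
            return [0.0] * dim
        s = [0.0] * dim
        for v in net[u]:
            for d in range(dim):
                s[d] += U[v][d]
        return [x / len(net[u]) for x in s]

    norms = {}
    for u in U:
        real = net_shift(u, real_net)
        fake = net_shift(u, fake_net)
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(real, fake)))
        denom = math.sqrt(sum(x * x for x in shifts[u])) or 1.0
        norms[u] = diff / denom
    return norms

U = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
real = {1: {2}, 2: {3}, 3: {1}}
fake = {1: {3}, 2: {3}, 3: {2}}
norms = shift_difference_norms(U, real, fake, shifts=U)
print(norms[2])  # 0.0 -- user 2 has the same links in both networks
```

Per-iteration quantiles of these values (e.g., via `statistics.quantiles`) give the curves plotted in Figure 5.9.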
It should be noted from Table 5.1 that the data set where the test error for the random network
is not better than that of the model with no network, Flixster, is also the one with the least dense user network. It also has the
largest user base. In addition, Table 5.1 indicates that the Flixster data set has the lowest correlation
between the number of ratings a user has and the outdegree. The low density and large user base means
that the random network generated, which simulates the density and properties of the real network, will
be less likely to have second and third degree connections. This supports the second theory. The low
correlation between the number of ratings and the outdegree means that the network may be dominating
over the ratings in the model.
Figure 5.7: Training RMSE under Gibbs sampling for the four data sets under the Shift model with fully observed user network, partially observed user network, and completely random user network. The test RMSE for the baseline PMF model is included. (a) Epinions, (b) Flixster, (c) Ciao, (d) Filmtrust. [Plots of RMSE against iteration; legend: Baseline, Half Truth, Full Truth, False.]
Figure 5.8: Test RMSE under Gibbs sampling for the four data sets under the Shift model with fully observed user network, partially observed user network, and completely random user network. The test RMSE for the baseline PMF model is included. (a) Epinions, (b) Flixster, (c) Ciao, (d) Filmtrust. [Plots of test RMSE against iteration; legend: Baseline, Half Truth, Full Truth, False.]
Two Asymmetric User Sets
To test the theory of weak correlation among all ratings submitted, the data can be modified to create
two user sets with asymmetric tastes. Consider the following experiment. For each (user, item, rating)
triplet, we randomly decide to keep the rating as is, or to “flip” the rating to the opposite (a
5 becomes a 1, a 4 becomes a 2, etc.). If the rating is flipped, we replace the user index u with 2u. This
effectively duplicates each user by creating one with the opposite rating patterns. Each link in the user
network (u1, u2) was modified to (u1, 2u2). Effectively, this has each user following the duplicate user
with completely opposite rating patterns. This represents an extreme case where there are two subsets
of users with asymmetric tastes. User links exist between the two user subsets, but not within
them. There is a positivity bias among one set of users, but a negativity bias among the other set of
users.
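The construction of the duplicated, taste-flipped data can be sketched as follows (names are ours; ratings are assumed to lie on a 1-5 scale, and user indices are assumed chosen so that 2u does not collide with an existing user):

```python
import random

def duplicate_with_flips(ratings, links, max_rating=5, seed=0):
    # Create two user subsets with asymmetric tastes: each rating is kept
    # as is with probability 1/2, or flipped (5 -> 1, 4 -> 2, ...) and
    # reassigned to the duplicate user 2u. Every network link (u1, u2) is
    # redirected to the flipped copy of u2.
    rng = random.Random(seed)
    new_ratings = []
    for (u, i, r) in ratings:
        if rng.random() < 0.5:
            new_ratings.append((2 * u, i, max_rating + 1 - r))
        else:
            new_ratings.append((u, i, r))
    new_links = [(u1, 2 * u2) for (u1, u2) in links]
    return new_ratings, new_links

ratings = [(1, 10, 5), (3, 11, 2)]
links = [(1, 3)]
new_ratings, new_links = duplicate_with_flips(ratings, links)
print(new_links)  # [(1, 6)]: user 1 now follows the flipped copy of user 3
```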
Under this pathological user network, the shift model performs worse than the baseline overall.
Figure 5.10 (a) illustrates the test error over sampling runs. The performance of the model with no
Figure 5.9: Quantiles of the relative L2 magnitude of the difference in the network-induced shift factor in the fake network case from what it would be if the true network was used. (a) Epinions, (b) Flixster, (c) Ciao, (d) Filmtrust. [Plots of the normalized shift difference against iteration; quantile curves at 0.05, 0.25, 0.50, 0.75, 0.95.]
network is also included. The relative difference between the two does shrink marginally over samples,
but the model with no network still performs better in test RMSE.
The same performance is present when we examine subsets of users with different frequency in
the training set. Figure 5.10 (b) illustrates the test RMSE for users with different number of ratings
at training time. The model with no network outperforms the model with the pathological network
between the two contrived sets of users for all frequencies. In particular, the relative difference is most
notable for the least frequent users, shrinking as the user becomes more frequent. This is expected, given
that the rating data will dominate over the data from the user network as the number of ratings increases.
5.7 Conclusion
We have reviewed existing work on matrix factorization models that make use of user networks as
constraints in probabilistic priors. This model uses the user network to modify the prior mean for each
user feature. We reviewed the form of gradient descent update of the user features under the equivalent
Figure 5.10: (a) Overall test RMSE and (b) test RMSE binned by user frequency in the case of two asymmetric user sets. Here, the Shift model performs worse than the baseline PMF model, with the largest difference coming from rare users. [Plots; legend: Baseline, Shift with User Duplication.]
energy function, and highlighted how the updates reduce to the baseline PMF model in the special case
of two intra-connected users who have no outward connections. Placing the existing model in the Gibbs
sampling framework, we highlighted how the existing model is prone to overfitting for such users, and
validated this analytical result with experimental results using the Epinions data set. We proposed
an alternative model based on a two-level hierarchy of features for users that avoids this issue. We
validated the performance of our proposed model against the baseline PMF model using Gibbs sampling
for multiple data sets. Our proposed model consistently has a higher drop in test RMSE over iterations
than the baseline PMF model, and frequently converges to a lower test RMSE than the baseline PMF
model.
Testing the performance of our proposed model in the presence of partial network information shows
only minor performance degradation, as measured by an increase in test RMSE over Gibbs samples,
compared to in the presence of full network truth. The performance under partial network information
was still superior to the performance under the baseline PMF model. Additional simulations with
completely random user networks show increasing performance degradation, as measured by test RMSE
over samples, compared to partial and full network truth. However, our proposed model under completely
random networks still performs significantly better than the baseline PMF model. The exact reason for
this is an open question and the focus of future work.
Chapter 6
Conclusion
Originally known best for video streaming services, recommender systems have evolved into a tool with
general applications in preference matching and information retrieval. These applications extend to
friend suggestion, music suggestion, news aggregation, online dating, and general
preference matching. Collaborative-based systems have emerged as a common implementation as they
are scalable, efficient to learn, and are suited to a mix of media.
Many variants of the Gaussian probabilistic matrix factorization model have been proposed in the
literature, almost always reporting test performance superior to the baseline PMF model. Very little work
has addressed the issue of performance with respect to the amount of information a user provides. Indeed,
if a user is very active in the system, the predictions of these PMF models can be trusted. However,
much simpler models can perform nearly as well with large amounts of ratings for a particular user.
It can also be argued that these frequent users should not be the target of improved recommendation.
They are already active and committed to the system. More focus should be given to rare users in
the system, those with few to no ratings in the system. Not only do these users compose the majority
of users in the system, they are the ones that need to be given useful suggestions in order to increase
activity and engagement.
In Chapter 2, we reviewed an existing Constrained PMF model. We also reviewed two approaches
to inference common in the literature: Gibbs sampling and variational inference. We extended this
constrained PMF model to the fully Bayesian framework, and demonstrated that Gibbs sampling under
the fully Bayesian model uniformly outperforms MAP estimation for users of different frequency in the
system. In comparing Gibbs sampling and variational inference, we found cause to advocate for Gibbs
sampling in order to avoid overfitting and issues with tuning parameters.
In Chapter 3, we reviewed existing work on heteroskedastic PMF models. We demonstrated that the
gains previously reported were not coming from rare users at all. When using variational inference to
learn the model, we discovered overfitting and poor generalization. We proposed a truncated precision
model to overcome this overfitting issue in the variational context, and illustrated how the existing
heteroskedastic model and the baseline PMF model arise as limiting cases. We compared the performance
of the truncated model to both, and illustrated that sensible bounds can improve upon both.
In Chapter 4, demographic and other personal attributes were introduced as constraints for user
features. This was based on work from an experiment with a different context with an analogous goal
(predicting personal attributes from public information on Facebook). We demonstrated that these
models perform nearly uniformly better than the baseline PMF model for users of different frequency.
Given the sparse nature of the demographics, we proposed a PCA-based approach that reduces the
number of parameters in the model and still achieves superior performance.
In Chapter 5, we reviewed recent work on Social Recommendation Systems. These are models that
make use of an existing user network in the recommendation framework. We focused on a matrix
factorization model in particular, inline with the other models discussed. This model was demonstrated
to improve prediction for users, but was not generative, and so could not easily be extended to the
Bayesian framework. We proposed a fully Bayesian model in the same spirit as the existing model,
demonstrated that similar performance is achieved with MAP estimates, and demonstrated that Gibbs
sampling improves upon the MAP estimates, and over the baseline. Further, we illustrated the ability
of the model to outperform the baseline PMF model in the presence of contaminated user networks.
Questions still remain for most of these extensions. Among them:
• Is there a sensible way to adapt the bounds of the precisions in the heteroskedastic model, and
will this improve performance? In our experimental results, we demonstrated that, for bounds of
(1/n, n) for integer n, sensible n can be selected based on the scale of the ratings. Can this be
algorithmically selected?
• What demographics are the most predictive of underlying similarity in tastes, and can these be
learned in a supervised, partially supervised, or unsupervised manner? We found the gain for the
Flixster data set was larger than for the MovieLens data set. The only difference between the two
was the bins selected for the age range, and the inclusion of the user’s occupation in the MovieLens
data set.
• How does the tradeoff between contamination in the user network and gains in test prediction work?
We illustrated an extreme case of two subsets of users with opposite tastes, where the inclusion of
a user network results in performance worse than the baseline model. However, partially observed
and fully contaminated networks were still often better than no network at all. While there are
possible explanations for this, there is no certainty.
Future work can address these questions. In addition, how can these methods be combined? Ensemble
methods were proven to be successful in the Netflix competition, and would be the most direct way to
combine the predictions. However, it is possible to imagine a model combining the ideas of Chapter 4 and
Chapter 5 where demographics shift individual user features, with an additional layer of user features
being shifted based on a user network.
In addition, recent advances in deep learning have popularized the field, and deep learning methods
are now being applied to a variety of problems. Indeed, this thesis began with a review of current
approaches, including some (R)BM variants, which are “building blocks” of deep generative models.
For more media-rich domains (e.g., music, photos, videos), deep learning models are commonly used for
feature extraction. The feature output of such models can be used as auxiliary information and as
inputs into matrix factorization models.
Appendices
Appendix A
Ancillary Results and Derivations
In this appendix, we provide a series of derivations and ancillary results needed to derive the given
results. Primarily, they are needed to obtain the variational lower bound and the conditionals for the
variables of interest.
A.1 Squared Error Term
In the derivation of the conditionals of the feature vectors, it was necessary to expand the squared error
term (ri,j − r̂i,j)² and rewrite it as constants plus a quadratic in terms of Ui, Vj, and Wk. We give these
three derivations here. For notational convenience, we suppress the bias terms, absorbing both γi and
ηj into ri,j.
A.1.1 Quadratic with Respect to User Features
In terms of the user feature vectors,

\begin{align}
(r_{i,j} - \hat{r}_{i,j})^2
&= \bigg[r_{i,j} - \Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)^{\top} V_j\bigg]^2 \nonumber\\
&= \bigg[\Big(r_{i,j} - \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k^{\top} V_j\Big) - \delta_U U_i^{\top} V_j\bigg]^2 \nonumber\\
&= \Big(r_{i,j} - \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k^{\top} V_j\Big)^2
- 2\delta_U \Big(r_{i,j} - \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k^{\top} V_j\Big) V_j^{\top} U_i
+ \delta_U^2\, U_i^{\top} V_j V_j^{\top} U_i, \tag{A.1}
\end{align}

where the last line follows as $V_j^{\top} U_i U_i^{\top} V_j = U_i^{\top} V_j V_j^{\top} U_i$.
A.1.2 Quadratic with Respect to Item Features
In terms of the item feature vectors,

\begin{align}
(r_{i,j} - \hat{r}_{i,j})^2
&= r_{i,j}^2 - 2 r_{i,j}\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)^{\top} V_j \nonumber\\
&\quad + V_j^{\top}\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)^{\top} V_j. \tag{A.2}
\end{align}
A.1.3 Quadratic with Respect to Side Features
Finally, in terms of the side feature vector $W_m$, we have

\begin{align}
(r_{i,j} - \hat{r}_{i,j})^2
= \bigg[r_{i,j} - \Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k \neq m} I_{i,k} W_k\Big)^{\top} V_j - \frac{\delta_W}{n_i} I_{i,m} W_m^{\top} V_j\bigg]^2. \tag{A.3}
\end{align}

Let $\hat{r}_{i,j,-W_m} = \big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k \neq m} I_{i,k} W_k\big)^{\top} V_j$ denote the prediction made without $W_m$. Then,

\begin{align}
(r_{i,j} - \hat{r}_{i,j})^2
&= \Big[(r_{i,j} - \hat{r}_{i,j,-W_m}) - \frac{\delta_W}{n_i} I_{i,m} W_m^{\top} V_j\Big]^2 \nonumber\\
&= (r_{i,j} - \hat{r}_{i,j,-W_m})^2
- 2\frac{\delta_W}{n_i} I_{i,m}\,(r_{i,j} - \hat{r}_{i,j,-W_m})\, V_j^{\top} W_m
+ \frac{\delta_W^2}{n_i^2}\, I_{i,m}^2\, W_m^{\top} V_j V_j^{\top} W_m. \tag{A.4}
\end{align}
A.2 Expectation of Certain Forms
A.2.1 Expectation of Quadratic Forms
Let $x$ be a random vector with mean $\mu$ and covariance matrix $\Sigma$, and let $\Lambda$ be a symmetric matrix. Then

\begin{align}
\mathbb{E}[x^{\top}\Lambda x] = \operatorname{tr}(\Lambda\Sigma) + \mu^{\top}\Lambda\mu. \tag{A.5}
\end{align}
Combined with iterated expectation, this is used to find some expectations in the variational lower
bound. An alternative is to expand the quadratic, which we give an example of below using the user
feature quadratic form.
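Identity (A.5) can be checked by simulation; the particular $\mu$, $\Sigma$, and $\Lambda$ below are arbitrary test values:

```python
import math
import random

random.seed(0)
mu = (1.0, 2.0)
# Sigma = [[1, 0.5], [0.5, 2]], with lower-triangular Cholesky factor L
L = ((1.0, 0.0), (0.5, math.sqrt(1.75)))
lam = (2.0, 1.0)  # Lambda = diag(2, 1), a symmetric matrix

n = 200_000
acc = 0.0
for _ in range(n):
    z0, z1 = random.gauss(0, 1), random.gauss(0, 1)
    x0 = mu[0] + L[0][0] * z0                   # x = mu + L z  ~  N(mu, Sigma)
    x1 = mu[1] + L[1][0] * z0 + L[1][1] * z1
    acc += lam[0] * x0 * x0 + lam[1] * x1 * x1  # x^T Lambda x for diagonal Lambda
mc = acc / n

# Closed form: tr(Lambda Sigma) + mu^T Lambda mu = (2 + 2) + (2 + 4) = 10
print(mc)
```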
A.2.2 User Quadratic Form
In computing the variational lower bound, we need to consider the expectation of quadratic forms such as
\[
\mathbb{E}_Q[(U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U)]
= \mathbb{E}_Q[U_i^\top \Lambda_U U_i] - 2\,\mathbb{E}_Q[U_i^\top \Lambda_U \mu_U] + \mathbb{E}_Q[\mu_U^\top \Lambda_U \mu_U],
\tag{A.6}
\]
which appear from the priors placed on the user, item, and side features. We compute the expectation term by term.

For the first term,
\[
\begin{aligned}
\mathbb{E}_Q[U_i^\top \Lambda_U U_i]
&= \mathbb{E}_Q\big[\mathbb{E}_Q[U_i^\top \Lambda_U U_i \mid \Lambda_U]\big] \\
&= \mathbb{E}_Q\big[\operatorname{tr}(\Lambda_U \Lambda_{U_i}^{-1}) + \mu_{U_i}^\top \Lambda_U \mu_{U_i}\big] \\
&= \operatorname{tr}\big(\mathbb{E}_Q[\Lambda_U]\,\Lambda_{U_i}^{-1}\big) + \mu_{U_i}^\top \mathbb{E}_Q[\Lambda_U]\,\mu_{U_i} \\
&= \bar\nu_U \operatorname{tr}(\bar W_U \Lambda_{U_i}^{-1}) + \bar\nu_U\, \mu_{U_i}^\top \bar W_U \mu_{U_i}.
\end{aligned}
\]
For the second term,
\[
\begin{aligned}
\mathbb{E}_Q[U_i^\top \Lambda_U \mu_U]
&= \mathbb{E}_Q\big[\mathbb{E}_Q[U_i^\top \Lambda_U \mu_U \mid \Lambda_U, \mu_U]\big] \\
&= \mathbb{E}_Q\big[\mathbb{E}_Q[U_i]^\top \Lambda_U \mu_U\big] \\
&= \mu_{U_i}^\top \mathbb{E}_Q\big[\mathbb{E}_Q[\Lambda_U \mu_U \mid \Lambda_U]\big] \\
&= \mu_{U_i}^\top \mathbb{E}_Q[\Lambda_U]\,\bar\mu_U \\
&= \bar\nu_U\, \mu_{U_i}^\top \bar W_U \bar\mu_U.
\end{aligned}
\]
For the final term,
\[
\begin{aligned}
\mathbb{E}_Q[\mu_U^\top \Lambda_U \mu_U]
&= \mathbb{E}_Q\big[\mathbb{E}_Q[\mu_U^\top \Lambda_U \mu_U \mid \Lambda_U]\big] \\
&= \mathbb{E}_Q\big[\operatorname{tr}(\Lambda_U \bar\Lambda_U^{-1}) + \bar\mu_U^\top \Lambda_U \bar\mu_U\big] \\
&= \operatorname{tr}\big(\mathbb{E}_Q[\Lambda_U]\,\bar\Lambda_U^{-1}\big) + \bar\mu_U^\top \mathbb{E}_Q[\Lambda_U]\,\bar\mu_U \\
&= \bar\nu_U \operatorname{tr}(\bar W_U \bar\Lambda_U^{-1}) + \bar\nu_U\, \bar\mu_U^\top \bar W_U \bar\mu_U.
\end{aligned}
\]
Together, the three terms give
\[
\mathbb{E}_Q[(U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U)]
= \bar\nu_U\Big[(\mu_{U_i} - \bar\mu_U)^\top \bar W_U (\mu_{U_i} - \bar\mu_U)
+ \operatorname{tr}\big(\bar W_U(\Lambda_{U_i}^{-1} + \bar\Lambda_U^{-1})\big)\Big],
\tag{A.7}
\]
where bars denote the parameters of the variational Normal-Wishart distribution for $(\mu_U, \Lambda_U)$. Similar expressions hold for the items and the side features.
A.2.3 Gamma Random Variable Expectation
If $X \sim \mathcal{G}(\alpha, \beta)$, with pdf $f_X(x \mid \alpha, \beta) \propto x^{\alpha-1} e^{-\beta x}$, then
\[
\mathbb{E}[\log X] = \psi(\alpha) - \log\beta,
\]
where $\psi(\cdot) = \frac{d}{dx}\log\Gamma(x)$ is the digamma function. This result is important in computing the contribution to the variational lower bound from the user, item, and global precisions.
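This expectation can be checked by Monte Carlo; the following Python sketch (illustrative only; note that NumPy's gamma sampler takes a scale, i.e. $1/\beta$) compares the closed form against a sample average:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(1)
alpha, beta = 3.5, 2.0   # shape alpha, rate beta (pdf proportional to x^{alpha-1} e^{-beta x})

# Closed form: E[log X] = psi(alpha) - log(beta).
exact = digamma(alpha) - np.log(beta)

# Monte Carlo: numpy parametrizes the Gamma by a scale parameter, i.e. 1/beta.
x = rng.gamma(shape=alpha, scale=1.0 / beta, size=500_000)
mc = np.log(x).mean()

assert abs(mc - exact) < 0.01
```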
A.2.4 Wishart Random Variable Expectation
If $X \sim \mathcal{W}(n, V)$ is a $p \times p$ Wishart random matrix, with pdf $f_X(X \mid n, V) \propto |X|^{(n-p-1)/2} e^{-\operatorname{tr}(V^{-1}X)/2}$, then
\[
\mathbb{E}[\log|X|] = \sum_{i=1}^{p} \psi\Big(\frac{n+1-i}{2}\Big) + p\log 2 + \log|V|.
\]
Like the last result, this is necessary to compute the variational lower bound, as it appears from the conjugate Normal-Wishart priors.
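The same kind of Monte Carlo check works here. The following Python/SciPy sketch (illustrative only) compares the closed-form log-determinant expectation against a sample average over Wishart draws:

```python
import numpy as np
from scipy.special import digamma
from scipy.stats import wishart

p, n = 3, 7
V = np.diag([1.0, 2.0, 0.5])   # scale matrix

# Closed form: E[log|X|] = sum_i psi((n+1-i)/2) + p log 2 + log|V|.
exact = sum(digamma((n + 1 - i) / 2) for i in range(1, p + 1)) \
        + p * np.log(2) + np.log(np.linalg.det(V))

# Monte Carlo over X ~ W(n, V).
X = wishart(df=n, scale=V).rvs(size=20_000, random_state=2)
mc = np.mean(np.linalg.slogdet(X)[1])

assert abs(mc - exact) < 0.05
```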
Appendix B
Constrained PMF
In this appendix, we derive the conditional distributions of the features given the observed rating data in the presence of the side information from Chapter 2. The inclusion of side information into the model shifts the mean of the user features, so the conditional for the user features is also rederived. The conditional for the item features follows by substituting the combination of user and side features for the user features in the original derivation from [31].

For notational convenience, we suppress the offsets, absorbing $\gamma_i$ and $\eta_j$ into $r_{i,j}$.
B.0.5 Conditional Posterior for Side Feature
The inclusion of the side information $W_m$ complicates the log-likelihood contribution to the log posterior. The square in the exponent of the Gaussian for $r_{i,j}$ becomes
\[
\sum_{i=1}^{N}\sum_{j=1}^{M} -\frac{I_{i,j}\alpha_i\beta_j\tau}{2}
\Big[r_{i,j} - \Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)^{\!\top} V_j\Big]^2,
\tag{B.1}
\]
where $n_i = \sum_{k=1}^{M} I_{i,k}$. Using the properties of the transpose and expanding the square yields
\[
\sum_{i=1}^{N}\sum_{j=1}^{M} -\frac{I_{i,j}\alpha_i\beta_j\tau}{2}
\Big[r_{i,j}^2 - 2 r_{i,j} V_j^\top\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)
+ V_j^\top\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)^{\!\top} V_j\Big].
\tag{B.2}
\]
Expanding the quadratic in the final term and dropping terms independent of $W_m$, we obtain
\[
\sum_{i=1}^{N}\sum_{j=1}^{M} -\frac{I_{i,j}\alpha_i\beta_j\tau}{2}
\Big[-2 r_{i,j}\delta_W \frac{I_{i,m}}{n_i} V_j^\top W_m
+ 2\delta_U\delta_W \frac{I_{i,m}}{n_i} V_j^\top U_i\, W_m^\top V_j
+ \delta_W^2 V_j^\top \Big(\frac{I_{i,m} W_m}{n_i} + \frac{\sum_{k\neq m} I_{i,k} W_k}{n_i}\Big)\Big(\frac{I_{i,m} W_m}{n_i} + \frac{\sum_{k\neq m} I_{i,k} W_k}{n_i}\Big)^{\!\top} V_j\Big].
\tag{B.3}
\]
Note the sum over $W_k$ has been separated into the term involving $W_m$ and the sum over the other $W_k$, $k \neq m$.

Rearranging vectors to isolate the terms linear and quadratic in $W_m$,
\[
\sum_{i=1}^{N}\sum_{j=1}^{M} -\frac{I_{i,j}\alpha_i\beta_j\tau}{2}
\Big[-2\delta_W r_{i,j}\frac{I_{i,m}}{n_i} V_j^\top W_m
+ 2\delta_U\delta_W \frac{I_{i,m}}{n_i} U_i^\top V_j V_j^\top W_m
+ \delta_W^2 \Big(\frac{I_{i,m}}{n_i}\Big)^{\!2} W_m^\top V_j V_j^\top W_m
+ 2\delta_W^2 \frac{I_{i,m}}{n_i}\Big(\frac{\sum_{k\neq m} I_{i,k} W_k}{n_i}\Big)^{\!\top} V_j V_j^\top W_m\Big].
\tag{B.4}
\]
Adding the log prior $-\tfrac{1}{2}(W_m - \mu_w)^\top \Lambda_w (W_m - \mu_w)$ and grouping terms linear and quadratic in $W_m$, we obtain the system
\[
\begin{aligned}
\Lambda_{W_m} &= \Lambda_w + \delta_W^2 \tau \sum_{i=1}^{N}\sum_{j=1}^{M} \frac{I_{i,j} I_{i,m}\,\alpha_i\beta_j}{n_i^2}\, V_j V_j^\top \\
\mu_{W_m} &= \Lambda_{W_m}^{-1}\Big[\Lambda_w \mu_w
+ \tau \sum_{i=1}^{N}\sum_{j=1}^{M} I_{i,j}\alpha_i\beta_j
\Big(\delta_W \frac{I_{i,m}}{n_i}\, r_{i,j} V_j - \delta_U\delta_W \frac{I_{i,m}}{n_i}\, V_j V_j^\top U_i
- \delta_W^2 \frac{I_{i,m}}{n_i^2}\, V_j V_j^\top \sum_{k\neq m} I_{i,k} W_k\Big)\Big].
\end{aligned}
\tag{B.5}
\]
Rewriting,
\[
\mu_{W_m} = \Lambda_{W_m}^{-1}\Big[\Lambda_w \mu_w + \delta_W \tau \sum_{\substack{(i,j):\\ I_{i,j} I_{i,m} = 1}} \frac{\alpha_i\beta_j}{n_i}\, V_j
\Big((r_{i,j} - \delta_U V_j^\top U_i) - \delta_W V_j^\top \frac{\sum_{k\neq m} I_{i,k} W_k}{n_i}\Big)\Big].
\tag{B.6}
\]
This can be re-expressed in a more compact form by defining the prediction made without $W_m$ as
\[
\hat r_{i,j,-W_m} = \Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k\neq m} I_{i,k} W_k\Big)^{\!\top} V_j.
\]
The resulting term $V_j (r_{i,j} - \hat r_{i,j,-W_m})$ can be interpreted as the $j$th item feature vector weighted by the prediction error made when the $m$th side feature is omitted.

This shorthand allows Equation (B.6) to be expressed as
\[
\mu_{W_m} = \Lambda_{W_m}^{-1}\Big[\Lambda_w \mu_w + \delta_W \tau \sum_{\substack{(i,j):\\ I_{i,j} I_{i,m} = 1}} \frac{\alpha_i\beta_j}{n_i}\, V_j\,(r_{i,j} - \hat r_{i,j,-W_m})\Big].
\tag{B.7}
\]
B.0.6 Conditional Posterior for User Feature
With the inclusion of side features $W_k$, the log posterior for $U_i$ becomes
\[
\log p(U_i \mid \cdots) = -\frac{\tau\alpha_i}{2}\sum_{j=1}^{M} I_{i,j}\beta_j (r_{i,j} - \hat r_{i,j})^2
- \frac{1}{2}(U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U) + \text{const}.
\tag{B.8}
\]
Expanding the squared term as in Section A.1.1 yields
\[
(r_{i,j} - \hat r_{i,j})^2
= \Big(r_{i,j} - \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k^\top V_j\Big)^{\!2}
- 2\delta_U \Big(r_{i,j} - \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k^\top V_j\Big) V_j^\top U_i
+ \delta_U^2\, U_i^\top V_j V_j^\top U_i.
\tag{B.9}
\]
Plugging into Equation (B.8) and dropping terms not involving $U_i$ yields
\[
\log p(U_i \mid \cdots) = -\frac{\tau\alpha_i}{2}\sum_{j=1}^{M} I_{i,j}\beta_j
\Big[-2\delta_U \Big(r_{i,j} - \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k^\top V_j\Big) V_j^\top U_i
+ \delta_U^2\, U_i^\top V_j V_j^\top U_i\Big]
- \frac{1}{2}(U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U) + \text{const}.
\tag{B.10}
\]
This shows that the conditional posterior for $U_i$ is Gaussian with parameters
\[
\begin{aligned}
\Lambda_{U_i} &= \Lambda_U + \delta_U^2\, \tau\alpha_i \sum_{j=1}^{M} I_{i,j}\beta_j V_j V_j^\top \\
\mu_{U_i} &= \Lambda_{U_i}^{-1}\Big[\Lambda_U \mu_U + \delta_U\, \tau\alpha_i \sum_{j=1}^{M} I_{i,j}\beta_j V_j
\Big(r_{i,j} - \delta_W V_j^\top \frac{\sum_{k=1}^{M} I_{i,k} W_k}{n_i}\Big)\Big].
\end{aligned}
\tag{B.11}
\]
Note that the inclusion of side information affects only the mean, not the precision.
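The Gibbs update in Equation (B.11) can be sanity-checked numerically: the posterior mean must be the stationary point of the log conditional density. A Python/NumPy sketch (illustrative only; the vector `c` stands in for the precomputed side-feature term $\delta_W V_j^\top (\sum_k I_{i,k} W_k)/n_i$, and all inputs are randomly generated):

```python
import numpy as np

rng = np.random.default_rng(3)
d, M = 4, 6
tau, alpha_i = 1.3, 0.8
delta_u = 0.7
I = rng.integers(0, 2, size=M).astype(float); I[0] = 1.0   # at least one rating
beta = rng.gamma(2.0, 1.0, size=M)
V = rng.standard_normal((M, d))
r = rng.standard_normal(M)
c = rng.standard_normal(M)       # precomputed side-feature contribution per item
mu_u = rng.standard_normal(d)
Lam_u = 2.0 * np.eye(d)

# Conditional posterior parameters from Equation (B.11).
Lam_post = Lam_u + delta_u**2 * tau * alpha_i * np.einsum('j,j,jp,jq->pq', I, beta, V, V)
mu_post = np.linalg.solve(
    Lam_post,
    Lam_u @ mu_u + delta_u * tau * alpha_i * ((I * beta * (r - c))[:, None] * V).sum(0))

# The mean must zero the gradient of the log conditional density.
def log_post(u):
    resid = r - delta_u * V @ u - c
    return (-0.5 * tau * alpha_i * np.sum(I * beta * resid**2)
            - 0.5 * (u - mu_u) @ Lam_u @ (u - mu_u))

eps = 1e-5
grad = np.array([(log_post(mu_post + eps * e) - log_post(mu_post - eps * e)) / (2 * eps)
                 for e in np.eye(d)])
assert np.allclose(grad, 0.0, atol=1e-5)
```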
Appendix C
Distributional form of the
Variational Approximation
In this appendix, we derive the optimal variational distributions under the mean field approximation of Equation (2.23). The sections are as follows:

• In Section C.1, we derive the optimal variational distribution for the user features;
• In Section C.2, we derive the optimal variational distribution for the user offsets;
• In Section C.3, we derive the optimal variational distribution for the item features;
• In Section C.4, we derive the optimal variational distribution for the side features;
• In Section C.5, we derive the optimal variational distribution for the user, item, and global precisions;
• In Section C.6, we derive the optimal variational distribution for the user hyperparameters. By symmetry, the results for the item and side hyperparameters follow immediately.
C.1 User Feature Vectors
For the user feature vectors, the terms involving $U_i$ are the conditional likelihood of the ratings $r_{i,j}$ and the prior for the feature vector $U_i$. We then have
\[
\begin{aligned}
&\sum_{j=1}^{M} I_{i,j} \log p(r_{i,j} \mid U_i, V_j, W_{1:M}, \alpha_i, \beta_j, \tau) + \log p(U_i \mid \mu_U, \Lambda_U) \\
&= -\frac{\tau\alpha_i}{2}\sum_{j=1}^{M} I_{i,j}\beta_j (r_{i,j} - \hat r_{i,j})^2
+ \frac{1}{2}\log|\Lambda_U| - \frac{1}{2}(U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U) + \text{const} \\
&= -\frac{\tau\alpha_i}{2}\sum_{j=1}^{M} I_{i,j}\beta_j
\Big(-2\delta_U\Big(r_{i,j} - \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k^\top V_j\Big) V_j^\top U_i
+ \delta_U^2\, U_i^\top V_j V_j^\top U_i\Big) \\
&\qquad - \frac{1}{2}(U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U) + \text{const}.
\end{aligned}
\tag{C.1}
\]
This shows the variational distribution for $U_i$ is Gaussian, with parameters
\[
\begin{aligned}
\mu_{U_i} &= \Lambda_{U_i}^{-1}\Big[\Lambda_U \mu_U + \delta_U\, \tau\alpha_i \sum_{j=1}^{M} I_{i,j}\beta_j V_j
\Big(r_{i,j} - \delta_W V_j^\top \frac{\sum_{k=1}^{M} I_{i,k} W_k}{n_i}\Big)\Big] \\
\Lambda_{U_i} &= \Lambda_U + \delta_U^2\, \tau\alpha_i \sum_{j=1}^{M} I_{i,j}\beta_j V_j V_j^\top.
\end{aligned}
\tag{C.2}
\]
C.2 User Offset
If we include a user offset $\gamma_i$, then the relevant terms are
\[
\begin{aligned}
&\sum_{j=1}^{M} I_{i,j} \log p(r_{i,j} \mid \cdots) + \log p(\gamma_i \mid \mu_\gamma, \lambda_\gamma) \\
&= -\frac{\tau\alpha_i}{2}\sum_{j=1}^{M} I_{i,j}\beta_j (r_{i,j} - \hat r_{i,j})^2
+ \frac{1}{2}\log\lambda_\gamma - \frac{\lambda_\gamma}{2}(\gamma_i - \mu_\gamma)^2 + \text{const}.
\end{aligned}
\tag{C.3}
\]
The quadratic $(r_{i,j} - \hat r_{i,j})^2$ can be rewritten as
\[
\begin{aligned}
(r_{i,j} - \hat r_{i,j})^2 &= (r_{i,j} - \gamma_i - \eta_j - S_i^\top V_j)^2 \\
&= \gamma_i^2 - 2\gamma_i (r_{i,j} - \eta_j - S_i^\top V_j) + (r_{i,j} - \eta_j - S_i^\top V_j)^2,
\end{aligned}
\]
where $S_i$ denotes the combined user and side feature vector. Inserting into Equation (C.3), taking expectations, and retaining only the terms involving $\gamma_i$, we find that the optimal distribution for $\gamma_i$ is univariate Gaussian, with parameters
\[
\begin{aligned}
\lambda_{\gamma_i} &= \tau\alpha_i \sum_{j=1}^{M} I_{i,j}\beta_j + \lambda_\gamma \\
\mu_{\gamma_i} &= \lambda_{\gamma_i}^{-1}\Big[\lambda_\gamma \mu_\gamma + \tau\alpha_i \sum_{j=1}^{M} I_{i,j}\beta_j (r_{i,j} - \eta_j - S_i^\top V_j)\Big].
\end{aligned}
\tag{C.4}
\]
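The offset update in Equation (C.4) admits the same stationarity check as the feature updates. A Python/NumPy sketch (illustrative; `sv` stands in for the precomputed inner products $S_i^\top V_j$):

```python
import numpy as np

rng = np.random.default_rng(4)
M = 8
tau, alpha_i = 1.5, 0.9
I = rng.integers(0, 2, size=M).astype(float); I[0] = 1.0
beta = rng.gamma(2.0, 1.0, size=M)
r = rng.standard_normal(M)
eta = rng.standard_normal(M)     # item offsets
sv = rng.standard_normal(M)      # precomputed S_i' V_j for each item j
mu_g, lam_g = 0.1, 2.0           # prior mean / precision for gamma_i

# Updates from Equation (C.4).
lam_post = tau * alpha_i * np.sum(I * beta) + lam_g
mu_post = (lam_g * mu_g + tau * alpha_i * np.sum(I * beta * (r - eta - sv))) / lam_post

# mu_post must maximize the gamma_i-dependent part of the expected log joint.
def obj(g):
    return (-0.5 * tau * alpha_i * np.sum(I * beta * (r - g - eta - sv)**2)
            - 0.5 * lam_g * (g - mu_g)**2)

eps = 1e-6
grad = (obj(mu_post + eps) - obj(mu_post - eps)) / (2 * eps)
assert abs(grad) < 1e-6
assert lam_post > lam_g   # observing ratings can only increase the precision
```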
C.3 Item Feature Vectors
By symmetry, the terms involving $V_j$ are
\[
\begin{aligned}
&\sum_{i=1}^{N} I_{i,j} \log p(r_{i,j} \mid U_i, V_j, W_{1:M}, \alpha_i, \beta_j, \tau) + \log p(V_j \mid \mu_V, \Lambda_V) \\
&= -\frac{\tau\beta_j}{2}\sum_{i=1}^{N} I_{i,j}\alpha_i (r_{i,j} - \hat r_{i,j})^2
+ \frac{1}{2}\log|\Lambda_V| - \frac{1}{2}(V_j - \mu_V)^\top \Lambda_V (V_j - \mu_V) + \text{const} \\
&= -\frac{\tau\beta_j}{2}\sum_{i=1}^{N} I_{i,j}\alpha_i
\Big[-2 r_{i,j}\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)^{\!\top} V_j
+ V_j^\top \Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)^{\!\top} V_j\Big] \\
&\qquad - \frac{1}{2}(V_j - \mu_V)^\top \Lambda_V (V_j - \mu_V) + \text{const}.
\end{aligned}
\tag{C.5}
\]
This shows the variational distribution for $V_j$ is Gaussian, with parameters
\[
\begin{aligned}
\mu_{V_j} &= \Lambda_{V_j}^{-1}\Big[\Lambda_V \mu_V + \tau\beta_j \sum_{i=1}^{N} I_{i,j}\alpha_i\, r_{i,j}\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)\Big] \\
\Lambda_{V_j} &= \Lambda_V + \tau\beta_j \sum_{i=1}^{N} I_{i,j}\alpha_i \Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)^{\!\top}.
\end{aligned}
\tag{C.6}
\]
C.4 Side Feature Vectors
The terms involving $W_m$ are
\[
\begin{aligned}
&\sum_{i=1}^{N}\sum_{j=1}^{M} I_{i,j} \log p(r_{i,j} \mid U_i, V_j, W_{1:M}, \alpha_i, \beta_j, \tau) + \log p(W_m \mid \mu_W, \Lambda_W) \\
&= -\frac{\tau}{2}\sum_{i=1}^{N}\sum_{j=1}^{M} I_{i,j}\alpha_i\beta_j (r_{i,j} - \hat r_{i,j})^2
- \frac{1}{2}(W_m - \mu_W)^\top \Lambda_W (W_m - \mu_W) + \text{const} \\
&= -\frac{\tau}{2}\sum_{i=1}^{N}\sum_{j=1}^{M} I_{i,j}\alpha_i\beta_j
\Big(-2\frac{\delta_W}{n_i} I_{i,m}\, (r_{i,j} - \hat r_{i,j,-W_m})\, V_j^\top W_m
+ \frac{\delta_W^2}{n_i^2} I_{i,m}\, W_m^\top V_j V_j^\top W_m\Big) \\
&\qquad - \frac{1}{2}(W_m - \mu_W)^\top \Lambda_W (W_m - \mu_W) + \text{const},
\end{aligned}
\tag{C.7}
\]
where $\hat r_{i,j,-W_m} = \big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k\neq m} I_{i,k} W_k\big)^\top V_j$ denotes the prediction made without $W_m$. This shows the variational distribution for $W_m$ is Gaussian with parameters
\[
\begin{aligned}
\mu_{W_m} &= \Lambda_{W_m}^{-1}\Big[\Lambda_W \mu_W + \delta_W \tau \sum_{\substack{(i,j):\\ I_{i,j} I_{i,m}=1}} \frac{\alpha_i\beta_j}{n_i}\, V_j\, (r_{i,j} - \hat r_{i,j,-W_m})\Big] \\
\Lambda_{W_m} &= \Lambda_W + \delta_W^2 \tau \sum_{\substack{(i,j):\\ I_{i,j} I_{i,m}=1}} \frac{\alpha_i\beta_j}{n_i^2}\, V_j V_j^\top.
\end{aligned}
\tag{C.8}
\]
Note the product of the two indicators $I_{i,j} I_{i,m}$: the sums range over the users $i$ who rated item $m$ and, for each such user, over the items $j$ that they rated.
C.5 Precisions
The terms involving $\alpha_i$ are
\[
\begin{aligned}
&\sum_{j=1}^{M} I_{i,j} \log p(r_{i,j} \mid U_i, V_j, W_{1:M}, \alpha_i, \beta_j, \tau) + \log p(\alpha_i \mid a_U, b_U) \\
&= \frac{1}{2}\sum_{j=1}^{M} I_{i,j} \log\alpha_i
- \frac{\tau}{2}\sum_{j=1}^{M} I_{i,j}\beta_j (r_{i,j} - \hat r_{i,j})^2\, \alpha_i
+ (a_U - 1)\log\alpha_i - b_U\alpha_i + \text{const}.
\end{aligned}
\tag{C.9}
\]
This shows the variational distribution for $\alpha_i$ is Gamma, with parameters
\[
\begin{aligned}
a_{U_i} &= a_U + \frac{1}{2}\sum_{j=1}^{M} I_{i,j} \\
b_{U_i} &= b_U + \frac{\tau}{2}\sum_{j=1}^{M} I_{i,j}\beta_j (r_{i,j} - \hat r_{i,j})^2.
\end{aligned}
\tag{C.10}
\]
Identical derivations show the variational distributions for $\beta_j$ and $\tau$ are Gamma, with parameters
\[
\begin{aligned}
a_{V_j} &= a_V + \frac{1}{2}\sum_{i=1}^{N} I_{i,j}, &
b_{V_j} &= b_V + \frac{\tau}{2}\sum_{i=1}^{N} I_{i,j}\alpha_i (r_{i,j} - \hat r_{i,j})^2, \\
\tilde a_\tau &= a_\tau + \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{M} I_{i,j}, &
\tilde b_\tau &= b_\tau + \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{M} I_{i,j}\alpha_i\beta_j (r_{i,j} - \hat r_{i,j})^2.
\end{aligned}
\tag{C.11}
\]
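The precision updates are purely mechanical; a small Python/NumPy sketch of the per-user update in Equation (C.10) (illustrative function and variable names):

```python
import numpy as np

def gamma_precision_update(a_u, b_u, tau, I, beta, sq_err):
    """Variational Gamma update for a per-user precision alpha_i (Equation C.10).

    I       : (M,) 0/1 indicator of which items user i rated
    beta    : (M,) current per-item precision estimates
    sq_err  : (M,) expected squared residuals E[(r_ij - rhat_ij)^2]
    """
    a_post = a_u + 0.5 * I.sum()
    b_post = b_u + 0.5 * tau * np.sum(I * beta * sq_err)
    return a_post, b_post

I = np.array([1.0, 0.0, 1.0, 1.0])
beta = np.array([0.5, 2.0, 1.0, 1.5])
sq = np.array([0.2, 9.9, 0.1, 0.4])   # the unrated item's residual is ignored
a_post, b_post = gamma_precision_update(a_u=2.0, b_u=1.0, tau=2.0, I=I, beta=beta, sq_err=sq)

assert a_post == 2.0 + 1.5            # three observed ratings contribute 3/2
assert np.isclose(b_post, 1.0 + 0.5 * 2.0 * (0.5*0.2 + 1.0*0.1 + 1.5*0.4))
```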
C.6 User / Item / Side Feature Hyperparameters
The terms involving the user hyperparameters $(\mu_U, \Lambda_U)$ are
\[
\begin{aligned}
&\sum_{i=1}^{N} \log p(U_i \mid \mu_U, \Lambda_U) + \log p(\mu_U \mid \mu_0, \beta_0\Lambda_U) + \log p(\Lambda_U \mid \nu_0, W_0) \\
&= \frac{N}{2}\log|\Lambda_U| - \frac{1}{2}\sum_{i=1}^{N} (U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U) \\
&\quad + \frac{1}{2}\log|\Lambda_U| - \frac{\beta_0}{2}(\mu_U - \mu_0)^\top \Lambda_U (\mu_U - \mu_0) \\
&\quad + \frac{\nu_0 - d - 1}{2}\log|\Lambda_U| - \frac{1}{2}\operatorname{tr}\big(W_0^{-1}\Lambda_U\big) + \text{const}.
\end{aligned}
\tag{C.12}
\]
Using completion-of-the-square derivations present in the literature [6], the quadratic terms can be rearranged as
\[
\begin{aligned}
&\sum_{i=1}^{N} (U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U) + \beta_0(\mu_U - \mu_0)^\top \Lambda_U (\mu_U - \mu_0) \\
&= \operatorname{tr}\Big(\Big[\sum_{i=1}^{N} (U_i - \mu_U)(U_i - \mu_U)^\top + \beta_0(\mu_U - \mu_0)(\mu_U - \mu_0)^\top\Big]\Lambda_U\Big) \\
&= \operatorname{tr}\Big(\Big[N(\bar U - \mu_U)(\bar U - \mu_U)^\top + \sum_{i=1}^{N} (U_i - \bar U)(U_i - \bar U)^\top + \beta_0(\mu_U - \mu_0)(\mu_U - \mu_0)^\top\Big]\Lambda_U\Big) \\
&= \operatorname{tr}\Big(\Big[(N + \beta_0)(\mu_U - \bar\mu_U)(\mu_U - \bar\mu_U)^\top + \frac{N\beta_0}{N+\beta_0}(\bar U - \mu_0)(\bar U - \mu_0)^\top + \sum_{i=1}^{N} (U_i - \bar U)(U_i - \bar U)^\top\Big]\Lambda_U\Big),
\end{aligned}
\tag{C.13}
\]
where we have defined
\[
\bar\mu_U = \frac{N\bar U + \beta_0\mu_0}{N + \beta_0}.
\tag{C.14}
\]
We can now write the $(\mu_U, \Lambda_U)$ terms as
\[
\begin{aligned}
&\sum_{i=1}^{N} \log p(U_i \mid \mu_U, \Lambda_U) + \log p(\mu_U \mid \mu_0, \beta_0\Lambda_U) + \log p(\Lambda_U \mid \nu_0, W_0) \\
&= \frac{1}{2}\log|\Lambda_U| - \frac{1}{2}(\mu_U - \bar\mu_U)^\top \big[(N+\beta_0)\Lambda_U\big](\mu_U - \bar\mu_U) \\
&\quad + \frac{N + \nu_0 - d - 1}{2}\log|\Lambda_U| \\
&\quad - \frac{1}{2}\operatorname{tr}\Big(\Big[W_0^{-1} + \frac{N\beta_0}{N+\beta_0}(\bar U - \mu_0)(\bar U - \mu_0)^\top + \sum_{i=1}^{N} (U_i - \bar U)(U_i - \bar U)^\top\Big]\Lambda_U\Big) + \text{const}.
\end{aligned}
\tag{C.15}
\]
This shows the variational distribution for $(\mu_U, \Lambda_U)$ is a Normal-Wishart with parameters
\[
\begin{aligned}
\bar\mu_U &= \frac{N\bar U + \beta_0\mu_0}{N + \beta_0} \\
\bar\Lambda_U &= (N + \beta_0)\Lambda_U \\
\bar\nu_U &= N + \nu_0 \\
\bar W_U^{-1} &= W_0^{-1} + \frac{N\beta_0}{N+\beta_0}(\bar U - \mu_0)(\bar U - \mu_0)^\top + \sum_{i=1}^{N} (U_i - \bar U)(U_i - \bar U)^\top.
\end{aligned}
\tag{C.16}
\]
Analogous statements (with the appropriate sample sizes, feature averages, etc.) hold for the item and side feature vectors.
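The parameter updates in Equation (C.16) translate directly into code. A Python/NumPy sketch (illustrative function and variable names; `U` holds the current user feature means as rows):

```python
import numpy as np

def normal_wishart_update(U, mu0, beta0, nu0, W0):
    """Variational Normal-Wishart update for (mu_U, Lambda_U), Equation (C.16).

    U : (N, d) matrix whose rows are the current user feature means.
    Returns (mu_bar, beta_bar, nu_bar, W_bar_inv), where beta_bar = N + beta0
    is the multiplier on Lambda_U in the conditional precision of mu_U.
    """
    N, d = U.shape
    Ubar = U.mean(axis=0)
    S = (U - Ubar).T @ (U - Ubar)                      # scatter about the sample mean
    mu_bar = (N * Ubar + beta0 * mu0) / (N + beta0)
    beta_bar = N + beta0
    nu_bar = N + nu0
    diff = (Ubar - mu0)[:, None]
    W_bar_inv = np.linalg.inv(W0) + (N * beta0 / (N + beta0)) * (diff @ diff.T) + S
    return mu_bar, beta_bar, nu_bar, W_bar_inv

rng = np.random.default_rng(5)
U = rng.standard_normal((50, 3))
mu_bar, beta_bar, nu_bar, Winv = normal_wishart_update(
    U, mu0=np.zeros(3), beta0=2.0, nu0=4.0, W0=np.eye(3))

assert nu_bar == 54.0 and beta_bar == 52.0
assert np.allclose(Winv, Winv.T)                       # posterior scale stays symmetric
assert np.all(np.linalg.eigvalsh(Winv) > 0)            # and positive definite
```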
Appendix D
Derivation of the Variational Lower
Bound
Chapter 2 defined the basic matrix factorization model for collaborative filtering and discussed the extension of the constrained PMF model to the Bayesian framework. We outlined inference for the relevant parameters under both Gibbs sampling and a variational mean field approximation. Chapter 3 extended this to a collection of heteroskedastic models. In this appendix, we derive the variational lower bound for these models.

From Section 2.4.2, the lower bound takes the form
\[
\mathbb{E}_Q[\log p(\theta, \mathcal{D})] + H(Q),
\tag{D.1}
\]
where $\mathbb{E}_Q$ denotes the expectation under the variational approximation $Q$ (the first term is known as the expected complete log-likelihood), and $H(Q)$ denotes the entropy of the distribution.
D.1 Complete Log-Likelihood
From the definition of the model, the expected complete log-likelihood is
\[
\begin{aligned}
\mathbb{E}_Q[\log p(\theta, \mathcal{D})]
&= \mathbb{E}_Q[\log p(U_{1:N}, V_{1:M}, W_{1:M}, \alpha_{1:N}, \beta_{1:M}, \tau, \gamma_{1:N}, \eta_{1:M}, \mu_U, \Lambda_U, \mu_V, \Lambda_V, \mu_W, \Lambda_W, R)] \\
&= \sum_{i=1}^{N}\sum_{j=1}^{M} I_{i,j}\,\mathbb{E}_Q[\log p(r_{i,j} \mid U, V, W, \gamma, \eta, \alpha, \beta, \tau)] \\
&\quad + \sum_{i=1}^{N} \mathbb{E}_Q[\log p(U_i \mid \mu_U, \Lambda_U)]
+ \sum_{j=1}^{M} \mathbb{E}_Q[\log p(V_j \mid \mu_V, \Lambda_V)]
+ \sum_{k=1}^{M} \mathbb{E}_Q[\log p(W_k \mid \mu_W, \Lambda_W)] \\
&\quad + \sum_{i=1}^{N} \mathbb{E}_Q[\log p(\gamma_i \mid \mu_\gamma, \lambda_\gamma)]
+ \sum_{j=1}^{M} \mathbb{E}_Q[\log p(\eta_j \mid \mu_\eta, \lambda_\eta)] \\
&\quad + \sum_{i=1}^{N} \mathbb{E}_Q[\log p(\alpha_i \mid a_U, b_U)]
+ \sum_{j=1}^{M} \mathbb{E}_Q[\log p(\beta_j \mid a_V, b_V)]
+ \mathbb{E}_Q[\log p(\tau \mid a_\tau, b_\tau)] \\
&\quad + \mathbb{E}_Q[\log p(\mu_U, \Lambda_U \mid \mu_0, \beta_0, \nu_0, W_0)]
+ \mathbb{E}_Q[\log p(\mu_V, \Lambda_V \mid \mu_0, \beta_0, \nu_0, W_0)]
+ \mathbb{E}_Q[\log p(\mu_W, \Lambda_W \mid \mu_0, \beta_0, \nu_0, W_0)].
\end{aligned}
\tag{D.2}
\]
We will analyze each of the expectations in Equation (D.2) individually in the appropriately named sections that follow.
D.1.1 Rating
For the conditional density of the rating,
\[
\begin{aligned}
\mathbb{E}_Q\big[I_{i,j} \log p(r_{i,j} \mid r_{-(i,j)}, U_i, V_j, W_{1:M}, \alpha_i, \beta_j, \tau)\big]
&= \frac{I_{i,j}}{2}\,\mathbb{E}_Q\big[\log\alpha_i + \log\beta_j + \log\tau - \tau\alpha_i\beta_j (r_{i,j} - \hat r_{i,j})^2\big] \\
&= \frac{I_{i,j}}{2}\Big(\mathbb{E}_Q[\log\alpha_i] + \mathbb{E}_Q[\log\beta_j] + \mathbb{E}_Q[\log\tau]
- \mathbb{E}_Q[\tau]\,\mathbb{E}_Q[\alpha_i]\,\mathbb{E}_Q[\beta_j]\,\mathbb{E}_Q\big[(r_{i,j} - \hat r_{i,j})^2\big]\Big).
\end{aligned}
\tag{D.3}
\]
Expanding the quadratic,
\[
\mathbb{E}_Q[(r_{i,j} - \hat r_{i,j})^2]
= \mathbb{E}_Q[r_{i,j}^2 - 2 r_{i,j}\hat r_{i,j} + \hat r_{i,j}^2]
= r_{i,j}^2 - 2 r_{i,j}\,\mathbb{E}_Q[\hat r_{i,j}] + \mathbb{E}_Q[\hat r_{i,j}^2].
\tag{D.4}
\]
Linearity of expectation and the independence assumptions of the variational approximation give a simple result for the first-order term,
\[
\begin{aligned}
-2 r_{i,j}\,\mathbb{E}_Q[\hat r_{i,j}]
&= -2 r_{i,j}\,\mathbb{E}_Q\Big[\Big(\delta_U U_i + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k} W_k\Big)^{\!\top} V_j\Big] \\
&= -2 r_{i,j}\Big(\delta_U\,\mathbb{E}_Q[U_i]^\top \mathbb{E}_Q[V_j] + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k}\,\mathbb{E}_Q[W_k]^\top \mathbb{E}_Q[V_j]\Big) \\
&= -2 r_{i,j}\Big(\delta_U\,\mu_{U_i}^\top \mu_{V_j} + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k}\,\mu_{W_k}^\top \mu_{V_j}\Big).
\end{aligned}
\tag{D.5}
\]
For the second moment, expanding the square leads to three additional terms,
\[
\mathbb{E}_Q[\hat r_{i,j}^2]
= \delta_U^2\,\mathbb{E}_Q\big[V_j^\top U_i U_i^\top V_j\big]
+ 2\frac{\delta_U\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k}\,\mathbb{E}_Q\big[V_j^\top U_i W_k^\top V_j\big]
+ \Big(\frac{\delta_W}{n_i}\Big)^{\!2}\,\mathbb{E}_Q\Big[\Big(V_j^\top \sum_{k=1}^{M} I_{i,k} W_k\Big)\Big(\sum_{\ell=1}^{M} I_{i,\ell} W_\ell^\top V_j\Big)\Big].
\tag{D.6}
\]
For the first, involving only the user and item features,
\[
\begin{aligned}
\delta_U^2\,\mathbb{E}_Q\big[V_j^\top U_i U_i^\top V_j\big]
&= \delta_U^2\,\operatorname{tr}\big(\mathbb{E}_Q[V_j V_j^\top U_i U_i^\top]\big)
= \delta_U^2\,\operatorname{tr}\big(\mathbb{E}_Q[V_j V_j^\top]\,\mathbb{E}_Q[U_i U_i^\top]\big) \\
&= \delta_U^2\,\operatorname{tr}\big((\operatorname{Var}_Q[V_j] + \mathbb{E}_Q[V_j]\mathbb{E}_Q[V_j]^\top)(\operatorname{Var}_Q[U_i] + \mathbb{E}_Q[U_i]\mathbb{E}_Q[U_i]^\top)\big) \\
&= \delta_U^2\,\operatorname{tr}\big((\Lambda_{V_j}^{-1} + \mu_{V_j}\mu_{V_j}^\top)(\Lambda_{U_i}^{-1} + \mu_{U_i}\mu_{U_i}^\top)\big).
\end{aligned}
\tag{D.7}
\]
For the term involving user, item, and side features,
\[
\begin{aligned}
2\frac{\delta_U\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k}\,\mathbb{E}_Q\big[V_j^\top U_i W_k^\top V_j\big]
&= 2\frac{\delta_U\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k}\,\operatorname{tr}\big(\mathbb{E}_Q[V_j V_j^\top U_i W_k^\top]\big) \\
&= 2\frac{\delta_U\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k}\,\operatorname{tr}\big(\mathbb{E}_Q[V_j V_j^\top]\,\mathbb{E}_Q[U_i]\,\mathbb{E}_Q[W_k]^\top\big) \\
&= 2\frac{\delta_U\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k}\,\operatorname{tr}\big((\Lambda_{V_j}^{-1} + \mu_{V_j}\mu_{V_j}^\top)\,\mu_{U_i}\mu_{W_k}^\top\big).
\end{aligned}
\tag{D.8}
\]
For the final term, involving only item and side features,
\[
\begin{aligned}
\Big(\frac{\delta_W}{n_i}\Big)^{\!2}\,\mathbb{E}_Q\Big[\Big(V_j^\top \sum_{k=1}^{M} I_{i,k} W_k\Big)\Big(\sum_{\ell=1}^{M} I_{i,\ell} W_\ell^\top V_j\Big)\Big]
&= \Big(\frac{\delta_W}{n_i}\Big)^{\!2}\,\mathbb{E}_Q\Big[V_j^\top\Big(\sum_{k=1}^{M} I_{i,k} W_k W_k^\top + \sum_{k\neq\ell} I_{i,k} I_{i,\ell} W_k W_\ell^\top\Big)V_j\Big] \\
&= \Big(\frac{\delta_W}{n_i}\Big)^{\!2}\Big(\sum_{k=1}^{M} I_{i,k}\,\operatorname{tr}\big(\mathbb{E}_Q[V_j V_j^\top]\,\mathbb{E}_Q[W_k W_k^\top]\big)
+ \sum_{k\neq\ell} I_{i,k} I_{i,\ell}\,\operatorname{tr}\big(\mathbb{E}_Q[V_j V_j^\top]\,\mathbb{E}_Q[W_k]\,\mathbb{E}_Q[W_\ell]^\top\big)\Big) \\
&= \Big(\frac{\delta_W}{n_i}\Big)^{\!2}\Big(\sum_{k=1}^{M} I_{i,k}\,\operatorname{tr}\big((\Lambda_{V_j}^{-1} + \mu_{V_j}\mu_{V_j}^\top)(\Lambda_{W_k}^{-1} + \mu_{W_k}\mu_{W_k}^\top)\big)
+ \sum_{k\neq\ell} I_{i,k} I_{i,\ell}\,\operatorname{tr}\big((\Lambda_{V_j}^{-1} + \mu_{V_j}\mu_{V_j}^\top)\,\mu_{W_k}\mu_{W_\ell}^\top\big)\Big).
\end{aligned}
\tag{D.9}
\]
Combining Equations (D.7), (D.8), and (D.9) with the first-order term in Equation (D.5) yields
\[
\begin{aligned}
\mathbb{E}_Q[(r_{i,j} - \hat r_{i,j})^2]
&= r_{i,j}^2
- 2 r_{i,j}\Big(\delta_U\,\mu_{U_i}^\top \mu_{V_j} + \frac{\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k}\,\mu_{W_k}^\top \mu_{V_j}\Big)
+ \delta_U^2\,\operatorname{tr}\big((\Lambda_{V_j}^{-1} + \mu_{V_j}\mu_{V_j}^\top)(\Lambda_{U_i}^{-1} + \mu_{U_i}\mu_{U_i}^\top)\big) \\
&\quad + 2\frac{\delta_U\delta_W}{n_i}\sum_{k=1}^{M} I_{i,k}\,\operatorname{tr}\big((\Lambda_{V_j}^{-1} + \mu_{V_j}\mu_{V_j}^\top)\,\mu_{U_i}\mu_{W_k}^\top\big) \\
&\quad + \Big(\frac{\delta_W}{n_i}\Big)^{\!2}\Big[\sum_{k=1}^{M} I_{i,k}\,\operatorname{tr}\big((\Lambda_{V_j}^{-1} + \mu_{V_j}\mu_{V_j}^\top)(\Lambda_{W_k}^{-1} + \mu_{W_k}\mu_{W_k}^\top)\big)
+ \sum_{k\neq\ell} I_{i,k} I_{i,\ell}\,\operatorname{tr}\big((\Lambda_{V_j}^{-1} + \mu_{V_j}\mu_{V_j}^\top)\,\mu_{W_k}\mu_{W_\ell}^\top\big)\Big].
\end{aligned}
\tag{D.10}
\]
Combining with the precision factors in Equation (D.3) and summing over all ratings yields
\[
\begin{aligned}
\sum_{i=1}^{N}\sum_{j=1}^{M}\mathbb{E}_Q\big[I_{i,j} \log p(r_{i,j} \mid r_{-(i,j)}, U_i, V_j, W_{1:M}, \alpha_i, \beta_j, \tau)\big]
= \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{M} I_{i,j}\Big(&\big[\psi(a_{U_i}) - \log b_{U_i} + \psi(a_{V_j}) - \log b_{V_j} + \psi(\tilde a_\tau) - \log\tilde b_\tau\big] \\
&- \frac{\tilde a_\tau}{\tilde b_\tau}\,\frac{a_{U_i}}{b_{U_i}}\,\frac{a_{V_j}}{b_{V_j}}\,\mathbb{E}_Q[(r_{i,j} - \hat r_{i,j})^2]\Big),
\end{aligned}
\tag{D.11}
\]
with $\mathbb{E}_Q[(r_{i,j} - \hat r_{i,j})^2]$ given by Equation (D.10).
D.1.2 User Features
For the conditional density of the user latent features,
\[
\mathbb{E}_Q[\log p(U_i \mid \mu_U, \Lambda_U)]
= \frac{1}{2}\,\mathbb{E}_Q[\log|\Lambda_U|] - \frac{1}{2}\,\mathbb{E}_Q[(U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U)] + \text{const}.
\tag{D.12}
\]
For the quadratic form, we use conditional expectation, as $(\mu_U, \Lambda_U)$ is jointly Normal-Wishart under the variational approximation and hence the two are not independent:
\[
\mathbb{E}_Q[(U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U)]
= \mathbb{E}_Q[U_i^\top \Lambda_U U_i] - 2\,\mathbb{E}_Q[U_i]^\top\mathbb{E}_Q[\Lambda_U \mu_U] + \mathbb{E}_Q[\mu_U^\top \Lambda_U \mu_U].
\tag{D.13}
\]
Using the trace on the first term yields
\[
\begin{aligned}
\mathbb{E}_Q[U_i^\top \Lambda_U U_i]
&= \operatorname{tr}\big(\mathbb{E}_Q[\Lambda_U U_i U_i^\top]\big)
= \operatorname{tr}\big(\mathbb{E}_Q[\Lambda_U]\,\mathbb{E}_Q[U_i U_i^\top]\big) \\
&= \operatorname{tr}\big(\mathbb{E}_Q[\Lambda_U]\,(\operatorname{Var}_Q[U_i] + \mathbb{E}_Q[U_i]\mathbb{E}_Q[U_i]^\top)\big)
= \operatorname{tr}\big(\bar\nu_U \bar W_U\,(\Lambda_{U_i}^{-1} + \mu_{U_i}\mu_{U_i}^\top)\big).
\end{aligned}
\tag{D.14}
\]
Iterated expectation on the second gives
\[
\mathbb{E}_Q[U_i]^\top \mathbb{E}_Q[\Lambda_U \mu_U]
= \mathbb{E}_Q[U_i]^\top \mathbb{E}_Q\big[\Lambda_U\,\mathbb{E}_Q[\mu_U \mid \Lambda_U]\big]
= \bar\nu_U\, \mu_{U_i}^\top \bar W_U \bar\mu_U,
\tag{D.15}
\]
while both techniques applied to the third yield
\[
\begin{aligned}
\mathbb{E}_Q[\mu_U^\top \Lambda_U \mu_U]
&= \operatorname{tr}\big(\mathbb{E}_Q[\Lambda_U \mu_U \mu_U^\top]\big)
= \operatorname{tr}\big(\mathbb{E}_Q\big[\Lambda_U\,\mathbb{E}_Q[\mu_U\mu_U^\top \mid \Lambda_U]\big]\big) \\
&= \operatorname{tr}\big(\mathbb{E}_Q\big[\Lambda_U\,(\operatorname{Var}_Q[\mu_U \mid \Lambda_U] + \bar\mu_U\bar\mu_U^\top)\big]\big)
= \bar\nu_U\,\bar\mu_U^\top \bar W_U \bar\mu_U + \frac{d}{N + \beta_0},
\end{aligned}
\tag{D.16}
\]
since $\operatorname{Var}_Q[\mu_U \mid \Lambda_U] = \big((N+\beta_0)\Lambda_U\big)^{-1}$. This simplifies to
\[
\mathbb{E}_Q[(U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U)]
= \bar\nu_U\Big[(\mu_{U_i} - \bar\mu_U)^\top \bar W_U (\mu_{U_i} - \bar\mu_U)
+ \operatorname{tr}\big(\bar W_U \Lambda_{U_i}^{-1}\big)\Big] + \frac{d}{N+\beta_0}.
\tag{D.17}
\]
The log-precision expectation gives
\[
\mathbb{E}_Q[\log|\Lambda_U|] = \sum_{i=1}^{d} \psi\Big(\frac{\bar\nu_U + 1 - i}{2}\Big) + d\log 2 + \log|\bar W_U|.
\tag{D.18}
\]
Combining Equations (D.13)–(D.18) and dividing by two gives the contribution to the variational lower bound from the user features,
\[
\mathbb{E}_Q[\log p(U_i \mid \mu_U, \Lambda_U)]
= \frac{1}{2}\Big[\sum_{i=1}^{d} \psi\Big(\frac{\bar\nu_U + 1 - i}{2}\Big) + d\log 2 + \log|\bar W_U|
- \bar\nu_U\Big((\mu_{U_i} - \bar\mu_U)^\top \bar W_U (\mu_{U_i} - \bar\mu_U)
+ \operatorname{tr}\big(\bar W_U \Lambda_{U_i}^{-1}\big)\Big)\Big] + \text{const}.
\tag{D.19}
\]
D.1.3 User Precision
For the conditional density of the user precision,
\[
\begin{aligned}
\mathbb{E}_Q[\log p(\alpha_i \mid a_U, b_U)]
&= \mathbb{E}_Q[a_U \log b_U - \log\Gamma(a_U) + (a_U - 1)\log\alpha_i - b_U\alpha_i] \\
&= C + (a_U - 1)\,\mathbb{E}_Q[\log\alpha_i] - b_U\,\mathbb{E}_Q[\alpha_i] \\
&= C + (a_U - 1)\big(\psi(a_{U_i}) - \log b_{U_i}\big) - b_U\,\frac{a_{U_i}}{b_{U_i}},
\end{aligned}
\tag{D.20}
\]
where $\psi(\cdot) = \frac{d}{dx}\log\Gamma(x)$ is the digamma function.
D.1.4 User Bias
For the user bias $\gamma_i$, the contribution to the variational lower bound is
\[
\begin{aligned}
\mathbb{E}_Q[\log p(\gamma_i)]
&= \frac{1}{2}\,\mathbb{E}_Q[\log\lambda_\gamma] - \frac{\lambda_\gamma}{2}\,\mathbb{E}_Q[(\gamma_i - \mu_\gamma)^2] \\
&= \frac{1}{2}\log\lambda_\gamma - \frac{\lambda_\gamma}{2}\big[\operatorname{Var}_Q[\gamma_i] + (\mathbb{E}_Q[\gamma_i] - \mu_\gamma)^2\big] \\
&= \frac{1}{2}\log\lambda_\gamma - \frac{\lambda_\gamma}{2}\big[\lambda_{\gamma_i}^{-1} + (\mu_{\gamma_i} - \mu_\gamma)^2\big].
\end{aligned}
\tag{D.21}
\]
The item bias contributions are analogous.
D.1.5 User Hyperparameters

For the conditional density of the user hyperparameters $(\mu_U, \Lambda_U)$, the contribution to the variational lower bound is
\[
\mathbb{E}_Q[\log p(\mu_U, \Lambda_U)]
= \mathbb{E}_Q[\log p(\mu_U \mid \mu_0, \beta_0\Lambda_U)] + \mathbb{E}_Q[\log p(\Lambda_U \mid \nu_0, W_0)].
\tag{D.22}
\]
The first term contains a factor of $\log|\Lambda_U|$, whose expectation was derived in Equation (D.18), and the quadratic with respect to $\mu_U$. For the quadratic term, we rearrange under the trace to obtain
\[
\begin{aligned}
\mathbb{E}_Q[(\mu_U - \mu_0)^\top \beta_0\Lambda_U (\mu_U - \mu_0)]
&= \beta_0\,\mathbb{E}_Q\big[\operatorname{tr}\big(\Lambda_U (\mu_U - \mu_0)(\mu_U - \mu_0)^\top\big)\big] \\
&= \beta_0\,\operatorname{tr}\big(\mathbb{E}_Q[\Lambda_U]\,\mathbb{E}_Q[(\mu_U - \mu_0)(\mu_U - \mu_0)^\top]\big) \\
&= \beta_0\,\operatorname{tr}\big(\mathbb{E}_Q[\Lambda_U]\,\big(\operatorname{Var}_Q[\mu_U] + (\mathbb{E}_Q[\mu_U] - \mu_0)(\mathbb{E}_Q[\mu_U] - \mu_0)^\top\big)\big) \\
&= \beta_0\,\operatorname{tr}\big(\bar\nu_U \bar W_U\,\big(\bar\Lambda_U^{-1} + (\bar\mu_U - \mu_0)(\bar\mu_U - \mu_0)^\top\big)\big) \\
&= \bar\nu_U\beta_0\big(\operatorname{tr}(\bar W_U \bar\Lambda_U^{-1}) + (\bar\mu_U - \mu_0)^\top \bar W_U (\bar\mu_U - \mu_0)\big).
\end{aligned}
\tag{D.23}
\]
Subtracting Equation (D.23) from Equation (D.18) and dividing by two gives the contribution to the lower bound from the conditional distribution of the user latent feature mean, the first term in Equation (D.22):
\[
\mathbb{E}_Q[\log p(\mu_U \mid \mu_0, \beta_0\Lambda_U)]
= \frac{1}{2}\Big[\sum_{i=1}^{d} \psi\Big(\frac{\bar\nu_U + 1 - i}{2}\Big) + d\log 2 + \log|\bar W_U|
- \bar\nu_U\beta_0\big(\operatorname{tr}(\bar W_U \bar\Lambda_U^{-1}) + (\bar\mu_U - \mu_0)^\top \bar W_U (\bar\mu_U - \mu_0)\big)\Big] + \text{const}.
\tag{D.24}
\]
For the second term in Equation (D.22), the Wishart on the user precision matrix, we have
\[
\begin{aligned}
\mathbb{E}_Q[\log p(\Lambda_U \mid W_0, \nu_0)]
&= \frac{\nu_0 - d - 1}{2}\,\mathbb{E}_Q[\log|\Lambda_U|] - \frac{1}{2}\operatorname{tr}\big(W_0^{-1}\,\mathbb{E}_Q[\Lambda_U]\big) + \text{const} \\
&= \frac{\nu_0 - d - 1}{2}\Big[\sum_{i=1}^{d} \psi\Big(\frac{\bar\nu_U + 1 - i}{2}\Big) + d\log 2 + \log|\bar W_U|\Big]
- \frac{\bar\nu_U}{2}\operatorname{tr}\big(W_0^{-1}\bar W_U\big) + \text{const}.
\end{aligned}
\tag{D.25}
\]
Combining Equations (D.24) and (D.25) yields the contribution of interest, Equation (D.22), as
\[
\begin{aligned}
\mathbb{E}_Q[\log p(\mu_U, \Lambda_U)]
&= \frac{1}{2}\Big[\sum_{i=1}^{d} \psi\Big(\frac{\bar\nu_U + 1 - i}{2}\Big) + d\log 2 + \log|\bar W_U|
- \bar\nu_U\beta_0\big(\operatorname{tr}(\bar W_U \bar\Lambda_U^{-1}) + (\bar\mu_U - \mu_0)^\top \bar W_U (\bar\mu_U - \mu_0)\big)\Big] \\
&\quad + \frac{\nu_0 - d - 1}{2}\Big[\sum_{i=1}^{d} \psi\Big(\frac{\bar\nu_U + 1 - i}{2}\Big) + d\log 2 + \log|\bar W_U|\Big]
- \frac{\bar\nu_U}{2}\operatorname{tr}\big(W_0^{-1}\bar W_U\big) + \text{const}.
\end{aligned}
\tag{D.26}
\]
D.2 Entropy

Given the mean field approximation, the entropy factorizes into a series of terms,
\[
\begin{aligned}
H[Q] &= -\mathbb{E}_Q[\log Q(\tau, \alpha_{1:N}, \beta_{1:M}, U_{1:N}, V_{1:M}, W_{1:M}, \gamma_{1:N}, \eta_{1:M}, \mu_U, \Lambda_U, \mu_V, \Lambda_V, \mu_W, \Lambda_W)] \\
&= -\mathbb{E}_Q[\log Q(\tau)]
- \sum_{i=1}^{N}\mathbb{E}_Q[\log Q(U_i)]
- \sum_{i=1}^{N}\mathbb{E}_Q[\log Q(\alpha_i)]
- \sum_{i=1}^{N}\mathbb{E}_Q[\log Q(\gamma_i)] \\
&\quad - \sum_{j=1}^{M}\mathbb{E}_Q[\log Q(V_j)]
- \sum_{j=1}^{M}\mathbb{E}_Q[\log Q(\beta_j)]
- \sum_{j=1}^{M}\mathbb{E}_Q[\log Q(\eta_j)]
- \sum_{k=1}^{M}\mathbb{E}_Q[\log Q(W_k)] \\
&\quad - \mathbb{E}_Q[\log Q(\mu_U, \Lambda_U)]
- \mathbb{E}_Q[\log Q(\mu_V, \Lambda_V)]
- \mathbb{E}_Q[\log Q(\mu_W, \Lambda_W)].
\end{aligned}
\tag{D.27}
\]
We will analyze each of the expectations in Equation (D.27) individually in the appropriately named sections that follow.
D.2.1 Feature Vectors
We derive the contribution from a single user feature:
\[
\begin{aligned}
\mathbb{E}_Q[\log Q(U_i)]
&= \frac{1}{2}\,\mathbb{E}_Q[\log|\Lambda_{U_i}|] - \frac{1}{2}\,\mathbb{E}_Q[(U_i - \mu_{U_i})^\top \Lambda_{U_i} (U_i - \mu_{U_i})] + \text{const} \\
&= \frac{1}{2}\log|\Lambda_{U_i}| - \frac{d}{2} + \text{const}.
\end{aligned}
\tag{D.28}
\]
The first term involves only a variational parameter, so its expectation is immediate, while the second term is constant: by Section A.2.1, the expectation of a quadratic form centered at its own mean and scaled by the matching precision is $\operatorname{tr}(\Lambda_{U_i}\Lambda_{U_i}^{-1}) = d$.
D.2.2 Precision Terms
We derive the contribution from the global precision factor; the other precisions follow analogously.
\[
\begin{aligned}
\mathbb{E}_Q[\log Q(\tau)]
&= \tilde a_\tau \log\tilde b_\tau - \log\Gamma(\tilde a_\tau) + (\tilde a_\tau - 1)\,\mathbb{E}_Q[\log\tau] - \tilde b_\tau\,\mathbb{E}_Q[\tau] \\
&= \tilde a_\tau \log\tilde b_\tau - \log\Gamma(\tilde a_\tau) + (\tilde a_\tau - 1)\big(\psi(\tilde a_\tau) - \log\tilde b_\tau\big) - \tilde b_\tau\,\frac{\tilde a_\tau}{\tilde b_\tau} \\
&= \log\tilde b_\tau - \log\Gamma(\tilde a_\tau) + (\tilde a_\tau - 1)\,\psi(\tilde a_\tau) - \tilde a_\tau.
\end{aligned}
\tag{D.29}
\]
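Equation (D.29) is minus the entropy of a Gamma distribution, which gives an easy check against a library implementation. A Python/SciPy sketch (illustrative; note SciPy parametrizes the Gamma by a scale, i.e. $1/\tilde b_\tau$):

```python
import numpy as np
from scipy.special import digamma, gammaln
from scipy.stats import gamma

a, b = 4.2, 1.7   # variational shape / rate for tau

# E_Q[log Q(tau)] from Equation (D.29).
e_logq = np.log(b) - gammaln(a) + (a - 1) * digamma(a) - a

# It must equal minus the Gamma entropy.
assert np.isclose(-e_logq, gamma(a, scale=1.0 / b).entropy())
```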
D.2.3 User Bias
For the user bias $\gamma_i$, the contribution to the entropy is
\[
\begin{aligned}
\mathbb{E}_Q[\log Q(\gamma_i)]
&= \frac{1}{2}\,\mathbb{E}_Q[\log\lambda_{\gamma_i}] - \frac{\lambda_{\gamma_i}}{2}\,\mathbb{E}_Q[(\gamma_i - \mu_{\gamma_i})^2] \\
&= \frac{1}{2}\log\lambda_{\gamma_i} - \frac{\lambda_{\gamma_i}}{2}\operatorname{Var}_Q[\gamma_i]
= \frac{1}{2}\log\lambda_{\gamma_i} - \frac{\lambda_{\gamma_i}}{2}\,\frac{1}{\lambda_{\gamma_i}}
= \frac{1}{2}\log\lambda_{\gamma_i} - \frac{1}{2}.
\end{aligned}
\tag{D.30}
\]
The item bias contributions are analogous.
D.2.4 Hyperparameters

We derive the contribution from the user hyperparameters $(\mu_U, \Lambda_U)$,
\[
\begin{aligned}
\mathbb{E}_Q[\log Q(\mu_U, \Lambda_U)]
&= \mathbb{E}_Q[\log Q(\mu_U \mid \Lambda_U)] + \mathbb{E}_Q[\log Q(\Lambda_U)] \\
&= \mathbb{E}_Q\Big[\frac{1}{2}\log|\Lambda_U| - \frac{N+\beta_0}{2}(\mu_U - \bar\mu_U)^\top \Lambda_U (\mu_U - \bar\mu_U)\Big] \\
&\quad + \mathbb{E}_Q\Big[-\frac{\bar\nu_U}{2}\log|\bar W_U| + \frac{\bar\nu_U - d - 1}{2}\log|\Lambda_U| - \frac{1}{2}\operatorname{tr}\big(\bar W_U^{-1}\Lambda_U\big)\Big] + \text{const}.
\end{aligned}
\tag{D.31}
\]
The expectation of the quadratic form is the constant $d/2$ (as in Section D.2.1), while $\log|\bar W_U|$ is a variational parameter. The remaining terms contribute
\[
\begin{aligned}
\mathbb{E}_Q[\log Q(\mu_U, \Lambda_U)]
&= \frac{\bar\nu_U - d}{2}\,\mathbb{E}_Q[\log|\Lambda_U|] - \frac{d}{2} - \frac{\bar\nu_U}{2}\log|\bar W_U|
- \frac{1}{2}\operatorname{tr}\big(\bar W_U^{-1}\,\bar\nu_U \bar W_U\big) + \text{const} \\
&= \frac{\bar\nu_U - d}{2}\Big[\sum_{i=1}^{d} \psi\Big(\frac{\bar\nu_U + 1 - i}{2}\Big) + d\log 2\Big]
- \frac{d}{2}\log|\bar W_U| - \frac{d}{2} - \frac{\bar\nu_U d}{2} + \text{const}.
\end{aligned}
\tag{D.32}
\]
Appendix E
Meta Constrained PMF
In this appendix, we derive the sampling distribution of the meta-feature vectors, or side features, for
the meta-constrained PMF model of Chapter 4. In addition, we derive the sampling distribution of the
user features in the presence of the meta-features.
E.1 Meta Features
Under the model, the terms involved in the sampling distribution for $W_k$ are the prior for $W_k$ and the priors of all user features,
\[
\begin{aligned}
\log p(W_k \mid \cdots)
&= \sum_{i=1}^{N} \log p(U_i \mid \cdots) + \log p(W_k) \\
&= \text{const} - \frac{1}{2}\sum_{i}\Big(U_i - \mu_U - \|f_i\|^{-1}\sum_{j} W_j f_{j,i}\Big)^{\!\top}\Lambda_U\Big(U_i - \mu_U - \|f_i\|^{-1}\sum_{j} W_j f_{j,i}\Big)
- \frac{1}{2}(W_k - \mu_W)^\top \Lambda_W (W_k - \mu_W) \\
&= \text{const} - \frac{1}{2}\sum_{i}\Big[\Big(U_i - \mu_U - \|f_i\|^{-1}\sum_{j\neq k} W_j f_{j,i}\Big)^{\!\top}\Lambda_U\Big(U_i - \mu_U - \|f_i\|^{-1}\sum_{j\neq k} W_j f_{j,i}\Big) \\
&\qquad\qquad - 2\|f_i\|^{-1} f_{k,i}\, W_k^\top \Lambda_U\Big(U_i - \mu_U - \|f_i\|^{-1}\sum_{j\neq k} W_j f_{j,i}\Big)
+ \|f_i\|^{-2} f_{k,i}^2\, W_k^\top \Lambda_U W_k\Big] \\
&\quad - \frac{1}{2}\big[W_k^\top \Lambda_W W_k - 2 W_k^\top \Lambda_W \mu_W + \mu_W^\top \Lambda_W \mu_W\big].
\end{aligned}
\tag{E.1}
\]
This expression is quadratic in $W_k$. Identifying the terms linear and quadratic in $W_k$ allows the distribution to be determined:
\[
\begin{aligned}
\text{Quadratic:}&\quad \Lambda_W + \Lambda_U \sum_{i=1}^{N} \|f_i\|^{-2} f_{k,i}^2 \\
\text{Linear:}&\quad \Lambda_U \sum_{i=1}^{N} \|f_i\|^{-1} f_{k,i}\Big(U_i - \mu_U - \|f_i\|^{-1}\sum_{j\neq k} W_j f_{j,i}\Big) + \Lambda_W \mu_W.
\end{aligned}
\tag{E.2}
\]
Therefore, the sampling distribution of $W_k$ is Gaussian, with parameters
\[
\begin{aligned}
\Lambda_{W_k} &= \Lambda_W + \Lambda_U \sum_{i=1}^{N} \|f_i\|^{-2} f_{k,i}^2 \\
\mu_{W_k} &= \Lambda_{W_k}^{-1}\Big[\Lambda_U \sum_{i=1}^{N} \|f_i\|^{-1} f_{k,i}\Big(U_i - \mu_U - \|f_i\|^{-1}\sum_{j\neq k} W_j f_{j,i}\Big) + \Lambda_W \mu_W\Big].
\end{aligned}
\tag{E.3}
\]
E.2 User Features
Under the model, the terms involved in the sampling distribution for $U_i$ are the prior for $U_i$ and the likelihood of all ratings $r_{i,j}$ for user $i$,
\[
\begin{aligned}
\log p(U_i \mid \cdots)
&= \log p(U_i \mid \mu_U, \Lambda_U, W_{1:K}) + \sum_{j} I_{i,j} \log p(r_{i,j} \mid \cdots) \\
&= -\frac{1}{2}\Big(U_i - \mu_U - \|f_i\|^{-1}\sum_{k} W_k f_{k,i}\Big)^{\!\top}\Lambda_U\Big(U_i - \mu_U - \|f_i\|^{-1}\sum_{k} W_k f_{k,i}\Big)
- \frac{\tau}{2}\sum_{j} I_{i,j}(r_{i,j} - U_i^\top V_j)^2 + \text{const} \\
&= -\frac{1}{2}\Big[U_i^\top \Lambda_U U_i - 2 U_i^\top \Lambda_U\Big(\mu_U + \|f_i\|^{-1}\sum_{k} W_k f_{k,i}\Big)
+ \Big(\mu_U + \|f_i\|^{-1}\sum_{k} W_k f_{k,i}\Big)^{\!\top}\Lambda_U\Big(\mu_U + \|f_i\|^{-1}\sum_{k} W_k f_{k,i}\Big)\Big] \\
&\quad - \frac{\tau}{2}\sum_{j} I_{i,j}\big[r_{i,j}^2 - 2 r_{i,j}\, U_i^\top V_j + U_i^\top V_j V_j^\top U_i\big] + \text{const}.
\end{aligned}
\tag{E.4}
\]
This expression is quadratic in $U_i$. Identifying the terms linear and quadratic in $U_i$ allows the distribution to be determined:
\[
\begin{aligned}
\text{Quadratic:}&\quad \Lambda_U + \tau\sum_{j} I_{i,j} V_j V_j^\top \\
\text{Linear:}&\quad \tau\sum_{j} I_{i,j}\, r_{i,j} V_j + \Lambda_U\Big(\mu_U + \|f_i\|^{-1}\sum_{k} W_k f_{k,i}\Big).
\end{aligned}
\tag{E.5}
\]
Therefore, the sampling distribution of $U_i$ is Gaussian, with parameters
\[
\begin{aligned}
\Lambda_{U_i} &= \Lambda_U + \tau\sum_{j} I_{i,j} V_j V_j^\top \\
\mu_{U_i} &= \Lambda_{U_i}^{-1}\Big[\tau\sum_{j} I_{i,j}\, r_{i,j} V_j + \Lambda_U\Big(\mu_U + \|f_i\|^{-1}\sum_{k} W_k f_{k,i}\Big)\Big].
\end{aligned}
\tag{E.6}
\]
Note that the precision of the user feature is unaffected by the inclusion of the side features.
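As with the earlier conditionals, Equation (E.6) can be checked numerically: the posterior mean must be the stationary point of the log conditional. A Python/NumPy sketch with randomly generated, illustrative inputs:

```python
import numpy as np

rng = np.random.default_rng(6)
d, M, K = 3, 5, 4
tau = 1.2
I = rng.integers(0, 2, size=M).astype(float); I[0] = 1.0
V = rng.standard_normal((M, d))
r = rng.standard_normal(M)
mu_u = rng.standard_normal(d)
Lam_u = 2.0 * np.eye(d)
W = rng.standard_normal((K, d))   # meta-feature vectors
f = rng.random(K)                  # user i's meta-feature weights f_{k,i}
prior_mean = mu_u + W.T @ f / np.linalg.norm(f)   # mu_U + ||f_i||^{-1} sum_k W_k f_{k,i}

# Sampling-distribution parameters from Equation (E.6).
Lam_post = Lam_u + tau * np.einsum('j,jp,jq->pq', I, V, V)
mu_post = np.linalg.solve(Lam_post, tau * (I * r) @ V + Lam_u @ prior_mean)

# mu_post must be the mode of the log conditional density.
def log_post(u):
    return (-0.5 * tau * np.sum(I * (r - V @ u)**2)
            - 0.5 * (u - prior_mean) @ Lam_u @ (u - prior_mean))

eps = 1e-5
grad = np.array([(log_post(mu_post + eps*e) - log_post(mu_post - eps*e)) / (2*eps)
                 for e in np.eye(d)])
assert np.allclose(grad, 0.0, atol=1e-6)
```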
Appendix F
User Features in Matrix
Factorization with User Networks
Vanilla Bayesian PMF models the ratings as Gaussian conditional on user and item features. Hierarchically, these features are given Gaussian-Wishart priors:
\[
\begin{aligned}
(r_{i,j} \mid U_i, V_j, \tau) &\sim \mathcal{N}(r_{i,j} \mid \gamma_i + \eta_j + U_i^\top V_j,\ \tau) \\
(U_i \mid \mu_U, \Lambda_U) &\sim \mathcal{N}(U_i \mid \mu_U, \Lambda_U) \\
(V_j \mid \mu_V, \Lambda_V) &\sim \mathcal{N}(V_j \mid \mu_V, \Lambda_V) \\
(\mu_U, \Lambda_U) &\sim \mathcal{N}(\mu_U \mid \mu_0, \beta_0\Lambda_U)\,\mathcal{W}(\Lambda_U \mid \nu_0, W_0) \\
(\mu_V, \Lambda_V) &\sim \mathcal{N}(\mu_V \mid \mu_0, \beta_0\Lambda_V)\,\mathcal{W}(\Lambda_V \mid \nu_0, W_0).
\end{aligned}
\tag{F.1}
\]
When user networks are present, we have considered a modified version of the above:
\[
\begin{aligned}
(r_{i,j} \mid S_i, V_j, \tau) &\sim \mathcal{N}(r_{i,j} \mid S_i^\top V_j,\ \tau) \\
(U_i \mid \mu_U, \Lambda_U) &\sim \mathcal{N}(U_i \mid \mu_U, \Lambda_U) \\
(S_i \mid \mu_S, \Lambda_S) &\sim \mathcal{N}\Big(S_i \ \Big|\ \mu_S + U_i + \|a_i\|^{-1}\sum_{j\neq i} U_j a_{i,j},\ \Lambda_S\Big) \\
(V_j \mid \mu_V, \Lambda_V) &\sim \mathcal{N}(V_j \mid \mu_V, \Lambda_V) \\
(\mu_U, \Lambda_U) &\sim \mathcal{N}(\mu_U \mid \mu_0, \beta_0\Lambda_U)\,\mathcal{W}(\Lambda_U \mid \nu_0, W_0) \\
(\mu_S, \Lambda_S) &\sim \mathcal{N}(\mu_S \mid \mu_0, \beta_0\Lambda_S)\,\mathcal{W}(\Lambda_S \mid \nu_0, W_0) \\
(\mu_V, \Lambda_V) &\sim \mathcal{N}(\mu_V \mid \mu_0, \beta_0\Lambda_V)\,\mathcal{W}(\Lambda_V \mid \nu_0, W_0),
\end{aligned}
\tag{F.2}
\]
where $a_{i,j} \in \{0, 1\}$ indicates the presence of an edge between users $i$ and $j$. In this appendix, we derive the posterior sampling distributions for $U_i$ and $S_i$, conditional on the rating data and the user network.
F.1 Sampling Distribution for Ui
The conditional posterior distribution for $U_i$ involves three sets of terms: the prior for $U_i$, the prior for $S_i$, and the priors for $S_k$ for all users $k$ that are connected to $i$. Since each of these terms is Gaussian, the conditional posterior is Gaussian. In what follows, we expand each term individually, recognize the linear and quadratic factors, and finally combine the three sets to obtain the required distribution.
F.1.1 Prior for Ui
The log-prior for $U_i$ contributes
\[
\log p(U_i) \propto -\frac{1}{2}(U_i - \mu_U)^\top \Lambda_U (U_i - \mu_U)
= -\frac{1}{2}\big[U_i^\top \Lambda_U U_i - 2 U_i^\top \Lambda_U \mu_U + \mu_U^\top \Lambda_U \mu_U\big],
\tag{F.3}
\]
giving the linear and quadratic terms
\[
\text{Linear: } \Lambda_U \mu_U, \qquad \text{Quadratic: } \Lambda_U.
\tag{F.4}
\]
F.1.2 Prior for Sk
The log-priors for the $S_k$ contribute
\[
\log p(S_i) + \sum_{k\neq i} \log p(S_k).
\tag{F.5}
\]
It is necessary to distinguish the cases $k = i$ and $k \neq i$, as the $U_i$ term appears differently in each.

For $k = i$,
\[
\begin{aligned}
\log p(S_i) &\propto -\frac{1}{2}\Big(S_i - \mu_S - U_i - \|a_i\|^{-1}\sum_{j} U_j a_{i,j}\Big)^{\!\top}\Lambda_S\Big(S_i - \mu_S - U_i - \|a_i\|^{-1}\sum_{j} U_j a_{i,j}\Big) \\
&= -\frac{1}{2}\Big[\Big(S_i - \mu_S - \|a_i\|^{-1}\sum_{j} U_j a_{i,j}\Big)^{\!\top}\Lambda_S\Big(S_i - \mu_S - \|a_i\|^{-1}\sum_{j} U_j a_{i,j}\Big) \\
&\qquad\quad - 2 U_i^\top \Lambda_S\Big(S_i - \mu_S - \|a_i\|^{-1}\sum_{j} U_j a_{i,j}\Big) + U_i^\top \Lambda_S U_i\Big].
\end{aligned}
\tag{F.6}
\]
This contributes the linear and quadratic terms
\[
\text{Linear: } \Lambda_S\Big(S_i - \mu_S - \|a_i\|^{-1}\sum_{j} U_j a_{i,j}\Big), \qquad \text{Quadratic: } \Lambda_S.
\tag{F.7}
\]
For $k \neq i$,
\[
\begin{aligned}
\log p(S_k) &\propto -\frac{1}{2}\Big(S_k - \mu_S - U_k - \|a_k\|^{-1}\sum_{j} U_j a_{k,j}\Big)^{\!\top}\Lambda_S\Big(S_k - \mu_S - U_k - \|a_k\|^{-1}\sum_{j} U_j a_{k,j}\Big) \\
&= -\frac{1}{2}\Big[\Big(S_k - \mu_S - U_k - \|a_k\|^{-1}\sum_{j\neq i} U_j a_{k,j}\Big)^{\!\top}\Lambda_S\Big(S_k - \mu_S - U_k - \|a_k\|^{-1}\sum_{j\neq i} U_j a_{k,j}\Big) \\
&\qquad\quad - 2\|a_k\|^{-1} a_{k,i}\, U_i^\top \Lambda_S\Big(S_k - \mu_S - U_k - \|a_k\|^{-1}\sum_{j\neq i} U_j a_{k,j}\Big)
+ \|a_k\|^{-2} a_{k,i}^2\, U_i^\top \Lambda_S U_i\Big].
\end{aligned}
\tag{F.8}
\]
This contributes the linear and quadratic terms
\[
\text{Linear: } \Lambda_S \sum_{k\neq i} \|a_k\|^{-1} a_{k,i}\Big(S_k - \mu_S - U_k - \|a_k\|^{-1}\sum_{j\neq i} U_j a_{k,j}\Big), \qquad
\text{Quadratic: } \Lambda_S \sum_{k\neq i} \|a_k\|^{-2} a_{k,i}^2.
\tag{F.9}
\]
The posterior sampling distribution for $U_i$ is therefore Gaussian, with precision equal to the sum of the three quadratic terms in Equations (F.4), (F.7), and (F.9), and mean equal to the inverse of that precision multiplied by the sum of the three linear terms in Equations (F.4), (F.7), and (F.9).
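The assembly of the three contributions above can be sketched in code. The following Python/NumPy example is illustrative only: it generates a random network, takes $\|a_k\|$ to be the number of neighbours of user $k$ (an assumption about the normalization), and checks basic properties of the resulting Gaussian parameters:

```python
import numpy as np

rng = np.random.default_rng(7)
d, N = 3, 6
Lam_U = 2.0 * np.eye(d)
Lam_S = 1.5 * np.eye(d)
mu_U = rng.standard_normal(d)
mu_S = rng.standard_normal(d)
U = rng.standard_normal((N, d))      # current user features
S = rng.standard_normal((N, d))      # current social features
A = (rng.random((N, N)) < 0.5).astype(float)
np.fill_diagonal(A, 0.0)
A = np.maximum(A, A.T)               # symmetric friendship network, a_{i,j} in {0,1}
A[0, 1] = A[1, 0] = 1.0              # ensure user 0 has at least one friend
i = 0
deg = A.sum(axis=1)                  # ||a_k||, taken here as the neighbour count

# Contribution (F.4): prior on U_i.
lin = Lam_U @ mu_U
quad = Lam_U.copy()

# Contribution (F.7): prior on S_i (the k = i case).
lin += Lam_S @ (S[i] - mu_S - (A[i] @ U) / deg[i])
quad += Lam_S

# Contribution (F.9): priors on S_k for neighbours k != i.
for k in range(N):
    if k == i or A[k, i] == 0.0:
        continue
    others = A[k].copy(); others[i] = 0.0
    resid = S[k] - mu_S - U[k] - (others @ U) / deg[k]
    lin += (A[k, i] / deg[k]) * (Lam_S @ resid)
    quad += (A[k, i]**2 / deg[k]**2) * Lam_S

# The conditional for U_i is Gaussian with precision `quad` and mean `quad^{-1} lin`.
mu_post = np.linalg.solve(quad, lin)
assert np.allclose(quad, quad.T)
assert np.all(np.linalg.eigvalsh(quad) > 0)   # the precision stays positive definite
assert mu_post.shape == (d,)
```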
F.1.3 Sampling Distribution for Si
The sampling distribution for $S_i$ follows from the standard Gaussian matrix factorization model with only user and item features $U_i, V_j$, with $S_i$ in place of $U_i$.
Bibliography
[1] R.P. Adams, G.E. Dahl, and I. Murray. Incorporating Side Information in Probabilistic Matrix
Factorization with Gaussian Processes. Arxiv preprint arXiv:1003.4944, 2010.
[2] S. Baluja, R. Seth, D. Sivakumar, Y. Jing, J. Yagnik, S. Kumar, D. Ravichandran, and M. Aly.
Video suggestion and discovery for youtube: taking random walks through the view graph. In
Proceeding of the 17th international conference on World Wide Web, pages 895–904, 2008.
[3] P. Damien and S.G. Walker. Sampling truncated normal, beta, and gamma densities. Journal of
Computational and Graphical Statistics, 10(2):206–215, 2001.
[4] K. Drakakis, S. Rickard, R. de Frin, and A. Cichocki. Analysis of financial data using non-negative
matrix factorization. In International Mathematical Forum, volume 3, pages 1853–1870, 2008.
[5] Rana Forsati, Mehrdad Mahdavi, Mehrnoush Shamsfard, and Mohamed Sarwat. Matrix factoriza-
tion with explicit trust and distrust relationships. arXiv preprint arXiv:1408.0325, 2014.
[6] Chris Fraley and Adrian E. Raftery. Bayesian Regularization for Normal Mixture Estimation and
Model-Based Clustering. Technical Report 486, Department of Statistics, 2005.
[7] Sheetal Girase, Debajyoti Mukhopadhyay, et al. Role of Matrix Factorization Model in Collaborative
Filtering Algorithm: A Survey. arXiv preprint arXiv:1503.07475, 2015.
[8] W.T. Glaser, T.B. Westergren, J.P. Stearns, and J.M. Kraft. Consumer item matching method and
system, February 21 2006. US Patent 7,003,515.
[9] Jennifer Golbeck. FilmTrust: Movie Recommendations from Semantic Web-based Social Networks.
In ISWC2005 Posters & Demonstrations, pages PID–72, 2005. Printed proceedings only.
[10] Prem Gopalan, Jake M Hofman, and David M Blei. Scalable recommendation with poisson factor-
ization. arXiv preprint arXiv:1311.1704, 2013.
[11] Prem Gopalan, Francisco J Ruiz, Rajesh Ranganath, and David M Blei. Bayesian Nonparametric
Poisson Factorization for Recommendation Systems. In AISTATS, pages 275–283, 2014.
[12] Asela Gunawardana and Christopher Meek. A unified approach to building hybrid recommender
systems. In Proceedings of the third ACM conference on Recommender systems, RecSys ’09, pages
117–124, New York, NY, USA, 2009. ACM.
[13] Guibing Guo, Jie Zhang, and Neil Yorke-Smith. A Novel Bayesian Similarity Measure for Recom-
mender Systems. In IJCAI, 2013.
[14] F Maxwell Harper and Joseph A Konstan. The MovieLens Datasets: History and Context. ACM
Transactions on Interactive Intelligent Systems (TiiS), 5(4):1–9, 2015.
[15] SeyedMohsen Jamali. Probabilistic Models for Recommendation in Social Networks. PhD thesis,
Applied Sciences: School of Computing Science, Simon Fraser University, 2013.
[16] Michal Kosinski, David Stillwell, and Thore Graepel. Private traits and attributes are pre-
dictable from digital records of human behavior. Proceedings of the National Academy of Sciences,
110(15):5802–5805, 2013.
[17] B. Lakshminarayanan, G. Bouchard, and C. Archambeau. Robust Bayesian Matrix Factorisation.
Journal of Machine Learning Research, 15, 2011.
[18] Joonseok Lee, Mingxuan Sun, and Guy Lebanon. A comparative study of collaborative filtering
algorithms. arXiv preprint arXiv:1205.3193, 2012.
[19] Yew Jin Lim and Yee Whye Teh. Variational Bayesian approach to movie rating prediction. In
Proceedings of KDD cup and workshop, volume 7, pages 15–21, 2007.
[20] Hao Ma, Irwin King, and Michael R Lyu. Learning to recommend with social trust ensemble.
In Proceedings of the 32nd international ACM SIGIR conference on Research and development in
information retrieval, pages 203–210. ACM, 2009.
[21] Hao Ma, Haixuan Yang, Michael R Lyu, and Irwin King. Sorec: social recommendation using
probabilistic matrix factorization. In Proceedings of the 17th ACM conference on Information and
knowledge management, pages 931–940. ACM, 2008.
[22] Benjamin M Marlin. Modeling User Rating Profiles For Collaborative Filtering. In NIPS, pages
627–634, 2003.
[23] B.M. Marlin and R.S. Zemel. Collaborative prediction and ranking with non-random missing data.
In Proceedings of the third ACM conference on Recommender systems, page 512. ACM, 2009.
[24] B.M. Marlin, R.S. Zemel, S. Roweis, and M. Slaney. Collaborative filtering and the missing at
random assumption. In Uncertainty in Artificial Intelligence: Proceedings of the 23rd Conference,
volume 47, pages 50–54, 2007.
[25] Paolo Massa and Paolo Avesani. Trust-aware recommender systems. In RecSys ’07: Proceedings
of the 2007 ACM conference on Recommender systems, pages 17–24, New York, NY, USA, 2007.
ACM.
[26] Paolo Massa, Kasper Souren, Martino Salvetti, and Danilo Tomasoni. Trustlet, open research on
trust metrics. Scalable Computing: Practice and Experience, 9(4), 2008.
[27] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
[28] Saralees Nadarajah and Samuel Kotz. R Programs for Computing Truncated Distributions. Journal
of Statistical Software, 16(Code Snippet 2), 2006.
[29] Ulrich Paquet, Blaise Thomson, and Ole Winther. A hierarchical model for ordinal matrix factor-
ization. Statistics and Computing, 22(4):945–957, 2012.
[30] I. Porteous, A. Asuncion, and M. Welling. Bayesian matrix factorization with side information and
Dirichlet process mixtures. In AAAI Conference on Artificial Intelligence, pages 563–568, 2010.
[31] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov
chain Monte Carlo. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors, ICML,
volume 307 of ACM International Conference Proceeding Series, pages 880–887. ACM, 2008.
[32] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic Matrix Factorization. In Advances in Neural
Information Processing Systems, volume 20, 2008.
[33] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. Restricted Boltzmann machines for
collaborative filtering. In ICML ’07: Proceedings of the 24th international conference on Machine
learning, pages 791–798, New York, NY, USA, 2007. ACM.
[34] United States Securities and Exchange Commission. Registration No. 333-179287, 2012.
[35] Xiaoyuan Su and Taghi M Khoshgoftaar. A survey of collaborative filtering techniques. Advances
in Artificial Intelligence, 2009.
[36] Jiliang Tang, Huiji Gao, and Huan Liu. mTrust: discerning multi-faceted trust in a connected
world. In Proceedings of the fifth ACM international conference on Web search and data mining,
pages 93–102. ACM, 2012.
[37] Jiliang Tang, Huiji Gao, Huan Liu, and Atish Das Sarma. eTrust: Understanding trust evolution in
an online world. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 253–261. ACM, 2012.
[38] Truyen Tran, Dinh Q. Phung, and Svetha Venkatesh. Thurstonian Boltzmann Machines: Learning
from Multiple Inequalities. In ICML (2), volume 28 of JMLR Proceedings, pages 46–54. JMLR.org,
2013.
[39] Martin J. Wainwright and Michael I. Jordan. Graphical Models, Exponential Families, and Varia-
tional Inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
[40] Jason Weston, Chong Wang, Ron J. Weiss, and Adam Berenzweig. Latent Collaborative Retrieval.
In ICML. icml.cc / Omnipress, 2012.
[41] YouTube. YouTube Statistics.
[42] Yisong Yue and Carlos Guestrin. Linear submodular bandits and their application to diversified
retrieval. In Advances in Neural Information Processing Systems, pages 2483–2491, 2011.