[IEEE 2008 IEEE International Conference on Research, Innovation and Vision for the Future in Computing and Communication Technologies - Ho Chi Minh City, Vietnam (2008.07.13-2008.07.17)]

Collaborative Filtering by Multi-task Learning

Nguyen Duy Phuong, Tu Minh Phuong Faculty of Information Technology

Posts and Telecommunications Institute of Technology, Vietnam [email protected], [email protected]

Abstract Collaborative filtering is a technique to predict users’ interests for items by exploiting the behavior patterns of a group of users with similar preferences. This technique has been widely used for recommender systems and has a number of successful applications in E-commerce. In practice, a major challenge when applying collaborative filtering is that a typical user provides ratings for just a small number of items, thus the amount of training data is sparse with respect to the size of the domain. In this paper, we present a method to address this problem. Our method formulates the collaborative filtering problem in a multi-task learning framework by treating each user rating prediction as a classification problem and solving multiple classification problems together. By doing this, the method allows sharing information among different classifiers and thus reduces the effect of data sparsity.

Keywords collaborative filtering, multi-task learning, boosting

I. INTRODUCTION

Recommender systems aim to overcome information overload by generating personalized suggestions that help people find relevant products, or documents. These systems have played an important role in E-commerce and information filtering with a number of commercial systems deployed (Amazon.com, Netflix.com).

There are two standard approaches to building recommender systems: collaborative filtering [1,13], and content-based filtering [2]. Collaborative filtering systems collect user feedback in form of ratings in a given domain. The systems then use these data to find users with most similar profiles and use their ratings to predict ratings for new items. On the other hand, content-based filtering provides recommendations by comparing content representations of items to find the item most similar to items that interest the user. Since content-based filtering require items be associated with informative content descriptions, collaborative filtering has clear advantage in domains where such descriptions do not exist. In this paper, we focus on the collaborative filtering systems.

Numerous algorithms and domains for collaborative filtering have been proposed. The very first methods form a heuristic implementation of the “Word of Mouth” phenomenon [13].� A review and comparison of earlier works is given in [6]. More recently, Adomavicius and Tuzhilin [1] provided a comprehensive review of up-to-date methods and application domains. Most notable collaborative filtering

algorithms can be categorized as memory-based [13] and model-based [3,5] or a combination of them [15]. The model-based approach can be further divided into user-based and item-based depending on whether the similarities among users or items are used.

A fundamental challenge for collaborative filtering algorithm is data sparsity. In practice, most users do not provide ratings for most items and thus the user-item matrix is very sparse with many ratings left undefined. As a result, the accuracy of recommendation is often quite poor. To address this problem, a number of techniques have been proposed. Billsus and Pazzani [5] proposed to fill in the rating matrix with 0 and then reduce the data dimensionality by using Singular Value Decomposition (SVD). In [16], Zitnick and Kanade use maximum entropy to alleviate the effect of data sparsity. Other methods are to combine memory-based and model-based approaches [15], or combine collaborative and content-based filtering [19]. A recent work reports improved predicted accuracy on sparse data when exploiting both user similarity and item similarity [20].

We follow the modeling approach proposed in [3,5], which casts collaborative filtering as a classification problem and use machine learning algorithms to predict unknown ratings. Within this approach, the most natural way is to treat each user as an independent, separate classification problem, create one classifier for each user. Due to the symmetric nature of user and item, an alternative way is to treat each item as a separate classification problem. Despite its intuitiveness and simplicity, the accuracy of independent classifiers suffers when data get sparse. Another problem is that the system should maintain as many classifiers as the number of users.

In this paper, we propose to tackle the data sparsity problem by solving multiple classification problems for all users at the same time. In the machine learning literature, this approach is known as multi-task learning or transfer learning[7]. The rationale behind transfer learning is that learning multiple classifiers together allows transforming information among them and thus improves the overall accuracy while requires less training data. Yu and Tresp [21] have shown that when formulated as the low-rank matrix approximation problem, collaborative filtering has close connection with multi-task learning.

In this work, we use a boosting algorithm for all users at the same time. Information is transformed among users via shared features that the algorithm selects at each boosting rounds. This method was proposed in [14] for the face

227978-1-4244-2379-8/08/$25.00 (c)2008 IEEE

recognition problem but the problem, objectives and the visual features used in their work are different from the ones we have at hand.

The rest of the paper is structured as follows. In section 2 we formulate collaborative filtering as a classification problem. In section 3 we give a brief review of a boosting algorithm for binary classification and then show how to modify this algorithm to deal with multiple users at the same time. Section 4 presents experimental evaluation of our method. We conclude the paper in section 5.

II. PROBLEM FORMULATION

In this section we formulate collaborative filtering as a classification problem. This formulation closely follows the one described by Billsus and Pazzani [5] and is given here for the self-completeness of the paper.

Let U = {u1, …, uK} be a set of K users, G = {g1, …, gN}be a set of N items. Each item can be a book, a movie or a web page. The users’ ratings with respect to the items are arranged in matrix R = { rij }i=1,..,K; j=1,..,N. The ratings can be collected either explicitly by asking users to give their judgments on items or implicitly from log data, e.g. a document is recorded as relevant for a user if the user had open the document. In general, rij can get a value from an ordered set of rating values. For simplicity, in this work, we assume that rij can take one of three values: +1 if user i labels item j as desired, −1 if user ilabels item j as undesired, or ∅ if no rating is given. The problem is to predict the ratings a particular user k – called active user - would give to unrated items.

A common way to solve the presented problem is to treat prediction for every user u as an independent classification problem. In this setting, every item g for which u has provided ratings is considered as a training example with the rating as its classification label and ratings given by other users as its features. For the specific user, we train the classifier and use learned rules to make predictions for items with unknown ratings. Overall, the method requires K independent classifiers.

Consider the example given in table 1 with four users and five items. Assume we want to predict the rating of user u4 for item g5. Since user u4 has given ratings for three items g1, g2,g3, this information can be use as training data and leads to three training examples g1, g2, and g3. The examples are presented by feature vectors where u1, u2, u3 are features and their ratings are feature values. The examples’ labels are ratings given by u4.

TABLE 1. EXAMPLE OF RATING MATRIX

g1 g2 g3 g4 g5u1 +1 −1 +1 +1 u2 +1 +1 −1u3 +1 +1 −1 −1u4 +1 −1 +1 ?

With the above formulation, we can use ratings data to train classifiers. Predictions for a specific user are then made by applying the learned classification rule to items with unknown ratings.�

III. COLLABORATIVE FILTERING BY BOOSTING

Within the context of collaborative filtering a key challenge is that the prediction quality largely depends on training data and can be poor when many ratings are not given. To make predictions more robust against data sparsity, in this section we present a method that solves multiple classification problems together. The rationale behind this is that it allows sharing information among related problems and thus alleviates the effect of data sparsity. The method presented here is inspired by a previous work on image classification [14], in which the authors proposed to use multiple classifiers with common features to recognize objects in images.

To train multiple classifiers and select common feature, we use a modified version of classifiers based on boosted decision stumps. Next we briefly review boosting for binary classification and then present modified boosting for multiple classification problems with application to collaborative filtering.

A. Boosted decision stumps for binary classification Boosting is a machine learning technique that learns a

classifier F(x) that is an additive combination of multiple weak classifiers of the form

==

M

m m xfxF1

)()( , (1)

where fm (x) is a weak classifier, x is the input vector, M is the number of boosting rounds (number of weak classifiers). In the literature, F(x) is called strong classifier as its accuracy is higher than that of each independent weak classifier.

There are a number of boosting algorithms in the literature [8,9,12]. Following [14], we chose gentleboost, a version of boosting first presented in [9], to base our algorithm on. The reasons to choose gentleboost are its simplicity, numerical robustness and reportedly good performance in comparison with other boosting methods [14].

The gentleboost algorithm can be described as follows. Given N training examples (x1, y1), …, (xN, yN), where xi, i = 1,..,N, is a feature vector and yi is the respective label: yi = +1 or −1 (“like” or “dislike” in the context of collaborative filtering). The classifier is as given in (1) and the (binary) prediction for input vector x is calculated as sign(F(x)). The learning process is performed through M rounds; in each round the boosting algorithm does two things: 1) it learns a weak classifier by choosing a feature that best classifies weighted training examples, and 2) it recalculate the examples’ weights so that misclassified examples get higher weights. The algorithm is summarized in figure 1.

At step (a) of each round, the algorithm chooses fm (x) to minimize the weighted squared error:

=−=

N

i iimi yxfwJ1

2))(( (2)

It is common to define the weak learners as decision stumps (decision trees with only one note that make decisions by checking the values of single features). A decision stump

228978-1-4244-2379-8/08/$25.00 (c)2008 IEEE

selects a feature and assigns an output label to an example based on the feature value of the example. The decision stump can be defined as a function of the form:

fm (x) = aδ (x f > t) + bδ (x f ≤ t), (3)

where δ (e) = 1 if e is true and δ (e) = 0 otherwise, t is a threshold, a and b are parameters to be chosen when fitting the stump, x f is the value of f-th feature of x. In the considered settings, the ratings can only be +1 or −1 and thus we choose the threshold t to be zero t = 0.

1. Initialize wi = 1/N, i = 1..N, wi is the weight of i-th example. Initialize F (x) = 0

2. Repeat for m = 1, 2, …, Ma. Train fm (x) using weighted training examples b. Update F (x) ← F (x) + fm (x)c. Update weights )( imi xfy

ii eww −← and normalize so that they form a distribution

3. Return classifier ])([sign)]([sign1=

=M

m m xfxF

Figure 1. The gentleboost algorithm

It should be noted that the decision stump performs feature selection as it selects a single feature to most accurately classify the weighted training examples. We fit the decision stump by considering all possible features f. For each feature f,we can estimate parameters a and b by weighted least squares as follows:

>

>=

i

i

fi

fii

xw

xywa

)0(

)0(

δ

δ (4)

≤

≤=

i

i

fi

fii

xw

xywb

)0(

)0(

δ

δ (5)

The f and corresponding a, b that give the minimal value of error J are chosen to form the weak learner at the current round. The algorithm adds this weak learner to the previous ones (step (b)). Finally, at step (c), the algorithm updates the weights of examples as follows:

)( imi xfyii eww −← (6)

This update increases the weights of examples that are misclassified and decreases the weights of examples that are correctly classified in the current round and thus forces the algorithm to focus on misclassified examples at the next rounds.

B. Boosting for multiple classification problems at the same time The boosting algorithm described above can be applied to

predict the ratings for each user independently. In this section, we describe how to modify boosting to deal with multiple users at the same time. The basic idea here is that when users share similar interests they should have common features that we can use to make predictions for all of them together. Training multiple classifiers at the same time needs less training data since many classes share such common features and information is transformed among similar users. We will show how to modify boosting to use shared stumps and features.

Recall that for the formulation given in section 2, there are K classification problems. The k-th problem is given with training data (xk

1, yk1),…, (xk

N, ykN), where yk

j = rkj, and xkj = (r1j, …, r(k-1)j, r(k+1)j, …, rKj). Note that only columns with rkj ≠∅ can serve as training examples for k-th problem. However, for the ease of presentation, we include examples with rkj = ∅.During training phase, such examples have zero weights and thus do not affect training results.

The major modification to boosting is that at each round, the algorithm selects a feature and corresponding stump that give the lowest error for a subset of classification problems instead of a single classification problem.

Specifically, we modify the above boosting algorithm as follows. Each training example now has K weights wk

j, k = 1,…,K. wk

j = 0 if rkj = 0, i.e. item j is not a training example for classifier k. We sum up classification errors of all Kclassifiers to form the new error function:

==−=

N

ikii

km

ki

K

kyxfwJ

12

1))(( , (7)

where )(xf km denotes weak learner k at round m. Let S(t)

be a subset of classification problems. At round m, instead of choosing the best weak learner for each independent problem as before, the algorithm selects the best class subset and corresponding stump that have uses the same feature for all the subset’s problems. The best stump for each class subset is one that minimizes error (7). The stump has the following form:

∉

∈≤+>=

)(if

)(if)0()0(),(

tSkc

tSkxbxatxf

kS

fS

fSk

mδδ

(8)

The value of fm() depends on which subset S(t) is considered so we write fm() as a function of t. Thus, ),( txf k

m

denotes the weak learner at round m for problem k when subset S(t) of classification problem is considered. Because, the cost function (7) now depends on S(t) too, we rewrite (7) as function of t as follows:

==−=

N

i ik

mki

ki

K

ktxfywtJ

12

1)),(()( (9)

229978-1-4244-2379-8/08/$25.00 (c)2008 IEEE

The decision stump (8) differentiates two cases: )(tSk ∈ and )(tSk ∉ . In the later case the purpose of constant

cSk is to prevent the classifier from choosing a class label just

due to the imbalance between positive and negative training examples. Therefore, in order to add a class to the shared subset, we need to have a decrease of the classification error that is larger than just using a constant as weak classifier. This insures that the shared features really provide additional information.

As noted in [14], when a stump is shared among several classes, the error for each shared class increases with respect to a stump optimized just for that class. However, because more classes have their classification error reduced when the stump is shared, the total multi-class error decreases.

Minimizing (9) gives the following parameter values:

,)0(

)0(

)(

)(

)(

∈

∈

>

>

=

tSk i

fi

ki

tSk i

fi

ki

ki

Sxw

xyw

faδ

δ

(10)

,)0(

)0(

)(

)(

)(

∈

∈

≤

≤

=

tSk i

fi

ki

tSk i

fi

ki

ki

Sxw

xyw

fbδ

δ

(11)

=

i

ki

i

ki

ki

kS

w

yw

c , k ∉ S(t) (12)

The modified boosting (denoted by MCBoost) is given in figure 2.

To find the best class subset S (t), we should enumerate all 2K-1 possible class subsets and find corresponding stumps, which is too expensive. To reduce the computational cost, we use greedy search over combinations of classes.

At each round, the algorithm starts by considering all subsets containing only single class and selects one with lowest error value. Then, the algorithm selects the next class that gives the best error reduction jointly with previously selected classes. We repeat adding classes until we have added all the classes. This procedure gives K nested subsets, from which we choose the one with the lowest error value. Since, at each step, we must consider adding one from the O(K) classes and there are K steps, the overall complexity of finding the best subset at each round is O ( K2).

C. Related workAs we mentioned above, our method uses the algorithm of

Torralba et al. [14], which was designed for classifying visual objects. However, the problem formulation and objectives in our case are different from those in their work. Specifically:

- In [14], the algorithm was proposed for classifying visual objects into multiple classes (multi-class classification) while in our work the problem is to solve multiple binary classification problems for multiple users. Thus, we need a different problem formulation when applying the modified boosting algorithm.

- The main objective of using modified boosting in [14] is to find and utilize features that visual objects have in common to reduce computational complexity. In the case of collaborative filtering, one of the main objectives is to find users with similar preferences when data are sparse. Indeed, finding subsets S(t) of users that have common interests (common features) is important because it can explain the prediction results.

1. Initialize wkj = 1 if rkj ≠ ∅ and wk

j = 0 if rkj = ∅, i = 1,.., N; k = 1,.., KInitialize Fk (x) = 0

2. Repeat for m = 1, …, Ma. Repeat for subsets S (t)

i. Fit decision stump

∉

∈≤+>=

)(if

)(if)0()0(),(

tSkc

tSkxbxatxf

k

fS

fSk

mδδ

,

using (10),(11),(12) ii. Compute error

==−=

N

ikii

km

ki

l

kytxfwtJ

12

1)),(()(

b. Find the best t : )(minarg* tJtt

=

c. Update Fk (x) ← Fk (x) + f km(x,t*)

d. Update weights ),( *txfyki

ki

imkieww −←

3. Return classifier sign [ Fk (x)]

Figure 2. The modified boosting algorithm with shared stumps

IV. EXPERIMENTAL EVALUATION

In this section we present the empirical evaluation of our method. We compared the method with two other methods on how accurate they can make predictions for two real datasets.

A. Evaluation settings For each dataset, we divided all the users into training set

Utr and test set Ut. Ratings given by the users from the training set were used as feature values for creating the models while the test users were used to score the prediction accuracy. We divided the ratings for each test user into observed set, and held-out set that we attempted to predict. The ratings from the observed set were input to predict the ratings from the held-out set.

Following experimental procedures reported in the literature, we used the mean absolute error (MAE) as the

230978-1-4244-2379-8/08/$25.00 (c)2008 IEEE

evaluation metric. MAE gives the average absolute deviation of predicted ratings to the actual ratings for all items from the held-out set and all test users:

L

rr

MAE th UuGy

uy

uy

∈∈

−

= ,

ˆ

where Gh is the held-out set, L is number of all tested ratings.

B. Datasets We evaluated the algorithm on two datasets: EachMovie

[17] and MovieLens [18], which are common benchmarks for collaborative filtering.

EachMovie is a database of movie ratings collected by Systems Research Center of Digital Equipment Corporation. The dataset contains 2811983 ratings given by 72916 users for 1628 movies. User ratings were recorded on a numeric six-point scale (0.0, 0.2, 0.4, 0.6, 0.8, 1.0). Following [5], we transformed the two highest scores (0.8 and 1.0) into +1 (like) and the rest into −1 (dislike). In other words, we give label +1 to a movie if the rating for the movie was 0.8 or 1.0, or −1otherwise.

The second dataset, MovieLens is provided by GroupLens – a research group at university of Minnesota. It contains 1000209 ratings from 6040 users for 3900 movies. Ratings are in five-point scale (1, 2, 3, 4, 5). We transformed two highest scores into +1 and the rest into −1 in a similar way as described above.

For EachMovie dataset we randomly selected 10000 users who had rated more than 20 movies. For MovieLens dataset we randomly selected 500 users with more than 20 ratings.

C. Comparison and results We compared the method (denoted by MCBoost) with two

other methods:

- The first one is the k-nearest neighbor using Pearson correlation as measure of similarity (denoted by KPC) – the most common memory-based collaborative filtering method in practice.

- The second one is the original boosting algorithm for independent users (denoted by GentleBoost).

For MovieLens dataset, 100, 200, 300 users were selected randomly for training and other 200 randomly selected users were used for testing. For EachMovie dataset , 1000, 2000, 6000 users were selected randomly for training and the rest 4000 users were used for testing.

To study the impact of data sparsity on prediction accuracy we varied the number of known ratings for each user to be 5, 10, 20 to simulate different degrees of data sparsity.

The MAE values for MovieLens and EachMovie datasets are given in table 2 and 3 respectively.

The results show that, both boosting methods are superior to KPC. The MAE values of the boosting are less than those of KPC for all training data and degrees of data sparsity.

When many ratings are known, GentleBoost gives slightly more accurate predictions than MCBoost. Specifically, when 20 ratings per user are known, GentleBoost gives better MAE in five out of six cases. A possible explanation is that GentleBoost selects features that give lower error value for individual classes when enough training data are provided.

As we expected, when data become sparser (there are 10 or less ratings per user), MC Boost makes more accurate predictions in most cases. This is due to favorable effect of sharing features among multiple classes.

TABLE 2. MAE VALUES FOR MOVIELENS DATASET

Number of known ratings Training set sizes

Method

5 10 20

KPC 0.378 0.337 0.328

GentleBoost 0.350 0.322 0.291 100 users

MC Boost 0.329 0.305 0.292

KPC 0.361 0.330 0.318


MC Boost 0.314 0.299 0.289

KPC 0.348 0.336 0.317


MC Boost 0.308 0.298 0.283

TABLE 3. MAE VALUES FOR EACHMOVIES DATASET

Number of known ratings Training set sizes

Method

5 10 20

KPC 0.559 0.474 0.449


MC Boost 0.492 0.460 0.429

KPC 0.528 0.450 0.422


MC Boost 0.484 0.419 0.393

KPC 0.521 0.437 0.378


MC Boost 0.452 0.397 0.365

231978-1-4244-2379-8/08/$25.00 (c)2008 IEEE

V. CONCLUSION

We have proposed to use a transfer learning method for collaborative filtering. The method uses a modified version of boosting that learns classifiers for multiple users at the same time. Since the classifiers can share features that are common among users with similar interests, the method needs less training data and thus is able to deal with data sparsity – a fundamental problem of collaborative filtering. Experimental results show that our proposed method gives robust prediction accuracy even when data are sparse. For future work, we plan to apply other transfer learning methods to collaborative filtering and perform more comprehensive comparison with previous methods.

ACKNOWLEDGEMENTS

The first author was supported by VNPT and Posts & Telecommunications Institute of Technology, Vietnam. This work was supported by Ministry of Science and Technology of Vietnam under a grant for fundamental research.

REFERENCES

[1] G. Adomavicius, A. Tuzhilin. “Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions”. IEEE Transactions On Knowledge And Data Engineering, Vol. 17, No. 6, 2005.

[2] M. Balabanovic and Y. Shoham, “Fab: Content-Based, Collaborative Recommendation,” Communications of the ACM, vol. 40, no. 3, pp. 66-72, 1997.

[3] C. Basu; Hirsh, H.; Cohen, W. 1998. Recommendation as classification: Using social and content-based information in recommendation. In Proc. of 15 Nat. Conf. on Artificial Intelligence (AAAI-98), 714–720.

[4] J. A Baxter, Model for Inductive Bias Learning. J. of Artificial Intelligence Research, 2000.

[5] D. Billsus and M. Pazzani, “Learning Collaborative Information Filters,” Proc. Int’l Conf. Machine Learning, 1998.

[6] J. S. Breese, D. Heckerman, and C. Kadie. “Empirical analysis of

Predictive Algorithms for Collaborative Filtering”. In Proc. of 14th Conf. on Uncertainty in Artificial Intelligence, 1998. 43-52.

[7] R. Caruana. “Multi–task learning”. Machine Learning, 28, p. 41–75, 1997.

[8] Y. Freund and R. Schapire. “Experiments with a new boosting algorithm”. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148-156, 1996.

[9] J. Friedman, T. Hastie and R. Tibshirani. “Additive logistic regression: a statistical view of boosting”. The Annals of Statistics, 38(2):337-374, April, 2000.

[10] J. Herlocker; J. Konstan, A. Borchers, and J. Riedl. “An algorithmic framework for performing collaborative filtering”. In SIGIR ’99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 230–237. 1999.

[11] J. Herlocker, J. Konstan, L.G. Terveen, and J. Riedl, “Evaluating Collaborative Filtering Recommender Systems”. ACM Transactions on Information Systems, Vol. 22, No. 1, 5–53, 2004.

[12] R. Schapire, “The Boosting Approach to Machine Learning: An Overview,” Proc. MSRI Workshop Nonlinear Estimation and Classification, 2001.

[13] U. Shardanand and P. Maes, “Social Information Filtering: Algorithms for Automating ‘Word of Mouth’,” Proc. Conf. Human Factors in Computing Systems, 1995.

[14] A. Torralba, K.P. Murphy, and W. T. Freeman. Sharing Visual Features for Multiclass and Multiview Object Detection. IEEE Trans. On Pattern Analysis And Machine Intelligence, Vol. 29, No. 5, May 2007.

[15] G.-R. Xue, C. Lin, Q. Yang, W. Xi, H.-J. Zeng, Y. Yu, and Z. Chen. Scalable collaborative filtering using cluster-based smoothing. In Proc. of SIGIR, 2005.

[16] L. Zitnick, T. Kanade, “Maximum entropy for collaborative filtering”. Proceedings. of Int. Conference on Machine learning (ICML-04), 2004.

[17] http://www.reserch.compaq.com/SRC/eachmovie/ (this link is no more alive, the dataset was downloaded before the link had been disable)

[18] http://www.movielens.org [19] J. Balisico, T, Hofman, Unifying collaborative and content-based

filtering. In Proceedings. of Int. Conf. on Machine learning (ICML-04) 2004.

[20] J. Wang, A.P. de Vries, M.J.T. Reinders, Unifying user-based and item-based collaborative filtering approaches by similarity fusion. Proc. of SIGIR’06. Seatle, USA.

[21] K. Yu and V. Tresp. Learning to Learn and Collaborative Filtering. NIPS Workshop on Inductive transfer. 2005.

232978-1-4244-2379-8/08/$25.00 (c)2008 IEEE

Documents

[IEEE 2008 IEEE International Conference on Research, Innovation and Vision for the Future in Computing and Communication Technologies - Ho Chi Minh City, Vietnam (2008.07.13-2008.07.17)]