
VIETNAM NATIONAL UNIVERSITY, HANOI

UNIVERSITY OF ENGINEERING AND TECHNOLOGY

----------------------------------------

INTERNSHIP REPORT

TITLE: COLLABORATIVE FILTERING TECHNIQUES IN RECOMMENDER SYSTEMS

Supervisor: Assoc. Prof. Dr. Hà Quang Thụy

Student: Mai Công Đạt

Student ID: 11020067

Group: K56CA (QH2011-CQ-CA)

Hanoi, October 2014

TABLE OF CONTENTS

1 INTRODUCTION

1.1 About Knowledge Technology Laboratory

1.2 About Topic: Collaborative Filtering techniques in Recommender Systems

2 COLLABORATIVE FILTERING

2.1 Recommender Systems

2.2 Collaborative Filtering

2.2.1 Overview

2.2.2 Collaborative Filtering Process

2.3 Collaborative Filtering Algorithms

2.3.1 Cosine Similarity

2.3.2 Pearson Correlation Similarity

2.3.3 Singular Value Decomposition (SVD)

3 EXPERIMENT

3.1 Recommendation Engine: RecDB

3.2 Experiment

4 CONCLUSION AND FUTURE WORK

5 REFERENCES

ACKNOWLEDGMENTS

I would like to express my deep appreciation to Associate Professor Dr. Ha Quang Thuy, who supervised and guided me throughout the internship.

I would like to thank my colleagues at the Knowledge Technology Laboratory (KT-Lab), who supported me in completing this report.

I would also like to express my gratitude to the University of Engineering and Technology for providing the environment and conditions for my studies.

Because time was limited, this report inevitably has shortcomings. I look forward to comments from my teachers and from readers interested in this topic.

TABLE OF FIGURES

Figure 1. Collaborative Filtering Process

Figure 2. Turn on database server

Figure 3. Create and run database “movielensdb”

Figure 4. Import initmovielens1mdatabase.sql

Figure 5. Check list of relations

Figure 6. Top-10 movies recommendation based on the rating predicted using Item-Item Collaborative Filtering

Figure 7. Recommends the top 5 action movies to user 1

Figure 8. Recommends the top 5 action movies to user 2

Figure 9. Recommends the top 5 action movies to user 3

1 INTRODUCTION

1.1 About Knowledge Technology Laboratory

The Knowledge Technology Laboratory belongs to the Faculty of Information Technology. Its main research fields are:

Text Mining, Web Mining, Opinion Mining, Social Media Mining, and Natural Language Processing

Vietnamese Entity/Object Search

Process Mining, Knowledge Technology and Service Science

The head of the Knowledge Technology Laboratory is Associate Professor Dr. Ha Quang Thuy.

1.2 About Topic: Collaborative Filtering techniques in Recommender Systems

Recommender Systems can be divided into two main categories: Content-based systems and Collaborative Filtering systems [1] [2] [3]. For my internship, I chose the Collaborative Filtering approach for the following reasons:

First, Collaborative Filtering is based on a simple idea, so it is easy to understand and implement.

Second, although Collaborative Filtering is simple, it is intuitive and effective, and it is widely used, for example by Amazon.com, Yahoo, and Cinemax.com.

Finally, Collaborative Filtering is a fundamental method whose performance has been demonstrated, and it can still be improved.

2 COLLABORATIVE FILTERING

2.1 Recommender Systems

Recommender Systems are a subclass of Information Filtering systems that are used to predict the preference a user would give to an item (movies, books, music, news, Web pages, images, …) [1] [4].

Typically, Recommender Systems produce a list of recommendations in one of two ways: through Collaborative Filtering or Content-based Filtering [5] [1] [2]. Collaborative Filtering approaches construct a model from a user's past behavior toward items, then use that model to predict items (or ratings for items) that the user may be interested in [5] [2]. Content-based Filtering approaches use a series of discrete characteristics of an item in order to recommend additional items with similar properties [2]. In real systems, these approaches are often combined; the result is known as a Hybrid Recommender System. Netflix is a good example of a hybrid system. In this report, I focus on Collaborative Filtering.

2.2 Collaborative Filtering

2.2.1 Overview

Collaborative Filtering is a technique that automatically predicts the interest of an active user by collecting rating information from other similar users or items. The underlying assumption of Collaborative Filtering is that the active user will prefer the items that similar users prefer [6]. Collaborative Filtering can be divided into two approaches: Memory-based and Model-based [2].

Memory-based approaches (also known as Nearest Neighbor Collaborative Filtering or User-based approaches) [5] are the most popular prediction methods and are widely adopted in commercial Collaborative Filtering systems [7] [8]. These algorithms use the entire user-item database to generate a prediction; that is, they employ statistical techniques to find a set of users, known as neighbors, who have a history of agreeing with the target user (i.e., they either rate different items similarly or tend to buy similar sets of items) [5].
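The neighbor-finding step described above can be sketched in Python as follows. This is only a minimal illustration, not the method of any particular system: the toy `ratings` dictionary, the function names, the use of cosine similarity over co-rated items, and the similarity-weighted average used for the prediction are all assumptions made for the example.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating}.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 5, "B": 3, "C": 4, "D": 5},
    "carol": {"A": 1, "B": 5, "D": 2},
}

def user_similarity(r, u, v):
    """Cosine similarity between users u and v over the items both rated."""
    common = set(r[u]) & set(r[v])
    if not common:
        return 0.0
    dot = sum(r[u][i] * r[v][i] for i in common)
    norm_u = sqrt(sum(r[u][i] ** 2 for i in common))
    norm_v = sqrt(sum(r[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def nearest_neighbors(r, target, k):
    """Return the k users most similar to the target user, best first."""
    scored = [(user_similarity(r, target, v), v) for v in r if v != target]
    scored.sort(reverse=True)
    return [v for _, v in scored[:k]]

def predict(r, target, item, k=2):
    """Similarity-weighted average of the neighbors' ratings for the item."""
    num = den = 0.0
    for v in nearest_neighbors(r, target, k):
        if item in r[v]:
            s = user_similarity(r, target, v)
            num += s * r[v][item]
            den += s
    return num / den if den else None

print(nearest_neighbors(ratings, "alice", 1))
print(predict(ratings, "alice", "D"))
```

Here "bob" agrees with "alice" on every co-rated item, so his rating for item "D" dominates the prediction, while "carol" contributes with a smaller weight.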

Model-based Collaborative Filtering algorithms (also known as Item-based approaches) provide item recommendations by first developing a model of user ratings. Algorithms in this category take a probabilistic approach and envision the Collaborative Filtering process as computing the expected value of a user prediction, given his or her ratings on other items [5]. Model-based approaches are developed using data mining and machine learning algorithms that find patterns in training data; in other words, training datasets are used to train a predefined model. Model-based approaches can be divided into several categories: clustering models, aspect models, and latent factor models [1].

A number of recommender systems use both memory-based and model-based methods. This makes the recommendations more effective, but it is also more complicated to implement [9]. Such systems are called “Hybrid Recommender Systems”; an example is the recommender system of Google Search.

2.2.2 Collaborative Filtering Process

Figure 1. Collaborative Filtering Process

Figure 1 shows the Collaborative Filtering process. Collaborative Filtering algorithms represent the entire $m \times n$ user-item data as a ratings matrix $A$. Each entry $a_{i,j}$ in $A$ represents the preference score (rating) of the $i$-th user on the $j$-th item. Each individual rating lies within a numerical scale, and it can also be 0, indicating that the user has not yet rated that item.
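As a small illustration of this representation, the following Python sketch builds an m × n ratings matrix from (user, item, rating) triples, using 0 for unrated items as described above. The triples and the helper name `build_matrix` are hypothetical and chosen only for the example.

```python
def build_matrix(triples, m, n):
    """Build an m x n ratings matrix; a[i][j] == 0 means 'not yet rated'."""
    a = [[0] * n for _ in range(m)]
    for user, item, rating in triples:
        a[user][item] = rating
    return a

# Hypothetical data: (user index, item index, rating on a 1-5 scale).
triples = [(0, 0, 5), (0, 2, 3), (1, 1, 4), (2, 0, 1), (2, 2, 5)]
A = build_matrix(triples, m=3, n=3)

for row in A:
    print(row)
```

Each row holds one user's ratings and each column holds the ratings received by one item, which is the layout the similarity measures in the next section operate on.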

Many algorithms can be used for Collaborative Filtering. In this report, I focus on Cosine Similarity, Pearson Correlation Similarity, and Singular Value Decomposition.

2.3 Collaborative Filtering Algorithms

2.3.1 Cosine Similarity

Cosine Similarity is used in Model-based algorithms for making recommendations [1]. In this approach, the similarities between different items (or users) in the dataset are calculated using cosine similarity, and these similarity values are then used to predict ratings for user-item pairs not present in the dataset. Two items $i$ and $j$ are treated as two vectors in the $m$-dimensional user space, and the similarity between them is measured by computing the cosine of the angle between these two vectors:

$$\mathrm{sim}(i, j) = \cos(\vec{i}, \vec{j}) = \frac{\vec{i} \cdot \vec{j}}{\|\vec{i}\| \, \|\vec{j}\|}$$

If the similarity is 1, the two vectors have the same orientation. If it is 0, the two vectors are orthogonal, meaning that items i and j are unrelated. If it is -1, the two vectors point in opposite directions, meaning that the items are dissimilar.
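The measure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the toy matrix `A` and the helper names are hypothetical, each item vector is a full column of the ratings matrix (so unrated entries count as 0), and a zero vector is given similarity 0 by convention.

```python
from math import sqrt

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_x * norm_y)

def item_vector(A, j):
    """Column j of the ratings matrix: item j in the user space."""
    return [row[j] for row in A]

# Hypothetical 3 users x 3 items ratings matrix (0 = not rated).
A = [
    [5, 0, 3],
    [0, 4, 0],
    [1, 0, 5],
]

print(cosine_similarity(item_vector(A, 0), item_vector(A, 2)))
print(cosine_similarity(item_vector(A, 0), item_vector(A, 1)))
```

Items 0 and 1 share no raters, so their vectors are orthogonal and the similarity is 0. Because ratings are non-negative, the similarity between two item columns never falls below 0 in this setting.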

2.3.2 Pearson Correlation Similarity

Pearson Correlation Similarity is used in Model-based algorithms for making recommendations [1]. In this case, the similarity between two items $i$ and $j$ is measured by computing the Pearson correlation $\mathrm{corr}_{i,j}$:

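$$\mathrm{sim}(i, j) = \mathrm{corr}_{i,j} = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_i)(R_{u,j} - \bar{R}_j)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_i)^2} \, \sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_j)^2}}$$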

where $R_{u,i}$ denotes the rating of user $u$ on item $i$, $\bar{R}_i$ is the average rating of the $i$-th item, and $U$ is the set of users who rated both items $i$ and $j$.

The value of sim(i, j) lies between -1 and 1. Values of exactly 0, -1, or 1 are very rare in practice; the value usually falls somewhere in between. The closer the value gets to zero, the greater the variation of the data points around the line of best fit.
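The following Python sketch computes this correlation for two item columns of a ratings matrix. It is an illustration under stated assumptions: the toy matrix and function names are hypothetical, 0 marks an unrated entry, the item average is taken over all users who rated that item, and the sums run over users who rated both items.

```python
from math import sqrt

def item_mean(A, k):
    """Average rating of item k over the users who rated it."""
    rated = [row[k] for row in A if row[k] != 0]
    return sum(rated) / len(rated) if rated else 0.0

def pearson_similarity(A, i, j):
    """Pearson correlation between items i and j (0 = not rated)."""
    mean_i, mean_j = item_mean(A, i), item_mean(A, j)
    # U: users who rated both items.
    U = [u for u in range(len(A)) if A[u][i] != 0 and A[u][j] != 0]
    num = sum((A[u][i] - mean_i) * (A[u][j] - mean_j) for u in U)
    den_i = sqrt(sum((A[u][i] - mean_i) ** 2 for u in U))
    den_j = sqrt(sum((A[u][j] - mean_j) ** 2 for u in U))
    if den_i == 0 or den_j == 0:
        return 0.0  # convention chosen here when there is no variation
    return num / (den_i * den_j)

# Hypothetical 4 users x 3 items ratings matrix.
A = [
    [5, 4, 1],
    [4, 5, 2],
    [1, 2, 5],
    [2, 1, 4],
]

print(pearson_similarity(A, 0, 1))  # items 0 and 1 are rated alike
print(pearson_similarity(A, 0, 2))  # items 0 and 2 are rated oppositely
```

Unlike cosine similarity on raw ratings, subtracting the item averages lets the measure become negative when users who like one item tend to dislike the other.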

2.3.3 Singular Value Decomposition (SVD)

Singular Value Decomposition is a matrix factorization technique commonly used for producing low-rank approximations. Given an $m \times n$ matrix $A$ with rank $r$, the singular value decomposition, $\mathrm{SVD}(A)$, is defined as

$$\mathrm{SVD}(A_{m \times n}) = U_{m \times m} \times S_{m \times n} \times V^{T}_{n \times n}$$

where $S$ is a diagonal matrix having only $r$ nonzero entries, which makes the effective dimensions of these three matrices $m \times r$, $r \times r$, and $r \times n$, respectively. $U$ and $V$ are two orthogonal matrices, and $S$ is called the singular matrix.

An important property of SVD is that it provides the best low-rank linear approximation of the original matrix $A$, called $A_k$. It is possible to retain only $k \ll r$ singular values by discarding the other entries. Berry et al. [10] and Deerwester et al. [11] pointed out that the low-rank approximation of the original space is better than the original space itself, because it filters out the small singular values that introduce “noise” in the customer-product relationship.

SVD produces a set of uncorrelated eigenvectors. Each customer and product is represented by its corresponding eigenvector. The process of dimensionality reduction may help customers who rated similar products to be mapped into the space spanned by the same eigenvectors.
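The low-rank idea can be illustrated for the simplest case, $k = 1$. The sketch below does not compute a full SVD: to stay dependency-free it uses power iteration (a swapped-in technique, not one named in this report) to approximate only the largest singular value and its two singular vectors, from which the rank-1 approximation $A_1$ follows. The matrix and function names are hypothetical, and the code assumes a nonzero matrix whose top singular vector is not orthogonal to the all-ones start vector. A practical system would call a linear algebra library instead.

```python
from math import sqrt

def matvec(M, x):
    """Matrix-vector product M x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def normalize(x):
    n = sqrt(sum(v * v for v in x))
    return [v / n for v in x]

def top_singular_triplet(A, iterations=200):
    """Approximate the largest singular value s and its vectors u, v."""
    At = transpose(A)
    v = normalize([1.0] * len(A[0]))
    for _ in range(iterations):
        v = normalize(matvec(At, matvec(A, v)))  # power step on A^T A
    Av = matvec(A, v)
    s = sqrt(sum(x * x for x in Av))
    u = [x / s for x in Av]
    return u, s, v

def rank1_approximation(A):
    """A_1 = s * u * v^T, the best rank-1 approximation of A."""
    u, s, v = top_singular_triplet(A)
    return [[s * ui * vj for vj in v] for ui in u]

# Hypothetical ratings matrix: users 0 and 1 have similar tastes.
A = [
    [5, 5, 1],
    [4, 4, 1],
    [1, 1, 5],
]

for row in rank1_approximation(A):
    print([round(x, 2) for x in row])
```

Keeping more than one singular value follows the same pattern: each retained triplet adds one more term $s \, u \, v^T$ to the approximation.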

3 EXPERIMENT

3.1 Recommendation Engine: RecDB

In this section, I run some experiments using RecDB, a recommendation engine built entirely inside PostgreSQL 9.2 by Mohamed Sarwat of the University of Minnesota. RecDB allows application developers to build recommendation applications quickly through a wide variety of built-in recommendation algorithms, such as user-user Collaborative Filtering, item-item Collaborative Filtering, and singular value decomposition. Applications powered by RecDB can produce online and flexible personalized recommendations for end users. The engine is free and available at http://www-users.cs.umn.edu/~sarwat/RecDB/.

RecDB has the following main features:

Usability: RecDB is an out-of-the-box tool for web and mobile developers to implement a myriad of recommendation applications. The system is easily used and configured so that a novice developer can define a variety of recommenders that fit the application's needs in a few lines of SQL.

Seamless Database Integration: Crafted inside the PostgreSQL database engine, RecDB is able to seamlessly integrate the recommendation functionality with traditional database operations, i.e., SELECT, PROJECT, JOIN, in the query pipeline to execute ad-hoc recommendation queries.

Scalability and Performance: The system optimizes incoming recommendation queries (written in SQL) and hence provides near real-time personalized recommendations to a high number of end users who have expressed their opinions over a large pool of items.

According to the author, RecDB is designed to run on a Unix operating system. At least 1GB of RAM is recommended for most queries, though more RAM may be desirable when working with very large data sets, especially when not working with apriori (materialized) recommenders.

RecDB supports 3 algorithms, selected through 5 parameter values:

ItemCosCF: Item-Item Collaborative Filtering using Cosine Similarity measure.

ItemPearCF: Item-Item Collaborative Filtering using Pearson Correlation Similarity measure.

UserCosCF: User-User Collaborative Filtering using Cosine Similarity measure.

UserPearCF: User-User Collaborative Filtering using Pearson Correlation Similarity measure.

SVD: Simon Funk Singular Value Decomposition.

To install and run RecDB, I first acquired working knowledge of Linux and PostgreSQL.

The author provides two sample databases with this tool: movie data from MovieLens and a geography database. Because time was limited, I ran my experiment on the MovieLens database published by the author. The following steps demonstrate RecDB.

3.2 Experiment

Step 1: Start the database server by running the following command in a terminal from the PostgreSQL folder:

perl scripts/pgbackend.pl

If the server is available, the terminal looks like the following picture:

Figure 2. Turn on database server

Step 2: Create and open a new database named “movielensdb” by running the following command in a new terminal:

perl scripts/pgfrontend.pl movielensdb

The address of the host server running the PostgreSQL backend is localhost (the default).

Figure 3. Create and run database “movielensdb”

Step 3: Import the prepared database into movielensdb:

\i initmovielens1mdatabase.sql;

When the import succeeds, we have:

Figure 4. Import initmovielens1mdatabase.sql

Step 4: Check the list of relations in the database after importing.

We have 6 relations in movielensdb (three tables and their three sequences): ml_items, ml_items_systemid_seq, ml_ratings, ml_ratings_ratingid_seq, ml_users, ml_users_systemid_seq

Figure 5. Check list of relations

Step 5: Create Recommenders:

CREATE RECOMMENDER MovieRec ON ml_ratings
USERS FROM userid
ITEMS FROM itemid
EVENTS FROM ratingval
USING ItemCosCF;

In this step, I create the recommender “MovieRec” on the relation “ml_ratings”. Here ml_ratings(userid, itemid, ratingval) represents the ratings table of a movie recommendation application: users are taken from “userid”, items from “itemid”, and ratings from “ratingval”, and the recommender uses Item-Item Collaborative Filtering with the Cosine Similarity measure. Changing the parameter “ItemCosCF” to another value selects another algorithm; RecDB supports 5 values: ItemCosCF, ItemPearCF, UserCosCF, UserPearCF, and SVD. Next, I recommend the top-10 movies to user 1, based on the ratings predicted by the Item-Item Collaborative Filtering algorithm (applying the cosine similarity measure):

SELECT * FROM ml_ratings R
RECOMMEND R.itemid TO R.userid ON R.ratingval USING ItemCosCF
WHERE R.userid = 1
ORDER BY R.ratingval DESC
LIMIT 10;

This is the result:

Figure 6. Top-10 movies recommendation based on the rating predicted using Item-Item Collaborative Filtering
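Conceptually, a top-N recommendation query ranks the items a user has not yet rated by their predicted rating. The following Python sketch illustrates that idea with item-item cosine similarity and a similarity-weighted average of the user's own ratings. It is not RecDB's implementation: the toy matrix, the function names, and the prediction formula are assumptions made for the example.

```python
from math import sqrt

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = sqrt(sum(a * a for a in x))
    ny = sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def predict(A, u, j):
    """Predict user u's rating of item j from u's ratings of other items,
    weighting each rating by the similarity between that item and item j."""
    col_j = [row[j] for row in A]
    num = den = 0.0
    for k, r in enumerate(A[u]):
        if k == j or r == 0:
            continue  # skip the target item and items u has not rated
        s = cosine(col_j, [row[k] for row in A])
        num += s * r
        den += abs(s)
    return num / den if den else 0.0

def top_n(A, u, n):
    """Indices of u's unrated items, highest predicted rating first."""
    scored = [(predict(A, u, j), j) for j, r in enumerate(A[u]) if r == 0]
    scored.sort(reverse=True)  # descending order, like ORDER BY ... DESC
    return [j for _, j in scored[:n]]

# Hypothetical 4 users x 4 items matrix; user 0 has not rated items 2 and 3.
A = [
    [5, 1, 0, 0],
    [4, 1, 5, 1],
    [1, 5, 1, 4],
    [5, 2, 4, 1],
]

print(top_n(A, 0, 2))
```

In this toy data, item 2 is rated much like item 0 (which user 0 likes) and item 3 is rated much like item 1 (which user 0 dislikes), so item 2 is ranked first.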

The following query recommends the top 5 action movies to user 1:

SELECT r.itemid, i.name, i.genre, r.ratingval
FROM ml_ratings r, ml_items i
RECOMMEND r.itemid
TO r.userid
ON r.ratingval
USING itemcoscf
WHERE r.userid = 1 AND r.itemid = i.itemid AND i.genre ILIKE '%action%'
ORDER BY ratingval DESC
LIMIT 5;

Figure 7. Recommends the top 5 action movies to user 1

As Figure 7 shows, the query returns the five action movies with the highest predicted rating values. That is, the system can recommend five action movies to user 1.

For comparison, I generate the top 5 action movies for user 2 and user 3. For user 2, I use the query:

SELECT r.itemid, i.name, i.genre, r.ratingval
FROM ml_ratings r, ml_items i
RECOMMEND r.itemid
TO r.userid
ON r.ratingval
USING itemcoscf
WHERE r.userid = 2 AND r.itemid = i.itemid AND i.genre ILIKE '%action%'
ORDER BY ratingval DESC
LIMIT 5;

And I have the result:

Figure 8. Recommends the top 5 action movies to user 2

For user 3, I use the query:

SELECT r.itemid, i.name, i.genre, r.ratingval
FROM ml_ratings r, ml_items i
RECOMMEND r.itemid
TO r.userid
ON r.ratingval
USING itemcoscf
WHERE r.userid = 3 AND r.itemid = i.itemid AND i.genre ILIKE '%action%'
ORDER BY ratingval DESC
LIMIT 5;

The result is:

Figure 9. Recommends the top 5 action movies to user 3

Figures 7, 8, and 9 show clearly that RecDB makes different recommendations for the three users. User 1 is recommended five movies: Master Ninja I (1984), Mirage (1995), Heaven's Burning (1997), Big Trees, The (1952), Tough and Deadly (1995). User 2 is recommended: Master Ninja I (1984), Johnny 100 Pesos (1993), African Queen, The (1951), Diva (1981), Godfather, The (1972). User 3 is recommended: Master Ninja I (1984), Target (1995), Sea Wolves, The (1980), Born American (1986), Johnny 100 Pesos (1993). This is expected, because different users are interested in different movies. However, Master Ninja I (1984) is recommended to all three users, which means this film has a high predicted rating for each of them.
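The overlap between the three lists can be checked mechanically. The short Python sketch below uses the titles exactly as reported for Figures 7, 8, and 9; the variable names are only illustrative.

```python
# Titles as reported for users 1, 2, and 3.
user1 = ["Master Ninja I (1984)", "Mirage (1995)", "Heaven's Burning (1997)",
         "Big Trees, The (1952)", "Tough and Deadly (1995)"]
user2 = ["Master Ninja I (1984)", "Johnny 100 Pesos (1993)",
         "African Queen, The (1951)", "Diva (1981)", "Godfather, The (1972)"]
user3 = ["Master Ninja I (1984)", "Target (1995)", "Sea Wolves, The (1980)",
         "Born American (1986)", "Johnny 100 Pesos (1993)"]

# Movies recommended to every user.
common_to_all = set(user1) & set(user2) & set(user3)

# Movies shared by users 2 and 3 only.
shared_by_2_and_3 = (set(user2) & set(user3)) - common_to_all

print(common_to_all)
print(shared_by_2_and_3)
```

Besides the one movie common to all three lists, users 2 and 3 also share Johnny 100 Pesos (1993), while user 1 shares no other title with either of them.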

4 CONCLUSION AND FUTURE WORK

In this report, I introduced recommender systems, recommender systems based on Collaborative Filtering techniques, and related techniques and tools. Collaborative Filtering algorithms (Cosine Similarity, Pearson Correlation Similarity, and Singular Value Decomposition) were described. Moreover, the recommender system tool RecDB was introduced. Some experiments using RecDB on “movielensdb” (movie data from MovieLens) were described and their results were shown.

In the future, I will study RecDB more deeply to understand how it uses Collaborative Filtering to make recommendations. I may then crawl data from facebook.com and use RecDB to make recommendations on it.

5 REFERENCES

[1] Joseph A. Konstan, John Riedl, "Recommender systems: from algorithms to user experience," User Model. User-Adapt. Interact., vol. 22, no. 1-2, pp. 101-123, 2012.

[2] Michael D. Ekstrand, John Riedl, Joseph A. Konstan, "Collaborative Filtering Recommender Systems," Foundations and Trends in Human-Computer Interaction, vol. 4, no. 2, pp. 175-243, 2011.

[3] Jiliang Tang, Jie Tang, and Huan Liu, "Recommendation in Social Media - Recent Advances and New Frontier," A tutorial at KDD, pp. 24-27, 2014.

[4] Francesco Ricci and Lior Rokach and Bracha Shapira, "Introduction to Recommender Systems Handbook," in Recommender Systems Handbook, 2011, pp. 1-35.


[5] Badrul M. Sarwar, George Karypis, Joseph A. Konstan, John Riedl, "Item-based collaborative filtering recommendation algorithms," WWW, pp. 285-295, 2001.

[6] Hao Ma, Irwin King, Michael R. Lyu, "Effective missing data prediction for collaborative filtering," SIGIR, pp. 39-46, 2007.

[7] Greg Linden, Brent Smith, Jeremy York, "Amazon.com Recommendations: Item-to-Item Collaborative Filtering," IEEE Internet Computing, vol. 7, no. 1, pp. 76-80, 2003.

[8] Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, John Riedl, "GroupLens: An Open Architecture for Collaborative Filtering of Netnews," CSCW, pp. 175-186, 1994.

[9] Mustansar Ali Ghazanfar, Adam Prügel-Bennett, Sándor Szedmák, "Kernel-Mapping Recommender system algorithms," Inf. Sci., pp. 81-104, 2012.

[10] Berry, M. W., Dumais, S. T., and O’Brien, G. W., "Using Linear Algebra for Intelligent Information Retrieval," SIAM Review, vol. 37, no. 4, pp. 573-595, 1995.

[11] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, Richard A. Harshman, "Indexing by Latent Semantic Analysis," JASIS, vol. 41, no. 6, pp. 391-407, 1990.


Comment: ……………………………………….

Mark: ……. In words: …………

Hanoi, …./…../2014

Lecturer