A Content-Based Approach to Collaborative Filtering
Brandon Douthit-Wood
CS 470 – Final Presentation
Collaborative Filtering
• Method of automating word-of-mouth
• Large groups of users collaborate by rating products, services, news articles, etc.
• Analyze ratings data of the group to produce recommendations for individual users
– Find users with similar tastes
Problems with Collaborative Filtering Methods
• Performance
– Prohibitively large dataset
• Scalability
– Will the solution scale to millions of users on the Internet?
• Sparsity of data
– User who has rated few items
– Item with few ratings
Problems with Collaborative Filtering Methods
• Cannot compare users that have no common ratings
Movie            User 1   User 2
Billy Madison      4        –
Happy Gilmore      5        –
Mr. Deeds          –        4
50 First Dates     –        5
Big Daddy          –        4

(Ratings on a scale of 1-5)
A Content-Based Approach
• Build a feature list for each user based on content of items rated
• Compare users’ features to make recommendations
• Now we can find similarity between users with no common ratings
Data Source
• EachMovie Project– Compaq Systems Research Center
– Over 18 months collected 2,811,983 ratings for 1,628 movies from 72,916 users
– Ratings given on 1-5 scale
– Dataset split into 75% training, 25% testing
• Internet Movie Database (IMDb)– Huge database of movie information
• Actors, director, genre, plot description, etc.
Creating the Feature List
• Retrieve content information for each movie from IMDb dataset – create “bag of words”
• Throw out common words (e.g. the, and, but)
• Calculate frequency of remaining words, create movie's feature list
– Frequencies weighted based on total number of terms
Goldeneye
satellite 2 destroy 2
xenia 3 london 2
thriller 2 villain 2
simon 4 revenge 2
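The feature-list steps above can be sketched as follows. This is a minimal illustration, not the original implementation: the stop-word list and the plot text are made up, and the slides only say frequencies are "weighted based on total number of terms", which is taken here to mean dividing by the term count.

```python
from collections import Counter

# Illustrative stop-word list; the original only says "common words" are removed.
STOP_WORDS = {"the", "and", "but", "a", "of", "to", "in", "with"}

def feature_list(text):
    """Build a movie's feature list: bag of words with common words
    removed, each frequency weighted by the total number of remaining terms."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

# Toy plot fragment, not actual IMDb text:
features = feature_list("villain plans to destroy london with a satellite")
```

Applied to real IMDb content (actors, director, genre, plot description), the resulting dictionaries would look like the Goldeneye example above.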
Comparing Users
• Each user has positive and negative feature lists
– Combine feature lists of movies they have rated
• Compare users' feature lists using Pearson Correlation Coefficient
• Users can be compared with no common ratings
• Able to recommend items with few ratings
• Users only need to rate a few items to receive recommendations
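One way the Pearson comparison between two feature lists could look. Treating a word missing from one user's list as weight 0 (i.e. correlating over the union of the two vocabularies) is an assumption; the slides don't specify how non-overlapping words are handled.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two feature lists (word -> weight dicts),
    taken over the union of their vocabularies; missing words count as 0."""
    keys = set(a) | set(b)
    xs = [a.get(k, 0.0) for k in keys]
    ys = [b.get(k, 0.0) for k in keys]
    n = len(keys)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0
```

Because the correlation is over word features rather than shared item ratings, two users who have rated completely disjoint sets of movies can still be compared.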
Methods
• Three methods attempted to improve performance:
– Clustering of users
– Random groups of users
– Compare users directly to items
User Clustering
• Simple algorithm, starting with the first user:
– Compare to existing clusters first
• If similarity is high, merge user into cluster
– Otherwise, compare to each remaining user, stopping if correlation is above threshold
– Once a similar user is found, create a new cluster from the two users
• Cluster has combined feature list of all its users
• Not as efficient as possible - O(n²)
User Clustering
• Once clusters are formed, we can predict ratings for each item
– For each user, find their 10 nearest neighbors
– Predicted rating is the average rating of item from these neighbors
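The neighbor-averaging step above could be sketched as follows. All names (`predict_rating`, the shape of `ratings`, the pluggable `sim` function) are hypothetical, not from the original implementation.

```python
def predict_rating(target, item, users, ratings, sim, k=10):
    """Predict `target`'s rating of `item` as the mean rating given by the
    k most similar users who have rated it.

    `ratings[u]` maps item -> rating; `sim(a, b)` is any user-similarity
    function, e.g. Pearson correlation over the users' feature lists."""
    raters = [u for u in users if u != target and item in ratings[u]]
    nearest = sorted(raters, key=lambda u: sim(target, u), reverse=True)[:k]
    if not nearest:
        return None  # nobody in the group has rated this item
    return sum(ratings[u][item] for u in nearest) / len(nearest)
```

The same routine serves both the clustered and the random-group variants; only the pool of candidate users (`users`) changes.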
Selecting a Random Group
• Randomly select 5000 users as a (hopefully) representative sample
• As before, find a user's 10 nearest neighbors from the random group
– Predicted rating is the average rating of item from these neighbors
• Much less work than clustering– How much accuracy (if any) will be lost?
Comparing Users to Items
• No collaborative filtering involved
• Compare the positive and negative feature lists of user to feature list of item
– Make prediction based on which feature list has higher correlation with item
• Pretty quick and easy to do– How accurate will this be?
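The direct user-to-item comparison could be as small as the sketch below. `sim` stands in for whatever correlation measure is used (the slides say correlation with the item's feature list); the function name and the threshold-free "whichever is higher" rule are taken straight from the bullet above.

```python
def predict_positive(pos_features, neg_features, item_features, sim):
    """Content-only prediction: predict 'like' when the user's positive
    feature list has higher correlation with the item's feature list
    than the negative feature list does. No other users are consulted."""
    return sim(pos_features, item_features) > sim(neg_features, item_features)
```

With no neighbor search at all, this is the cheapest of the three methods, which is why the open question is how much accuracy it gives up.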
Analyzing Predictions
• Collected 3 metrics to evaluate predictions
– Accuracy: all items predicted correctly
– Precision: positive items predicted correctly
– Recall: unseen positive items predicted correctly
• Precision and recall have inverse relationship
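Assuming the usual binary definitions (the slides only gloss each metric), the three numbers can be computed as below; `predicted` and `actual` are parallel lists of like/dislike booleans over the test items.

```python
def metrics(predicted, actual):
    """Accuracy, precision, and recall for binary like/dislike predictions.
    Recall here is over the positive (liked) test items, matching the
    slides' 'unseen positive items predicted correctly'."""
    tp = sum(p and a for p, a in zip(predicted, actual))        # true positives
    fp = sum(p and not a for p, a in zip(predicted, actual))    # false positives
    fn = sum(a and not p for p, a in zip(predicted, actual))    # false negatives
    correct = sum(p == a for p, a in zip(predicted, actual))
    accuracy = correct / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```

The inverse relationship mentioned above falls out of these formulas: predicting "like" more liberally raises tp and fn-coverage (recall) while adding false positives that drag precision down.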
Results
(Bar chart: percentage by prediction method)

Prediction Method   Accuracy   Precision   Recall
Cluster Group        84.868     93.974     87.424
Random Group         61.273     84.392     64.159
Content-based        58.616     82.941     60.832
Conclusions
• Large gain from clustering users
– Is the extra work worth it?
– Depends on the application
• Purely content-based predictions worked pretty well
– Simple, fast solution
• Random group prediction also performed reasonably well
• Problems solved by content-based analysis:
– Sparsity of data
– Performance
– Scalability