A Content-Based Approach to Collaborative Filtering
Brandon Douthit-Wood
CS 470 – Final Presentation
Collaborative Filtering
• Method of automating word-of-mouth
• Large groups of users collaborate by rating products, services, news articles, etc.
• Analyze ratings data of the group to produce recommendations for individual users
– Find users with similar tastes
Problems with Collaborative Filtering Methods
• Performance
– Prohibitively large dataset
• Scalability
– Will the solution scale to millions of users on the Internet?
• Sparsity of data
– User who has rated few items
– Item with few ratings
Problems with Collaborative Filtering Methods
• Cannot compare users that have no common ratings
Movie            User 1   User 2
Billy Madison      4        –
Happy Gilmore      5        –
Mr. Deeds          –        4
50 First Dates     –        5
Big Daddy          –        4

(Ratings on a scale of 1-5)
A Content-Based Approach
• Build a feature list for each user based on content of items rated
• Compare users’ features to make recommendations
• Now we can find similarity between users with no common ratings
Data Source
• EachMovie Project– Compaq Systems Research Center
– Over 18 months collected 2,811,983 ratings for 1,628 movies from 72,916 users
– Ratings given on 1-5 scale
– Dataset split into 75% training, 25% testing
• Internet Movie Database (IMDb)– Huge database of movie information
• Actors, director, genre, plot description, etc.
Creating the Feature List
• Retrieve content information for each movie from IMDb dataset – create “bag of words”
• Throw out common words (e.g. the, and, but)
• Calculate frequency of remaining words, create movie's feature list
– Frequencies weighted based on total number of terms
Goldeneye
satellite 2 destroy 2
xenia 3 london 2
thriller 2 villain 2
simon 4 revenge 2
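The feature-list steps above can be sketched as follows. This is a minimal illustration, not the original implementation: the stop-word list and the plot text are made up, and the slides only say frequencies are "weighted based on total number of terms", which is taken here to mean dividing by the term count.

```python
from collections import Counter

# Illustrative stop-word list; the original only says "common words" are removed.
STOP_WORDS = {"the", "and", "but", "a", "of", "to", "in", "with"}

def feature_list(text):
    """Build a movie's feature list: bag of words with common words
    removed, each frequency weighted by the total number of remaining terms."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

# Toy plot fragment, not actual IMDb text:
features = feature_list("villain plans to destroy london with a satellite")
```

Applied to real IMDb content (actors, director, genre, plot description), the resulting dictionaries would look like the Goldeneye example above.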
Comparing Users
• Each user has positive and negative feature lists
– Combine feature lists of movies they have rated
• Compare users' feature lists using Pearson Correlation Coefficient
• Users can be compared with no common ratings
• Able to recommend items with few ratings
• Users only need to rate a few items to receive recommendations
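One way the Pearson comparison between two feature lists could look. Treating a word missing from one user's list as weight 0 (i.e. correlating over the union of the two vocabularies) is an assumption; the slides don't specify how non-overlapping words are handled.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two feature lists (word -> weight dicts),
    taken over the union of their vocabularies; missing words count as 0."""
    keys = set(a) | set(b)
    xs = [a.get(k, 0.0) for k in keys]
    ys = [b.get(k, 0.0) for k in keys]
    n = len(keys)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0
```

Because the correlation is over word features rather than shared item ratings, two users who have rated completely disjoint sets of movies can still be compared.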
Methods
• Three methods attempted to improve performance:
– Clustering of users
– Random groups of users
– Compare users directly to items
User Clustering
• Simple algorithm, starting with the first user:
– Compare to existing clusters first
• If similarity is high, merge user into cluster
– Otherwise, compare to each remaining user, stopping if correlation is above threshold
– Once a similar user is found, create a new cluster from the two users
• Cluster has combined feature list of all its users
• Not as efficient as possible - O(n²)
User Clustering
• Once clusters are formed, we can predict ratings for each item
– For each user, find their 10 nearest neighbors
– Predicted rating is the average rating of item from these neighbors
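The neighbor-averaging step above could be sketched as follows. All names (`predict_rating`, the shape of `ratings`, the pluggable `sim` function) are hypothetical, not from the original implementation.

```python
def predict_rating(target, item, users, ratings, sim, k=10):
    """Predict `target`'s rating of `item` as the mean rating given by the
    k most similar users who have rated it.

    `ratings[u]` maps item -> rating; `sim(a, b)` is any user-similarity
    function, e.g. Pearson correlation over the users' feature lists."""
    raters = [u for u in users if u != target and item in ratings[u]]
    nearest = sorted(raters, key=lambda u: sim(target, u), reverse=True)[:k]
    if not nearest:
        return None  # nobody in the group has rated this item
    return sum(ratings[u][item] for u in nearest) / len(nearest)
```

The same routine serves both the clustered and the random-group variants; only the pool of candidate users (`users`) changes.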
Selecting a Random Group
• Randomly select 5000 users as a (hopefully) representative sample
• As before, find a user's 10 nearest neighbors from the random group
– Predicted rating is the average rating of item from these neighbors
• Much less work than clustering– How much accuracy (if any) will be lost?
Comparing Users to Items
• No collaborative filtering involved
• Compare the positive and negative feature lists of user to feature list of item
– Make prediction based on which feature list has higher correlation with item
• Pretty quick and easy to do– How accurate will this be?
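The direct user-to-item comparison could be as small as the sketch below. `sim` stands in for whatever correlation measure is used (the slides say correlation with the item's feature list); the function name and the threshold-free "whichever is higher" rule are taken straight from the bullet above.

```python
def predict_positive(pos_features, neg_features, item_features, sim):
    """Content-only prediction: predict 'like' when the user's positive
    feature list has higher correlation with the item's feature list
    than the negative feature list does. No other users are consulted."""
    return sim(pos_features, item_features) > sim(neg_features, item_features)
```

With no neighbor search at all, this is the cheapest of the three methods, which is why the open question is how much accuracy it gives up.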
Analyzing Predictions
• Collected 3 metrics to evaluate predictions
– Accuracy: all items predicted correctly
– Precision: positive items predicted correctly
– Recall: unseen positive items predicted correctly
• Precision and recall have inverse relationship
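Assuming the usual binary definitions (the slides only gloss each metric), the three numbers can be computed as below; `predicted` and `actual` are parallel lists of like/dislike booleans over the test items.

```python
def metrics(predicted, actual):
    """Accuracy, precision, and recall for binary like/dislike predictions.
    Recall here is over the positive (liked) test items, matching the
    slides' 'unseen positive items predicted correctly'."""
    tp = sum(p and a for p, a in zip(predicted, actual))        # true positives
    fp = sum(p and not a for p, a in zip(predicted, actual))    # false positives
    fn = sum(a and not p for p, a in zip(predicted, actual))    # false negatives
    correct = sum(p == a for p, a in zip(predicted, actual))
    accuracy = correct / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```

The inverse relationship mentioned above falls out of these formulas: predicting "like" more liberally raises tp and fn-coverage (recall) while adding false positives that drag precision down.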
Results
(Bar chart: percentage by prediction method)

Prediction Method   Accuracy   Precision   Recall
Cluster Group        84.868     93.974     87.424
Random Group         61.273     84.392     64.159
Content-based        58.616     82.941     60.832
Conclusions
• Large gain from clustering users
– Is the extra work worth it?
– Depends on the application
• Purely content-based predictions worked pretty well
– Simple, fast solution
• Random group prediction also performed reasonably well
• Problems solved by content-based analysis:
– Sparsity of data
– Performance
– Scalability