User written movie reviews carry substantial amounts of movie related features such as description of location, time period, genres, characters, etc. Using natural language processing and topic modeling based techniques, it is possible to extract features from movie reviews and find movies with similar features.
Efficient Features for Movie Recommendation Systems
Project presentation
Suvir Bhargav
Outline
● Motivation and Why movie reviews
● Problem statement
● How? or the overall system
● Text preprocessing approaches
● Postprocessing: movie topics from a reviews corpus
● Similarity
● Experimental setup and results
Thanks to Sean Lind, source: http://www.silveroakcasino.com/blog/posts/netflix/what-to-watch-on-netflix.html
Motivation
● movie genres are not enough.
● classify movies
○ keywords
○ moods
○ imdb ratings
○ micro genres
micro genres
source: http://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/
Why movie reviews?
Source: a sample user written movie review from imdb
Problem statement
● Feature extraction from user reviews of movies
● Use extracted features to find similar movies.
The overall system
Movie reviews corpus
● preprocessing
○ tokenization, stopword removal, lemmatization.
● post processing
○ topic modeling: movie topics from a reviews corpus
● similarity measure
○ return movies with similar topic distributions
Simple information extraction
Text preprocessing
Figure credit to nltk book.
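The preprocessing steps named in the overall system (tokenization, stopword removal, lemmatization) can be sketched in plain Python. This is only an illustration: the stopword list is a tiny hypothetical one, and `crude_lemma` is a naive plural-stripping stand-in for a real lemmatizer such as the ones nltk provides, not the method used in the project.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much larger).
STOPWORDS = {"a", "an", "the", "and", "is", "in", "of", "to", "it", "this", "was"}

def crude_lemma(token):
    """Naive stand-in for lemmatization: strip a plural 's' (hypothetical helper)."""
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def preprocess(review):
    """Lowercase, tokenize on letter runs, drop stopwords, normalize word forms."""
    tokens = re.findall(r"[a-z]+", review.lower())
    return [crude_lemma(t) for t in tokens if t not in STOPWORDS]

tokens = preprocess("The detective chases killers in rainy streets.")
```

Each review is reduced to a list of normalized tokens, which is the input that the later topic modeling step would consume.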
Post processing
Document representation: Vector Space Model (VSM)
Picture credit: pyevolve
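As a rough illustration of the vector space model, each preprocessed review can be turned into a vector of term counts over a shared vocabulary. The helper names `build_vocabulary` and `to_count_vector` and the two toy documents are assumptions made for this sketch.

```python
from collections import Counter

def build_vocabulary(docs):
    """Collect every distinct token across documents, in sorted order."""
    return sorted({tok for doc in docs for tok in doc})

def to_count_vector(doc, vocab):
    """Represent one document as term counts aligned with the vocabulary."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]

# Two toy tokenized reviews (made-up data).
docs = [["dark", "crime", "detective", "crime"], ["romance", "paris", "dark"]]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(d, vocab) for d in docs]
```

Every document becomes a point in a space with one dimension per vocabulary term, so documents of different lengths can be compared directly.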
Post processing: generative model
Source: David Blei’s slide
Post processing: LDA
For each document in the collection, the words can be generated in a two-stage process:
1) Randomly choose a distribution over topics.
2) For each word in the document:
a) Randomly choose a topic from the distribution over topics in step 1.
b) Randomly choose a word from the corresponding distribution over the vocabulary.
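The two-stage generative process can be simulated with numpy as a minimal sketch. The function name `generate_document`, the symmetric Dirichlet prior `alpha`, and the toy sizes (2 topics, a vocabulary of 6 word ids) are assumptions for illustration. This only samples from the model; it does not perform inference.

```python
import numpy as np

def generate_document(topic_word, alpha, n_words, rng):
    """Sample one document from the LDA generative story."""
    n_topics, vocab_size = topic_word.shape
    # Step 1: randomly choose this document's distribution over topics.
    theta = rng.dirichlet(alpha * np.ones(n_topics))
    topics, words = [], []
    for _ in range(n_words):
        # Step 2a: choose a topic from the document's topic distribution.
        z = int(rng.choice(n_topics, p=theta))
        # Step 2b: choose a word from that topic's distribution over the vocabulary.
        w = int(rng.choice(vocab_size, p=topic_word[z]))
        topics.append(z)
        words.append(w)
    return theta, topics, words

rng = np.random.default_rng(0)
# Each row is one topic's distribution over the vocabulary (toy values).
topic_word = rng.dirichlet(0.5 * np.ones(6), size=2)
theta, topics, words = generate_document(topic_word, alpha=0.5, n_words=8, rng=rng)
```

Because every word draws its own topic, a single document naturally mixes several topics in the proportions given by `theta`.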
Documents exhibit multiple topics
Movie topics from a reviews corpus
Similarity Measure
● Cosine Similarity
● KL divergence
● Hellinger distance
Cosine Similarity
Similarity Measure
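Cosine similarity between two topic vectors is their dot product divided by the product of their Euclidean norms. A minimal numpy sketch follows; the function name and the sample vectors are illustrative assumptions.

```python
import numpy as np

def cosine_similarity(p, q):
    """Cosine of the angle between two non-zero vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

same = cosine_similarity([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

For non-negative vectors such as topic distributions the value lies between 0 (no shared topics) and 1 (same direction).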
Hellinger Distance
Similarity Measure
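Hellinger distance and KL divergence both compare probability distributions, which suits topic distributions. The sketch below uses the standard definitions; the function names and the small smoothing constant `eps` in the KL computation are assumptions added to avoid taking the log of zero.

```python
import numpy as np

def hellinger_distance(p, q):
    """H(p, q) = (1 / sqrt(2)) * ||sqrt(p) - sqrt(q)||_2, ranging from 0 to 1."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), with eps smoothing (an assumption)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
h = hellinger_distance(p, q)
kl = kl_divergence(p, q)
```

Hellinger distance is symmetric and bounded, while KL divergence is not symmetric in general, so KL(p || q) and KL(q || p) can rank movies differently.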
The overall system: implementation
Movie reviews corpus
● preprocessing
○ nltk and gensim’s simple preprocessing.
● post processing
○ gensim python wrapper to MALLET
○ index topic distribution of query movies, q, and 1k movies corpus, C.
● similarity measure
○ python numpy implementation
○ apply distance metric on indexed q and C.
○ sort and pick top 5 movies.
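The final retrieval step (apply a distance metric between q and C, sort, pick the top 5) might look like the numpy sketch below. It assumes the topic distributions are already available as rows of an array and uses Hellinger distance as one possible metric. The function name `top_k_similar` and the random stand-in corpus are hypothetical and not taken from the project code.

```python
import numpy as np

def top_k_similar(query, corpus, k=5):
    """Return (row index, distance) for the k corpus rows closest to the query."""
    query = np.asarray(query, dtype=float)
    corpus = np.asarray(corpus, dtype=float)
    # Hellinger distance between the query and every corpus row at once.
    dists = np.sqrt(0.5 * np.sum((np.sqrt(corpus) - np.sqrt(query)) ** 2, axis=1))
    order = np.argsort(dists)[:k]  # smallest distance first
    return [(int(i), float(dists[i])) for i in order]

rng = np.random.default_rng(1)
# Stand-in for C: 1000 movies, each a distribution over 10 topics (random data).
C = rng.dirichlet(np.ones(10), size=1000)
q = C[42]  # reuse one corpus movie as the query for this demo
results = top_k_similar(q, C, k=5)
```

Since the query here is itself a corpus row, it comes back first with distance zero; a real system would likely exclude the query movie from its own results.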
Experimental setup
Movie reviews corpus of 1k movies
reviews data source: imdb
Evaluation criteria
Experimental setup
Conclusion
● Movie topics as efficient features for RS
○ represents movies by underlying semantic patterns
○ useful for capturing movie genre and mood.
○ but not so well with plot.
○ user written movie reviews are useful movie meta-data.
● The developed prototype
○ easy to add more movie meta-data
○ python allows scalability.
○ Topics as an explanation needs further tuning.
Future directions
● Movie review preprocessing
○ bigram, trigrams.
○ create multi-word movie keywords or language construction
● Building complex topic models
○ Hierarchical LDA
○ author-topic model
■ include authorship information.
■ similarity between authors
Questions?
Thank You
Image src: http://www.brinvy.biz/177215/batman-catching-a-ride-on-supermans-back-funny-hd-wallpaper-x.html
Extra slides
List of extra slides and notes
● Original LDA paper
● introduction to probabilistic topic modeling
● A. Huang’s Similarity measures for text document clustering
● Another good LDA description
● Integrating out multinomial parameters in LDA
● language construction in micro genres
LDA