Movie topics: Efficient features for movie recommendation systems


User-written movie reviews carry a substantial amount of movie-related information, such as descriptions of location, time period, genre, and characters. Using natural language processing and topic-modeling techniques, it is possible to extract features from movie reviews and find movies with similar features.

Efficient Features for Movie Recommendation Systems

Project presentation

Suvir Bhargav

Outline

● Motivation and why movie reviews
● Problem statement
● How? (the overall system)
● Text preprocessing approaches
● Postprocessing: movie topics from a reviews corpus
● Similarity
● Experimental setup and results

Thanks to Sean Lind, source: http://www.silveroakcasino.com/blog/posts/netflix/what-to-watch-on-netflix.html

Motivation

● Movie genres are not enough.
● Classify movies by
  ○ keywords
  ○ moods
  ○ IMDb ratings
  ○ micro genres

micro genres

source: http://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/

Why movie reviews?

Source: a sample user-written movie review from IMDb

Problem statement

● Feature extraction from user reviews of movies

● Use extracted features to find similar movies.

The overall system

Movie reviews corpus
● Preprocessing
  ○ tokenization, stopword removal, lemmatization
● Postprocessing
  ○ topic modeling: movie topics from a reviews corpus
● Similarity measure
  ○ return movies with similar topic distributions

Simple information extraction

Text preprocessing

Figure credit to nltk book.
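The preprocessing steps named in this deck (tokenization, stopword removal, lemmatization) can be sketched as below. This is a minimal illustration using only the Python standard library, not the NLTK/gensim code used in the project: the stopword list is a tiny hand-made assumption, and `crude_lemma` is a hypothetical stand-in that only strips a plural "s", where a real lemmatizer would use a dictionary.

```python
import re

# Tiny illustrative stopword list (an assumption; NLTK ships a much fuller one).
STOPWORDS = {"a", "an", "and", "the", "is", "was", "of", "in", "it", "this", "to", "with"}

def crude_lemma(token):
    """Very rough stand-in for lemmatization: strip a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(review):
    """Lowercase, tokenize on runs of letters, drop stopwords, normalize tokens."""
    tokens = re.findall(r"[a-z]+", review.lower())
    return [crude_lemma(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The detectives chase a killer in the rainy streets of London."))
```

The output of `preprocess` is a list of normalized tokens per review, which is the form the later topic-modeling step expects.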

Post processing

Document representation: Vector Space Model (VSM)

Picture credit: pyevolve
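In the vector space model, each document becomes a vector of term counts over a shared vocabulary. The sketch below shows one simple way to build such vectors, assuming the documents are already tokenized; the function names and the two toy documents are illustrative and not taken from the project.

```python
from collections import Counter

def build_vocabulary(docs):
    """Sorted list of every distinct token across the tokenized documents."""
    return sorted({tok for doc in docs for tok in doc})

def to_vector(doc, vocabulary):
    """Term-count vector for one document, aligned with the vocabulary order."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]

# Two toy "reviews", already tokenized (hypothetical data).
docs = [
    ["space", "alien", "ship", "alien"],
    ["love", "paris", "love"],
]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Every vector has the same length as the vocabulary, so any two documents can be compared position by position.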

Post processing: generative model

Source: David Blei’s slides

Post processing: LDA

For each document in the collection, the words are generated in a two-stage process:

1) Randomly choose a distribution over topics.

2) For each word in the document:

a) Randomly choose a topic from the distribution over topics chosen in step 1.

b) Randomly choose a word from the corresponding topic's distribution over the vocabulary.
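The two-stage generative process can be simulated in a few lines. The sketch below is an illustration under stated assumptions: the two topics and their word probabilities in `TOPIC_WORD` are made up, and the per-document topic distribution is drawn from a symmetric Dirichlet (the prior LDA uses) by normalizing gamma draws. It only generates text; it does not perform the inference that a real LDA implementation such as MALLET carries out.

```python
import random

# Hypothetical topic-word distributions for two hand-made "movie topics".
TOPIC_WORD = {
    "space": {"alien": 0.5, "ship": 0.4, "love": 0.1},
    "romance": {"love": 0.6, "wedding": 0.3, "ship": 0.1},
}

def sample_dirichlet(alpha, k, rng):
    """Draw from a symmetric Dirichlet(alpha) by normalizing k gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5, seed=0):
    """Step 1: draw a topic distribution; step 2: draw a topic, then a word, per position."""
    rng = random.Random(seed)
    topics = list(TOPIC_WORD)
    theta = sample_dirichlet(alpha, len(topics), rng)  # step 1
    words = []
    for _ in range(n_words):
        topic = rng.choices(topics, weights=theta)[0]          # step 2a
        vocab = list(TOPIC_WORD[topic])
        probs = list(TOPIC_WORD[topic].values())
        words.append(rng.choices(vocab, weights=probs)[0])     # step 2b
    return theta, words

theta, words = generate_document(8)
print(theta, words)
```

Because each word position draws its own topic, a single generated document can mix words from several topics.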

Documents exhibit multiple topics

Movie topics from a reviews corpus

Similarity Measure

● Cosine similarity
● KL divergence
● Hellinger distance

Cosine Similarity
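Cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. A minimal sketch, assuming each movie is represented as a plain list of topic weights (the function name and the zero-vector convention are choices made for this illustration):

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between vectors p and q; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)

# Hypothetical topic distributions for two movies.
print(cosine_similarity([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))
```

For non-negative vectors such as topic distributions, the value lies between 0 (no shared topics) and 1 (same direction).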

Hellinger Distance
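The Hellinger distance and the KL divergence both compare two probability distributions, such as the topic distributions of two movies. The sketch below uses the standard textbook definitions; the `eps` smoothing term in `kl_divergence` is an assumption added here to avoid division by zero and is not something the slides specify.

```python
import math

def hellinger_distance(p, q):
    """H(p, q) = sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i))^2)); symmetric, in [0, 1]."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum(p_i * log(p_i / q_i)); asymmetric, terms with p_i = 0 contribute 0."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)

# Hypothetical topic distributions for two movies.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.8, 0.1]
print(hellinger_distance(p, q), kl_divergence(p, q), kl_divergence(q, p))
```

Unlike the KL divergence, the Hellinger distance is symmetric and bounded, which makes it convenient for ranking movies by distance.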

The overall system: implementation

Movie reviews corpus
● Preprocessing
  ○ NLTK and gensim’s simple preprocessing
● Postprocessing
  ○ gensim Python wrapper to MALLET
  ○ index the topic distributions of the query movies, q, and of the 1k-movie corpus, C
● Similarity measure
  ○ Python NumPy implementation
  ○ apply the distance metric to the indexed q and C
  ○ sort and pick the top 5 movies
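The final retrieval step (apply a distance metric between the query and every corpus movie, sort, keep the top few) can be sketched as follows. This is an illustration in plain Python rather than the project's NumPy code; the movie titles and 3-topic distributions are hypothetical, and Hellinger distance is used here only as one of the metrics the deck lists.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def most_similar(query, corpus, top_n=5):
    """Return the top_n (title, distance) pairs closest to the query distribution."""
    scored = [(title, hellinger(query, dist)) for title, dist in corpus.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:top_n]

# Hypothetical 3-topic distributions; real ones would come from the topic model.
corpus = {
    "Movie A": [0.7, 0.2, 0.1],
    "Movie B": [0.1, 0.8, 0.1],
    "Movie C": [0.6, 0.3, 0.1],
}
query = [0.65, 0.25, 0.10]
print(most_similar(query, corpus, top_n=2))
```

With a similarity score such as cosine similarity instead of a distance, the sort order would be reversed so that the largest scores come first.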

Experimental setup

Movie reviews corpus of 1k movies

Reviews data source: IMDb

Evaluation criteria

Conclusion

● Movie topics as efficient features for recommendation systems (RS)
  ○ represent movies by underlying semantic patterns
  ○ useful for capturing movie genre and mood
  ○ but less effective at capturing plot
  ○ user-written movie reviews are useful movie metadata
● The developed prototype
  ○ makes it easy to add more movie metadata
  ○ Python allows scalability
  ○ topics as an explanation need further tuning

Future directions

● Movie review preprocessing
  ○ bigrams, trigrams
  ○ create multi-word movie keywords or language construction
● Building complex topic models
  ○ hierarchical LDA
  ○ author-topic model
    ■ include authorship information
    ■ similarity between authors

Questions?

Thank You

Image src: http://www.brinvy.biz/177215/batman-catching-a-ride-on-supermans-back-funny-hd-wallpaper-x.html

Extra slides

List of extra slides and notes
● Original LDA paper
● Introduction to probabilistic topic modeling
● A. Huang’s "Similarity measures for text document clustering"
● Another good LDA description
● Integrating out multinomial parameters in LDA
● Language construction in micro genres

LDA
