Movie Topics: Efficient Features for Movie Recommendation Systems

Efficient Features for Movie Recommendation Systems

Project presentation

Suvir Bhargav

Outline

● Motivation and why movie reviews
● Problem statement
● How? The overall system
● Text preprocessing approaches
● Postprocessing: movie topics from a reviews corpus
● Similarity
● Experimental setup and results

Thanks to Sean Lind, source: http://www.silveroakcasino.com/blog/posts/netflix/what-to-watch-on-netflix.html

Motivation


● Movie genres are not enough.
● Classify movies by
  ○ keywords
  ○ moods
  ○ IMDb ratings
  ○ micro genres

http://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/

micro genres

source: http://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/

Why movie reviews?

Source: a sample user-written movie review from IMDb

Problem statement

● Feature extraction from user reviews of movies

● Use extracted features to find similar movies.

The overall system

Movie reviews corpus
● preprocessing
  ○ tokenization, stopword removal, lemmatization.
● post processing
  ○ topic modeling: movie topics from a reviews corpus
● similarity measure
  ○ return movies with similar topic distributions


Simple information extraction

Text preprocessing

Figure credit: the NLTK book.
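The preprocessing steps named in the system overview (tokenization, stopword removal, lemmatization) can be sketched roughly as follows. This is a minimal standard-library illustration, not the pipeline used in the project: the stopword list is a tiny hypothetical sample, and `naive_lemma` is only a crude stand-in for a real lemmatizer such as the one NLTK provides.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "is", "was", "were",
             "of", "in", "it", "this", "to", "me"}

def naive_lemma(token):
    """Crude stand-in for lemmatization: strip a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(review):
    """Lowercase and tokenize a review, drop stopwords, normalize tokens."""
    tokens = re.findall(r"[a-z]+", review.lower())
    return [naive_lemma(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The actors were amazing and the plot twists kept me guessing."))
```

Each review becomes a list of normalized tokens, which is the input the later topic-modeling stage expects.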

Post processing

Document representation: Vector Space Model (VSM)

Picture credit: pyevolve

http://pyevolve.sourceforge.net/wordpress/?p=2497
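As a small illustration of the vector space model, the sketch below turns tokenized documents into term-count vectors over a shared vocabulary. The helper names `build_vocabulary` and `to_vector` and the two toy documents are hypothetical and chosen only for this example.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct token to a column index (sorted for a stable order)."""
    return {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}

def to_vector(doc, vocab):
    """Term-count vector for one tokenized document."""
    counts = Counter(doc)
    return [counts.get(tok, 0) for tok in vocab]

docs = [["dark", "thriller", "plot"], ["funny", "plot", "plot"]]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Every document is now a point in the same space, so documents can be compared with a vector similarity measure.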

Post processing: generative model

Source: David Blei’s slides

Post processing: LDA

For each document in the collection, the words are generated in a two-stage process:
1) Randomly choose a distribution over topics.
2) For each word in the document:
   a) Randomly choose a topic from the distribution over topics in step 1.
   b) Randomly choose a word from the corresponding distribution over the vocabulary.
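The two-stage process above can be simulated directly. The sketch below assumes, as in standard LDA, that the per-document topic distribution is drawn from a symmetric Dirichlet prior; the function name `generate_document`, the toy sizes (3 topics, 6 vocabulary entries), and the randomly generated `topic_word` matrix are illustrative choices, not values from the project.

```python
import numpy as np

def generate_document(n_words, alpha, topic_word, rng):
    """Sample one document following the two-stage generative process."""
    n_topics, vocab_size = topic_word.shape
    # Step 1: draw this document's distribution over topics
    # (assumed here to come from a symmetric Dirichlet prior).
    theta = rng.dirichlet(alpha * np.ones(n_topics))
    topics, words = [], []
    for _ in range(n_words):
        z = int(rng.choice(n_topics, p=theta))            # step 2a: pick a topic
        w = int(rng.choice(vocab_size, p=topic_word[z]))  # step 2b: pick a word
        topics.append(z)
        words.append(w)
    return theta, topics, words

rng = np.random.default_rng(0)
# Hypothetical topic-word distributions: one row per topic, rows sum to 1.
topic_word = rng.dirichlet(0.5 * np.ones(6), size=3)
theta, topics, words = generate_document(20, 0.5, topic_word, rng)
print(theta, words)
```

Topic inference runs this story in reverse: given only the observed words, it estimates `theta` for each document and the rows of `topic_word`.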

Documents exhibit multiple topics

Movie topics from a reviews corpus

Similarity Measure

● Cosine similarity
● KL divergence
● Hellinger distance

Cosine Similarity
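For reference, cosine similarity between two vectors is their dot product divided by the product of their Euclidean norms. A minimal numpy sketch, with a hypothetical function name and toy inputs, might look like this:

```python
import numpy as np

def cosine_similarity(p, q):
    """Cosine of the angle between vectors p and q (1.0 means same direction)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

print(cosine_similarity([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))
```

The sketch assumes neither vector is all zeros, which holds for topic distributions because their entries sum to 1.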


Hellinger Distance
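For two discrete probability distributions p and q, the Hellinger distance is the Euclidean distance between their element-wise square roots, scaled by 1/sqrt(2) so that it lies in [0, 1]. A minimal numpy sketch, with a hypothetical function name and toy inputs:

```python
import numpy as np

def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

print(hellinger_distance([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))
```

Unlike KL divergence, this measure is symmetric and stays finite when one distribution assigns zero probability to a topic.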


The overall system: implementation

Movie reviews corpus
● preprocessing
  ○ NLTK and gensim’s simple preprocessing.
● post processing
  ○ gensim Python wrapper to MALLET
  ○ index the topic distributions of the query movies, q, and the 1k-movie corpus, C.
● similarity measure
  ○ Python numpy implementation
  ○ apply the distance metric to the indexed q and C.
  ○ sort and pick the top 5 movies.
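The similarity stage (apply a distance metric to q and C, sort, pick the top 5) can be sketched in numpy as below. This is an illustration under stated assumptions rather than the project's code: the topic distributions are synthetic Dirichlet draws standing in for the MALLET output, 10 topics is an arbitrary choice, Hellinger distance is used as the metric, and the names `hellinger_to_all` and `top_k_similar` are hypothetical.

```python
import numpy as np

def hellinger_to_all(q, C):
    """Hellinger distance between one distribution q and every row of C."""
    return np.sqrt(0.5 * np.sum((np.sqrt(C) - np.sqrt(q)) ** 2, axis=1))

def top_k_similar(q, C, k=5):
    """Return (row index, distance) for the k rows of C closest to q."""
    dists = hellinger_to_all(q, C)
    order = np.argsort(dists)[:k]  # smallest distance first
    return [(int(i), float(dists[i])) for i in order]

rng = np.random.default_rng(42)
# Synthetic stand-in for the indexed corpus: 1000 movies x 10 topics.
C = rng.dirichlet(np.ones(10), size=1000)
q = C[7]  # use one corpus movie as the query
results = top_k_similar(q, C, k=5)
print(results)
```

Because the query here is itself a row of the corpus, it comes back first with distance 0; a real system would typically skip the query movie before reporting the top 5.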

Experimental setup

Movie reviews corpus of 1k movies

Reviews data source: IMDb

Evaluation criteria


Conclusion

● Movie topics as efficient features for recommender systems (RS)
  ○ represent movies by their underlying semantic patterns.
  ○ are useful for capturing movie genre and mood,
  ○ but work less well for plot.
  ○ User-written movie reviews are useful movie metadata.
● The developed prototype
  ○ makes it easy to add more movie metadata.
  ○ Python allows scalability.
  ○ Using topics as an explanation needs further tuning.

Future directions

● Movie review preprocessing
  ○ bigrams, trigrams.
  ○ create multi-word movie keywords or language construction
● Building complex topic models
  ○ Hierarchical LDA
  ○ author-topic model
    ■ include authorship information.
    ■ similarity between authors

Questions?

Thank You

Image src: http://www.brinvy.biz/177215/batman-catching-a-ride-on-supermans-back-funny-hd-wallpaper-x.html

Extra slides

List of extra slides and notes
● Original LDA paper
● Introduction to probabilistic topic modeling
● A. Huang’s "Similarity measures for text document clustering"
● Another good LDA description
● Integrating out multinomial parameters in LDA
● Language construction in micro genres

http://sumidiot.wordpress.com/2012/06/13/lda-from-scratch/

http://lingpipe.files.wordpress.com/2010/07/lda3.pdf

http://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/2/
