Information Extraction from Movie Subtitles: extraction of information from subtitles to index movies. Dogan Kaya Berktas, [email protected]. CS 578 Natural Language Processing, Graduate Course, Computer Engineering, Bilkent University, Turkey

Movie Categorization According to Subtitles -- NLP Course Project


Page 1: Movie Categorization According to Subtitles -- NLP Course Project

Information Extraction from Movie Subtitles

Extraction of information from subtitles to index movies

Dogan Kaya Berktas [email protected]

CS 578

Natural Language Processing

Graduate Course

Computer Engineering

Bilkent University – Turkey

Page 2: Movie Categorization According to Subtitles -- NLP Course Project

Proposed System in One Sentence

A platform for movie indexing via subtitle analysis

Page 3: Movie Categorization According to Subtitles -- NLP Course Project

Outline

Introduction

Video Categorization Method

WordNet Domains

Conclusions - Future Work

Page 4: Movie Categorization According to Subtitles -- NLP Course Project

Introduction

Multimedia databases are becoming popular

Most video classification methods are based on visual/audio signal processing

Text processing is more lightweight than visual/audio processing

High-level semantics are more closely related to human language than to visual features

Subtitles capture the semantics of the corresponding video

Page 5: Movie Categorization According to Subtitles -- NLP Course Project

Text Pre-processing

Subtitles are segmented into sentences

A part-of-speech tagger is applied to each sentence (Stanford Log-linear Part-Of-Speech Tagger)

Stop words are removed based on a stop word list
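The segmentation and stop-word steps above can be sketched as follows. This is a minimal illustration, not the project's code: the regular-expression sentence splitter, the tiny `STOP_WORDS` set, and the function names are all assumptions made for the example, and the part-of-speech tagging step is left out because it needs an external tagger.

```python
import re

# Illustrative subset only; a real stop word list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "of", "to", "and", "it"}

def split_sentences(text):
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def remove_stop_words(sentence):
    """Lowercase, tokenize on letters/apostrophes, and drop stop words."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def preprocess(text):
    """Return one list of content tokens per sentence."""
    return [remove_stop_words(s) for s in split_sentences(text)]

print(preprocess("The shark is in the reef. It hunts fish!"))
```

In a fuller pipeline the tagger would run on each sentence before filtering, so that only selected word classes are kept as keyword candidates.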

Page 6: Movie Categorization According to Subtitles -- NLP Course Project

Keyword Extraction

The TextRank algorithm is used to extract keywords

TextRank:

represents the text as a graph

applies a ranking algorithm based on Google's PageRank

sorts vertices in decreasing rank order

extracts the top-ranked vertices for further processing

TextRank Mihalcea, R., Tarau, P.: TextRank: Bringing Order into Texts, in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), Barcelona, Spain, July 2004
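A rough sketch of the TextRank idea, assuming an undirected co-occurrence graph over already filtered tokens. The window size, damping factor, iteration count, and the function name `textrank_keywords` are illustrative choices, not values taken from the slides.

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words by a PageRank-style score on a co-occurrence graph."""
    # Link every pair of distinct words that occur within `window` tokens.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Iterate the PageRank update a fixed number of times.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Sort vertices in decreasing rank order (ties broken alphabetically).
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]

tokens = ["shark", "ocean", "shark", "reef", "shark", "fish", "ocean"]
print(textrank_keywords(tokens))
```

With `window=2` only adjacent words are linked; a larger window produces a denser graph and usually smoother scores.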

Page 7: Movie Categorization According to Subtitles -- NLP Course Project

WordNet

“WordNet is a semantic lexicon for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets.” (en.wikipedia.org)

Page 8: Movie Categorization According to Subtitles -- NLP Course Project

WordNet Relations

hypernyms: Y is a hypernym of X if every X is a (kind of) Y (canine is a hypernym of dog)

hyponyms: Y is a hyponym of X if every Y is a (kind of) X (dog is a hyponym of canine)

coordinate terms: Y is a coordinate term of X if X and Y share a hypernym (wolf is a coordinate term of dog, and dog is a coordinate term of wolf)

holonym: Y is a holonym of X if X is a part of Y (building is a holonym of window)

meronym: Y is a meronym of X if Y is a part of X (window is a meronym of building)
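The relations above can be illustrated on a toy taxonomy. The dictionaries below are a hypothetical stand-in for WordNet, built only from the slide's own examples plus two extra entries ("carnivore", "door") added for illustration; they are not a real WordNet API.

```python
# Toy hypernym links (child -> parent); hypothetical stand-in for WordNet.
HYPERNYM = {"dog": "canine", "wolf": "canine", "canine": "carnivore"}
# Toy part-whole links (part -> whole).
HOLONYM = {"window": "building", "door": "building"}

def hypernyms(word):
    """All ancestors of `word`, nearest first."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain

def hyponyms(word):
    """Direct children of `word`."""
    return sorted(child for child, parent in HYPERNYM.items() if parent == word)

def coordinate_terms(word):
    """Words that share the direct hypernym of `word`, excluding `word`."""
    parent = HYPERNYM.get(word)
    if parent is None:
        return []
    return [w for w in hyponyms(parent) if w != word]

def meronyms(whole):
    """Parts of `whole`."""
    return sorted(part for part, w in HOLONYM.items() if w == whole)

print(hypernyms("dog"), hyponyms("canine"), coordinate_terms("dog"), meronyms("building"))
```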

Page 9: Movie Categorization According to Subtitles -- NLP Course Project

Word Sense Disambiguation

Words have many possible meanings, called senses

A Word Sense Disambiguation (WSD) algorithm is needed to determine the correct sense of each word

WSD is based on the lexical database WordNet

WSD Banerjee, S., Pedersen, T.: An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In the Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLING-02) Mexico City, Mexico (2002)
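To give a feel for gloss-overlap disambiguation, here is a sketch of the simplified Lesk idea: pick the sense whose gloss shares the most content words with the context. It is not the adapted Lesk algorithm cited above, which also uses the glosses of related synsets. The sense identifiers, glosses, and stop word set are invented for the example.

```python
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "that", "and", "or", "is", "by", "for"}

# Hypothetical two-sense inventory; real glosses would come from WordNet.
SENSES = {
    "bank": {
        "bank.finance": "an institution that accepts deposits of money and makes loans",
        "bank.river": "sloping land beside a river or a body of water",
    }
}

def content_words(text):
    """Lowercased words of `text` with stop words removed."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}

def simplified_lesk(word, context):
    """Return the sense whose gloss overlaps most with the context words."""
    context_words = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context_words & content_words(gloss))
        if overlap > best_overlap:  # on ties the earlier sense is kept
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk("bank", "he sat on the bank of the river and watched the water"))
```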

Page 10: Movie Categorization According to Subtitles -- NLP Course Project

WordNet Domains Extraction

Augment WordNet with domain labels

A taxonomy of ~200 domain labels

Each synset is annotated with at least one domain label

WN domains: http://wndomains.itc.it/wordnetdomains.html

Page 11: Movie Categorization According to Subtitles -- NLP Course Project

Example WordNet Domain

Page 12: Movie Categorization According to Subtitles -- NLP Course Project

WordNet Domains Extraction II

For each video:

Extract the WordNet domains for each keyword's sense

Calculate the occurrence frequency of each domain label

Sort domain labels in decreasing order according to their occurrence frequency
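The three per-video steps can be sketched with a frequency counter. The `SENSE_DOMAINS` table and its sense identifiers are hypothetical sample data; in the described system the labels would be looked up in WordNet Domains.

```python
from collections import Counter

# Hypothetical sense -> domain labels table (sample data for illustration).
SENSE_DOMAINS = {
    "shark.n.01": ["animals", "biology"],
    "reef.n.01": ["geography"],
    "fish.n.01": ["animals"],
    "larva.n.01": ["entomology", "biology"],
}

def video_domain_profile(keyword_senses):
    """Count domain labels over all keyword senses and sort by frequency."""
    counts = Counter()
    for sense in keyword_senses:
        counts.update(SENSE_DOMAINS.get(sense, []))
    # Decreasing frequency; alphabetical tie-break keeps the order stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

print(video_domain_profile(["shark.n.01", "fish.n.01", "larva.n.01", "reef.n.01"]))
```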

Page 13: Movie Categorization According to Subtitles -- NLP Course Project

Correspondences between categories & WN domains

For each category label:

Look up in WordNet the senses related to it (include senses related through hypernym & hyponym relations)

Obtain the corresponding WordNet domains

Calculate the occurrence score for each domain

Sort domains in decreasing occurrence order
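A sketch of how a category's ordered domain list might be built, expanding the label's senses through hypernym and hyponym links before counting. All four lookup tables hold invented sample data, and using plain counts as the "occurrence score" is an assumption, since the slides do not define the score.

```python
from collections import Counter

# Hypothetical toy data; a real system would query WordNet and WordNet Domains.
LABEL_SENSES = {"war": ["war.n.01"]}
HYPERNYMS = {"war.n.01": ["military_action.n.01"]}
HYPONYMS = {"war.n.01": ["world_war.n.01", "civil_war.n.01"]}
SENSE_DOMAINS = {
    "war.n.01": ["military", "history"],
    "military_action.n.01": ["military"],
    "world_war.n.01": ["history", "military"],
    "civil_war.n.01": ["military"],
}

def category_domain_profile(label):
    """Ordered domain list for a category label (most frequent first)."""
    senses = []
    for sense in LABEL_SENSES.get(label, []):
        senses.append(sense)
        senses.extend(HYPERNYMS.get(sense, []))  # more general senses
        senses.extend(HYPONYMS.get(sense, []))   # more specific senses
    counts = Counter()
    for sense in senses:
        counts.update(SENSE_DOMAINS.get(sense, []))
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [domain for domain, _ in ordered]

print(category_domain_profile("war"))
```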

Page 14: Movie Categorization According to Subtitles -- NLP Course Project

Category    WordNet domains
science     medicine, biology, mathematics
war         military, history
animals     animals, biology, entomology

Page 15: Movie Categorization According to Subtitles -- NLP Course Project

Category label assignment

Compare the ordered list of WN domains of each video with the ordered list of WN domains of each category

Category    WordNet domains
science     medicine, biology, mathematics
war         military, history
animals     animals, biology, entomology

Example:

WN domains of a video             Category
animals, entomology, biology      animals
biology, mathematics, physics     science
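The slides do not say how the two ordered lists are compared, so the following is only one plausible sketch: each domain shared by the video and the category contributes more when its rank is similar in both lists. The scoring formula and the function names are assumptions; the category table is the one shown above.

```python
CATEGORY_DOMAINS = {
    "science": ["medicine", "biology", "mathematics"],
    "war": ["military", "history"],
    "animals": ["animals", "biology", "entomology"],
}

def rank_similarity(video_domains, category_domains):
    """Assumed measure: each shared domain adds 1 / (1 + rank distance)."""
    score = 0.0
    for video_rank, domain in enumerate(video_domains):
        if domain in category_domains:
            category_rank = category_domains.index(domain)
            score += 1.0 / (1 + abs(video_rank - category_rank))
    return score

def assign_category(video_domains):
    """Return the best-matching category, or None if nothing overlaps."""
    best = max(
        CATEGORY_DOMAINS,
        key=lambda c: rank_similarity(video_domains, CATEGORY_DOMAINS[c]),
    )
    if rank_similarity(video_domains, CATEGORY_DOMAINS[best]) == 0:
        return None
    return best

print(assign_category(["animals", "entomology", "biology"]))
print(assign_category(["biology", "mathematics", "physics"]))
```

Under this assumed measure the two example videos from the slide map to "animals" and "science" respectively; a different comparison, such as a rank correlation, could be swapped in without changing the rest of the pipeline.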

Page 16: Movie Categorization According to Subtitles -- NLP Course Project
Page 17: Movie Categorization According to Subtitles -- NLP Course Project

Conclusions & Future Work

Conclusions

An approach that is based only on text and uses natural language processing techniques

No training phase (unsupervised approach)

WordNet Domain mapping

Future Work

Definition of domain knowledge closer to movie classification (MPEG-7)

Improved WSD

Page 18: Movie Categorization According to Subtitles -- NLP Course Project

Thank you!

Questions & Comments

http://doganberktas.com