Information Extraction from Movie Subtitles
Extraction of information from subtitles to index movies
Dogan Kaya
CS 578
Natural Language Processing
Graduate Course
Computer Engineering
Bilkent University – Turkey
Proposed System in One Sentence
A platform for movie indexing via subtitle analysis
Outline
Introduction
Video Categorization Method
WordNet Domains
Conclusions - Future Work
Introduction
Multimedia databases are becoming popular
Most video classification methods are based on visual/audio signal processing
Text processing is more lightweight than visual/audio processing
High-level semantics are more closely related to human language than to visual features
Subtitles capture the semantics of the corresponding video
Text Pre-processing
Subtitles are segmented into sentences
A Part of Speech Tagger is applied to each sentence (Stanford Log-linear Part-Of-Speech Tagger)
Stop words are removed based on a stop words list
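The segmentation and stop-word steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the subtitles are assumed to be in SRT format, the stop-word list is a tiny made-up sample, and the function names are hypothetical. Part-of-speech tagging is left out because it requires an external tagger such as the Stanford one named above.

```python
import re

# Tiny illustrative stop-word list (an assumption; the slides do not name the list used).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "to", "in", "and", "it", "on"}

def srt_to_text(srt: str) -> str:
    """Drop blank lines, cue numbers and timestamp lines from SRT-style subtitles."""
    kept = []
    for line in srt.splitlines():
        line = line.strip()
        if not line or line.isdigit() or "-->" in line:
            continue
        kept.append(line)
    return " ".join(kept)

def split_sentences(text: str) -> list:
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def remove_stop_words(sentence: str) -> list:
    """Lowercase, tokenize on letters and apostrophes, and drop stop words."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

SAMPLE_SRT = """1
00:00:01,000 --> 00:00:03,000
The wolf is a canine.

2
00:00:04,000 --> 00:00:06,000
It hunts in the forest!
"""

sentences = split_sentences(srt_to_text(SAMPLE_SRT))
tokens = [remove_stop_words(s) for s in sentences]
```

A real pipeline would likely replace the regular-expression splitter with a proper sentence segmenter, since subtitle text often lacks reliable punctuation.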
Keyword Extraction
TextRank algorithm to extract keywords
TextRank:
represents the text as a graph
is a ranking algorithm based on Google's PageRank
sorts vertices in decreasing rank order
extracts the top highly ranked vertices for further processing
TextRank: Mihalcea, R., Tarau, P.: TextRank: Bringing Order into Texts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), Barcelona, Spain, July 2004
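A minimal sketch of the TextRank idea, assuming an unweighted, undirected co-occurrence graph and a fixed number of PageRank iterations. The function name, parameters and defaults are illustrative choices, not taken from the slides, and the part-of-speech filter used in the original paper is omitted.

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words with PageRank over an undirected co-occurrence graph."""
    # A window of N words links each word to the next N - 1 words.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Iterate the PageRank update; each vertex shares its score among its neighbors.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Sort vertices in decreasing rank order (ties broken alphabetically).
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]

keywords = textrank_keywords(["wolf", "pack", "wolf", "forest", "wolf", "hunt"])
```

In this toy input every other word is adjacent only to "wolf", so "wolf" ends up with the highest score.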
WordNet
“WordNet is a semantic lexicon for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets.” (en.wikipedia.org)
WordNet Relations
hypernyms: Y is a hypernym of X if every X is a (kind of) Y (canine is a hypernym of dog)
hyponyms: Y is a hyponym of X if every Y is a (kind of) X (dog is a hyponym of canine)
coordinate terms: Y is a coordinate term of X if X and Y share a hypernym (wolf is a coordinate term of dog, and dog is a coordinate term of wolf)
holonym: Y is a holonym of X if X is a part of Y (building is a holonym of window)
meronym: Y is a meronym of X if Y is a part of X (window is a meronym of building)
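The relations above can be illustrated with a toy lookup table. The data below is a hand-written stand-in built from the slide's own examples, not real WordNet content, and all function names are hypothetical.

```python
# Toy taxonomy (hypothetical data, not real WordNet): word -> direct hypernym.
HYPERNYM = {"dog": "canine", "wolf": "canine", "canine": "carnivore"}
# Toy part-whole data: part -> whole (its holonym).
HOLONYM = {"window": "building"}

def hypernyms(word):
    """All ancestors of `word`, nearest first."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain

def hyponyms(word):
    """Direct hyponyms: words whose direct hypernym is `word`."""
    return sorted(w for w, parent in HYPERNYM.items() if parent == word)

def coordinate_terms(word):
    """Words sharing a direct hypernym with `word`, excluding the word itself."""
    parent = HYPERNYM.get(word)
    return [w for w in hyponyms(parent) if w != word] if parent else []

def meronyms(word):
    """Parts of `word`: the inverse of the holonym relation."""
    return sorted(part for part, whole in HOLONYM.items() if whole == word)
```

For instance, `coordinate_terms("dog")` returns `["wolf"]` here because both words list "canine" as their direct hypernym.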
Word Sense Disambiguation
Words have many possible meanings, called senses
A Word Sense Disambiguation (WSD) algorithm is needed to determine the correct sense of each word
WSD is based on the lexical database WordNet
WSD: Banerjee, S., Pedersen, T.: An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLING-02), Mexico City, Mexico (2002)
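To make the idea of gloss-based disambiguation concrete, here is a sketch of the simplified Lesk heuristic, which picks the sense whose gloss overlaps most with the context. It is not the Adapted Lesk algorithm cited above (that variant also uses the glosses of related synsets), and the sense inventory, sense names and stop-word list are invented for the example.

```python
import re

# Hypothetical mini sense inventory; real glosses would come from WordNet.
SENSES = {
    "bank": {
        "bank.financial": "an institution that accepts deposits and lends money",
        "bank.river": "sloping land beside a river or other body of water",
    }
}
STOP_WORDS = {"a", "an", "the", "of", "or", "and", "that", "to", "in", "on", "by"}

def content_words(text):
    """Lowercased word set with stop words removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}

def simplified_lesk(word, context):
    """Pick the sense whose gloss shares the most content words with the context."""
    context_words = content_words(context) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES.get(word, {}).items():
        overlap = len(content_words(gloss) & context_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sense = simplified_lesk("bank", "They sat on the bank of the river near the water")
```

With this toy inventory the river sense wins because its gloss shares "river" and "water" with the context, while the financial gloss shares nothing.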
WordNet Domains Extraction
Augment WordNet with domain labels
A taxonomy of ~200 domain labels
Each synset is annotated with at least one domain label
WordNet Domains: http://wndomains.itc.it/wordnetdomains.html
Example WordNet Domain
WordNet Domains Extraction II
For each video:
Extract the WordNet domains for each keyword’s sense
Calculate the occurrence frequency of each domain label
Sort domain labels in decreasing order according to their occurrence frequency
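The three per-video steps above amount to a frequency count followed by a sort. The sketch below assumes a hypothetical sense-to-domain lookup table. The sense identifiers and labels are illustrative and not taken from the actual WordNet Domains resource.

```python
from collections import Counter

# Hypothetical sense -> domain labels lookup; the real data is WordNet Domains.
SENSE_DOMAINS = {
    "wolf.n.01": ["animals"],
    "ant.n.01": ["animals", "entomology"],
    "cell.n.02": ["biology"],
    "beetle.n.01": ["animals", "entomology"],
}

def rank_domains(keyword_senses):
    """Count domain labels over all keyword senses, most frequent first."""
    counts = Counter()
    for sense in keyword_senses:
        counts.update(SENSE_DOMAINS.get(sense, []))
    # Sort by decreasing frequency; break ties alphabetically for determinism.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

video_domains = rank_domains(["wolf.n.01", "ant.n.01", "cell.n.02", "beetle.n.01"])
```

For these four senses the ordered labels are animals, entomology, biology.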
Correspondences between categories & WN domains
For each category label:
Look up in WordNet the senses related to it (include senses related through hypernym & hyponym relations)
Obtain the corresponding WordNet domains
Calculate the occurrence score for each domain
Sort domains in decreasing occurrence order
WordNet domains                     Category
medicine, biology, mathematics      science
military, history                   war
animals, biology, entomology        animals
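The per-category steps can be sketched in the same way. The slides do not define how the occurrence score is computed, so raw counts are assumed here. All senses, relations and domain labels in the lookup tables are hypothetical toy data, and only direct hypernyms and hyponyms are followed.

```python
from collections import Counter

# Hypothetical toy data standing in for WordNet and WordNet Domains.
HYPERNYMS = {"war.n.01": ["military_action.n.01"]}
HYPONYMS = {"war.n.01": ["world_war.n.01", "civil_war.n.01"]}
SENSE_DOMAINS = {
    "war.n.01": ["military", "history"],
    "military_action.n.01": ["military"],
    "world_war.n.01": ["history", "military"],
    "civil_war.n.01": ["military"],
}
CATEGORY_SENSES = {"war": ["war.n.01"]}

def category_domains(category):
    """Ordered domain list for a category from its senses plus direct hypernyms/hyponyms."""
    senses = []
    for sense in CATEGORY_SENSES.get(category, []):
        senses.append(sense)
        senses.extend(HYPERNYMS.get(sense, []))
        senses.extend(HYPONYMS.get(sense, []))
    counts = Counter()
    for sense in senses:
        counts.update(SENSE_DOMAINS.get(sense, []))
    # Decreasing occurrence order, ties broken alphabetically.
    return [d for d, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

war_profile = category_domains("war")
```

With this toy data the "war" category yields the ordered list military, history.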
Category label assignment
Compare the ordered list of the WN domains of each video with the ordered list of the WN domains of each category

WordNet domains                     Category
medicine, biology, mathematics      science
military, history                   war
animals, biology, entomology        animals

Example:
WN domains of a video               Category
animals, entomology, biology        animals
biology, mathematics, physics       science
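The slides do not say how the two ordered lists are compared, so the sketch below assumes one plausible measure: each shared domain contributes 1 / (1 + rank distance), and the category with the highest total is assigned. The function names and the scoring rule are assumptions made for illustration. The category table is the one shown on the slide.

```python
# Category -> ordered WordNet domains, as listed in the slide's table.
CATEGORY_DOMAINS = {
    "science": ["medicine", "biology", "mathematics"],
    "war": ["military", "history"],
    "animals": ["animals", "biology", "entomology"],
}

def rank_similarity(video_domains, category_domains):
    """Sum of 1 / (1 + rank distance) over domains present in both lists."""
    score = 0.0
    for v_rank, domain in enumerate(video_domains):
        if domain in category_domains:
            c_rank = category_domains.index(domain)
            score += 1.0 / (1 + abs(v_rank - c_rank))
    return score

def assign_category(video_domains):
    """Return the category whose ordered domain list best matches the video's."""
    return max(
        CATEGORY_DOMAINS,
        key=lambda c: rank_similarity(video_domains, CATEGORY_DOMAINS[c]),
    )

first = assign_category(["animals", "entomology", "biology"])
second = assign_category(["biology", "mathematics", "physics"])
```

Under this assumed measure the two example videos are labeled animals and science, which agrees with the slide's example.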
Conclusions & Future Work
Conclusions
An approach that is based only on text and uses natural language processing techniques
No training phase (unsupervised approach)
WordNet Domain mapping
Future Work
Definition of domain knowledge closer to movie classification (MPEG-7)
Improved WSD
Thank you!
Questions & Comments
http://doganberktas.com