Application of Topic Segmentation in Audiovisual Information Retrieval

Petra Galuščáková

Information Retrieval

● Finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers) [Manning, 21]

● Audiovisual Information Retrieval
  - Documents to retrieve are in audiovisual format
  - Navigation is harder

● Dependency on segmentation
  - We want to minimize the work required from the user and retrieve the exact start point
  - Especially for audio and audiovisual data -> we need precise segmentation
  - Eskevich [6] reports significantly better IR results with the TextTiling segmentation algorithm than with the C99 segmentation algorithm

Topic Segmentation

● Segment
  ● Coherent part of data
  ● Definition depends on the application, e.g. news story, paragraphs in text
  ● Hierarchical/linear structure

● Audiovisual recordings
  ● No given text structure
  ● Need to be segmented into sentences first

Topic Segmentation in Text

● Automatic Speech Recognition for transformation of audio track into text

● Errors in transcripts could influence segmentation

● Malioutov et al. [20] show that the evaluation of segmentation algorithms differs depending on whether manual or automatic transcripts are used

● Hsueh and Moore [12] show that, despite a high word error rate (WER of 39.1%), their segmentation systems did not perform significantly worse on ASR transcripts than on reference transcripts.

– An ASR system is likely to misrecognize different occurrences of a word in the same way

– Using more features than just the ASR output could reduce the impact of recognition errors

Systems for Topic Segmentation

● Lexical Cohesion Based
  ● TextTiling [10], C99 [3], LCSeg [8], MinCut [19], Dotplot [16], IClustSeg [26], TextLec [29], DivSeg [31], NM09 [24], U00 [33], JSeg [1], Transeg [17], LCP [15], LSITiling [A9], TopSeg [11]

● Feature Based
  ● [12], [7], PLSA [14] – Decision trees [25, 32], Maximum Entropy, SVM [14]

● Generative Models
  ● HMM [13, 32], BayesSeg, U00 [4, 33]

Lexical Cohesion

● Cohesion
  - The sentences "stick together" to function as a whole [23]
  - Achieved through back-reference, conjunction, and semantic word relations

● Division according to Halliday and Hasan [9]:

● Reiteration:
  – Reiteration with identity of reference:
    1. Mary bit into a peach. 2. Unfortunately the peach wasn't ripe.
  – Reiteration without identity of reference:
    1. Mary ate some peaches. 2. She likes peaches very much.
  – Reiteration by means of a superordinate (subordinate, or synonym):
    1. Mary ate a peach. 2. She likes fruit.

● Collocation:
  – Systematic semantic relation (systematically classifiable):
    1. Mary likes green apples. 2. She does not like red ones.
  – Nonsystematic semantic relation (not systematically classifiable):
    1. Mary spent three hours in the garden yesterday. 2. She was digging potatoes.

Systems for Topic Segmentation - C99

● C99 [3]

● Based on the cosine measure of sentence pairs

– Similarity between sentences x and y, where $f_{i,j}$ denotes the frequency of word j in sentence i:
  $\mathrm{sim}(x, y) = \frac{\sum_j f_{x,j} \, f_{y,j}}{\sqrt{\sum_j f_{x,j}^2 \cdot \sum_j f_{y,j}^2}}$

– Similarity values are used to build the similarity matrix [17]

– Then the rank matrix is built from the similarity matrix
  ● Each value in the similarity matrix is replaced by its rank in the local region. The rank is the number of neighbouring elements with a lower similarity value [3]

– Finally, clustering is applied
  ● Iteratively searching for the maximum density of sub-matrices in the rank matrix
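The first two C99 steps (cosine similarity matrix, then local ranking) can be sketched roughly as follows. This is a minimal illustration, not Choi's implementation: whitespace tokenization, the function names and the `radius` parameter (size of the local region) are assumptions made for the example, and the final clustering step is left out.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-frequency vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_matrix(sentences):
    """Pairwise cosine similarity of all sentences (assumed whitespace-tokenized)."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    n = len(vecs)
    return [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]


def rank_matrix(sim, radius=1):
    """Replace each value by the number of neighbours in its local region
    (a (2*radius+1)-wide square, clipped at the borders) with a lower value."""
    n = len(sim)
    ranks = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ni, nj = i + di, j + dj
                    if (di or dj) and 0 <= ni < n and 0 <= nj < n:
                        if sim[ni][nj] < sim[i][j]:
                            ranks[i][j] += 1
    return ranks
```

Ranking makes the values comparable across regions of the matrix, because only the relative order of similarities inside the local region is kept.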

Systems for Topic Segmentation - TextTiling

● Based on lexical repetition

● Uses the cosine measure

● A window of fixed length is gradually slid through the text, and information about the word overlap between the left and right parts of the window is converted into a digital signal [10]

● The resulting graph is then smoothed

● The shape of the post-processed signal is used to determine segment breaks

● High similarity values, implying that the adjacent blocks cohere well, tend to form peaks, whereas low similarity values, indicating a potential boundary between tiles, create valleys. [10]
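The steps above can be sketched in a few functions. This is a simplified illustration under stated assumptions, not Hearst's original implementation: sentences are assumed to be pre-tokenized lists of words, and `block_size` and the smoothing `width` are hypothetical parameters chosen for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-frequency vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def gap_scores(sentences, block_size=2):
    """Similarity of the blocks to the left and right of every sentence gap."""
    scores = []
    for gap in range(1, len(sentences)):
        left = Counter(w for s in sentences[max(0, gap - block_size):gap] for w in s)
        right = Counter(w for s in sentences[gap:gap + block_size] for w in s)
        scores.append(cosine(left, right))
    return scores


def smooth(scores, width=1):
    """Moving-average smoothing of the similarity signal."""
    out = []
    for i in range(len(scores)):
        window = scores[max(0, i - width):i + width + 1]
        out.append(sum(window) / len(window))
    return out


def depth_scores(scores):
    """Depth of each valley: climb to the nearest peak on both sides."""
    depths = []
    for i, s in enumerate(scores):
        left = s
        for v in reversed(scores[:i]):
            if v < left:
                break
            left = v
        right = s
        for v in scores[i + 1:]:
            if v < right:
                break
            right = v
        depths.append((left - s) + (right - s))
    return depths
```

The deepest valleys are the boundary candidates; a full system would then choose boundaries by thresholding the depth scores.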

Systems for Topic Segmentation – Features Based

● Text

● Lexical features
  - Cue words and n-grams (now, okay, let’s, um, so, good night, ...) [12, 28]
  - Distribution of nouns [7]

● Contextual features
  - Dialogue act type [12]
  - Speaker role (e.g., project manager, marketing expert)
  - Tense, aspect [24]

● Vocabulary
  - Word groups (months, days, country names, named entities, ...)
  - POS tags
  - Pronouns (does the sentence contain a pronoun?), numbers (segment of a specific length), direct speech (is this sentence part of a conversation, i.e. does it contain “direct speech”?) [12]
  - Interlocutors mention agenda items (e.g., presentation, meeting) or content words more often when initiating a new discussion [12]


● Text

● According to Hsueh [12], interlocutors do the following more often than usual at segment boundaries: start speaking before they are ready, give information, elicit an assessment of what has been said so far, or act to smooth social functioning and make the group happier

● Lexical Chains [2, 14]
  - Does the word appear in the next few sentences?
  - Does the word appear in the next few words?
  - Does the word appear in the previous few sentences?
  - Does the word appear in the previous few words?
  - Does the word appear in the previous few sentences but not in the next few sentences?
  - Does the word begin the preceding sentence?
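As a rough illustration, the sentence-level questions above can be turned into boolean features as follows. The function name, the feature names and the `window` parameter ("a few" sentences) are assumptions made for this sketch; the word-level variants would follow the same pattern over a token window.

```python
def lexical_chain_features(sentences, sent_idx, word, window=2):
    """Boolean recurrence features for `word` around sentence `sent_idx`.

    `sentences` is a list of token lists; `window` is the assumed number
    of neighbouring sentences that counts as "a few".
    """
    prev = sentences[max(0, sent_idx - window):sent_idx]
    nxt = sentences[sent_idx + 1:sent_idx + 1 + window]
    in_prev = any(word in s for s in prev)
    in_next = any(word in s for s in nxt)
    preceding = sentences[sent_idx - 1] if sent_idx > 0 else []
    return {
        "in_next_sentences": in_next,
        "in_prev_sentences": in_prev,
        "in_prev_not_next": in_prev and not in_next,
        "begins_preceding_sentence": bool(preceding) and preceding[0] == word,
    }
```

A word that occurred before but does not occur again soon (`in_prev_not_next`) suggests that a lexical chain ends here, which is a hint that a topic boundary may be close.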


● Audio:

● Conversational features [12]
  - Amount of overlapping speech
  - Speaker activity change [24]

● Prosodic features [12]
  - Fundamental frequency F0: maximum and mean F0, patterns across the boundary [32]
  - Energy, energy at multiple points (e.g., the first and last 100 and 200 ms, the first and last quarter, the first and second half)
  - Pitch contour (relative to the speaker’s baseline [32]); pitch is less robust [30]
  - Rate of speech (number of words and number of syllables spoken per second)
  - Silence [1]
  - Duration of pauses [30], vowels [1], final vowels and final rhymes [32]

Segmentation Using Audio Information

● A segment is likely to start with higher-pitched sounds and a lower rate of speech

● Speakers tend to reset pitch at the start of a new major unit; a final fall in pitch is associated with the ends of such units [30]

● Slowing down toward the ends of units [30]

● Topic shifts often occur after a pause of relatively long duration [12]
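Two of these cues, pause duration and rate of speech, can be derived directly from word-level timestamps such as those produced by an ASR system. The sketch below assumes a hypothetical input format of `(word, start_seconds, end_seconds)` tuples; pitch and energy would require the audio signal itself and are not covered.

```python
def pause_before(words, i):
    """Silence in seconds between word i-1 and word i (0.0 for the first word).

    `words` is a list of (word, start, end) tuples, an assumed format.
    """
    if i == 0:
        return 0.0
    return max(0.0, words[i][1] - words[i - 1][2])


def speech_rate(words):
    """Words per second over the time span covered by `words`."""
    if not words:
        return 0.0
    duration = words[-1][2] - words[0][1]
    return len(words) / duration if duration > 0 else 0.0


def longest_pause_index(words):
    """Index of the word preceded by the longest pause: a boundary candidate."""
    return max(range(len(words)), key=lambda i: pause_before(words, i))
```

In a feature-based segmenter these values would be inputs to a classifier alongside the lexical features, rather than being used as boundaries on their own.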


● Video:

● Color similarity
  – Based on histograms

● Motion similarity
  – Pixel comparison
  – Especially frontal shots, hand movements [12]
  – Gestural features (eye gaze behaviour) [5], face similarity

● Bag of Visual Words

● Interlocutors do not move around a lot when a new discussion is brought up [12]
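A histogram-based color similarity between two frames could look roughly like this. The sketch assumes frames are given as flat lists of `(r, g, b)` tuples with 8-bit channels and uses histogram intersection as the similarity; both choices, and the `bins` parameter, are assumptions for illustration, not something prescribed by the cited work.

```python
def color_histogram(pixels, bins=4):
    """Normalized joint RGB histogram with `bins` levels per channel."""
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index 0..bins-1.
        idx = (r * bins // 256) * bins * bins + (g * bins // 256) * bins + (b * bins // 256)
        hist[idx] += 1.0
    total = len(pixels)
    return [h / total for h in hist] if total else hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: 1.0 for identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

A sharp drop in similarity between consecutive frames is a common signal for a shot change, which can then serve as a visual boundary feature.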


● Hsu et al. [11] create new features as combinations of other features

● They show that the most useful features are the anchor face and pauses

● According to Hsueh [12], lexical features must be combined with other features, in particular conversational features (i.e., lexical cohesion, overlap, pause, speaker change)

Fusion

● Llinas [18] defines fusion as an information process that associates, correlates and combines data and information from single or multiple sensors or sources to achieve refined estimates of parameters, characteristics, events and behaviors

● Given many sources of information and context, how can we do our best to “interpret” the data? [22]

● Levels of fusion

● Early fusion strategy
  - All modalities are "concatenated into one"
  - Only one decision is taken over the concatenated input

● Intermediate fusion strategy
  - E.g. creating various feature vectors, which are finally processed by an HMM

● Late fusion strategy
  - Each source is processed individually by a specific recognizer
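The difference between early and late fusion can be illustrated with a small sketch. The function names are hypothetical, and the weighted average used to combine the per-modality decisions in late fusion is only one possible choice, assumed here for simplicity.

```python
def early_fusion(text_feats, audio_feats, video_feats):
    """Early fusion: concatenate all modality features into one vector,
    over which a single classifier then takes one decision."""
    return list(text_feats) + list(audio_feats) + list(video_feats)


def late_fusion(scores, weights=None):
    """Late fusion: each modality has its own recognizer producing a
    boundary score; the scores are combined afterwards (weighted mean)."""
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```

For example, `early_fusion([1, 2], [3], [4, 5])` yields a single five-dimensional input, whereas `late_fusion([0.9, 0.5, 0.1])` merges three independent per-modality decisions into one score.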

Our Approach - Objectives

● Segments should be further processed by an IR system

● Usable on several systems – MediaEval competition data and the Dialogy corpus

● Applicable to various types of recordings (news data and dialogs)

● Language independent – should work at least with English and Czech data

● Small amount of training data for a given type of recordings

● Training data exist for other types of recordings (e.g. TDT corpus, available from LDC; Malach)

● Possible to integrate user feedback (in the Dialogy corpus)

Our Approach - Solution

● Should be feature based – one of the features could be the output of a cohesion-based algorithm (TextTiling)

● Should incorporate all types of information (textual, audio and visual)

● Should use fusion for mixing these different sources

● In the visual track, shot detection should be used

● Active learning could help to incorporate user feedback

References

● [1] Katarina Bartkova: How far can prosodic cues help in word segmentation? In Proceedings of the 3rd International Conference on Speech Prosody (SP2006), 2006

● [2] Doug Beeferman, Adam Berger, John Lafferty: Statistical models for text segmentation, Machine Learning, Special issue on natural language learning, Volume 34, Issue 1-3, Pages 177–210, 1999

● [3] Freddy Y. Y. Choi: Advances in domain independent linear text segmentation, Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL-00), pp. 26–33, 2000

● [4] Jacob Eisenstein, Regina Barzilay: Bayesian Unsupervised Topic Segmentation, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), Pages 334-343, 2008

● [5] Jacob Eisenstein, Regina Barzilay, Randall Davis: Gestural Cohesion for Topic Segmentation, ACL 2008, Pages 852-860, 2008

● [6] Maria Eskevich, Gareth J. F. Jones: DCU at MediaEval 2011: Rich Speech Retrieval, MediaEval 2011

● [7] Martin Franz, Bhuvana Ramabhadran, Todd Ward, Michael Picheny: Automated Transcription and Topic Segmentation of Large Spoken Archives, In Proceedings of Eurospeech, 2003

● [8] Michel Galley, Kathleen McKeown: Discourse Segmentation of Multi-Party Conversation, in 41st Annual Meeting of ACL, 2003

● [9] M. A. K. Halliday, Ruqaiya Hasan: Cohesion in English, 1976

● [10] Marti A. Hearst: TextTiling: A Quantitative Approach to Discourse Segmentation, Technical Report, 1993

● [11] Winston Hsu, Shih-Fu Chang, Chih-Wei Huang, Lyndon Kennedy, Ching-Yung Lin, Giridharan Iyengar: Discovery and Fusion of Salient Multi-modal Features towards News Story Segmentation, In IS&T/SPIE Electronic Imaging, 2004

● [12] Pei-yun Hsueh, Johanna D. Moore: Combining Multiple Knowledge Sources for Dialogue Segmentation in Multimedia Archives, ACL 2007, 2007

● [13] Minwoo Jeong, Ivan Titov: Multi-document Topic Segmentation, Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM '10), Pages 1119-1128, 2010

● [14] David Kauchak, Francine Chen: Feature-Based Segmentation of Narrative Documents, Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in Natural Language Processing (FeatureEng '05), Pages 32-39, 2005

● [15] Hideki Kozima: Text Segmentation Based On Similarity Between Words, Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL '93), Pages 286-288, 1993

● [16] Niraj Kumar, Piyush Rai, Chandrika Pulla, C. V. Jawahar: Video Scene Segmentation with a Semantic Similarity, Proceedings of the 5th Indian International Conference on Artificial Intelligence (IICAI 2011), 14-16 December 2011, Bangalore, India, 2011

● [17] Alexandre Labadié, Violaine Prince: Lexical and semantic methods in inner text topic segmentation: A comparison between c99 and Transeg, Proceedings of the 13th International Conference on Natural Language and Information Systems: Applications of Natural Language to Information Systems (NLDB '08), Pages 347–349, 2008

● [18] James Llinas, Christopher Bowman, Galina Rogova, Alan Steinberg, Frank White: Revisiting the JDL Data Fusion Model II, In P. Svensson and J. Schubert (Eds.), Proceedings of the Seventh International Conference on Information Fusion (FUSION 2004), 2004

● [19] Igor Malioutov, Regina Barzilay: Minimum Cut Model for Spoken Lecture Segmentation, Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (ACL-44), Pages 25-32, 2006

● [20] Igor Malioutov, Alex Park, Regina Barzilay, James Glass: Making Sense of Sound: Unsupervised Topic Segmentation over Acoustic Input, In Proceedings of ACL, 2007

● [21] Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: Introduction to Information Retrieval, 2008

● [22] Stéphane Marchand-Maillet: Multimedia Information Retrieval, PROMISE Winter School, 2012

● [23] Jane Morris, Graeme Hirst: Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of Text, Computational Linguistics, Volume 17, Issue 1, Pages 21-48, 1991

● [24] John Niekrasz, Johanna Moore: Participant Subjectivity and Involvement as a Basis for Discourse Segmentation, Proceedings of the SIGDIAL 2009 Conference: The 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL '09), Pages 54-61, 2009

● [25] Rebecca J. Passonneau, Diane J. Litman: Discourse Segmentation by Human and Automated Means, Computational Linguistics, Volume 23, Issue 1, Pages 103-139, 1997

● [26] Raúl Abella Pérez, José Eladio Medina Pagola: An Incremental Text Segmentation by Clustering Cohesion, Proceedings of the 15th Iberoamerican Congress on Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications (CIARP'10), Pages 261-268, 2010

● [27] Lev Pevzner, Marti A. Hearst: A Critique and Improvement of an Evaluation Metric for Text Segmentation, Computational Linguistics, Volume 28, Issue 1, Pages 19-36, 2002

● [28] Jay M. Ponte, W. Bruce Croft: Text Segmentation by Topic, In Proceedings of the First European Conference on Research and Advanced Technology for Digital Libraries, 1997

● [29] Laritza Hernández Rojas, José E. Medina Pagola: A Novel Method of Segmentation by Topic Using Lower Windows and Lexical Cohesion, Proceedings of the 12th Iberoamerican Conference on Progress in Pattern Recognition, Image Analysis and Applications (CIARP'07), Pages 724-733, 2007

● [30] Elizabeth Shriberg, Andreas Stolcke, Dilek Hakkani-Tür, Gökhan Tür: Prosody-Based Automatic Segmentation of Speech into Sentences and Topics, Speech Communication, Special issue on accessing information in spoken audio, Volume 32, Issue 1-2, Pages 127–154, 2000

● [31] Fei Song, William M. Darling, Adnan Duric, Fred W. Kroon: An Iterative Approach to Text Segmentation, Proceedings of the 33rd European Conference on Advances in Information Retrieval (ECIR'11), Pages 629-640, 2011

● [32] Gökhan Tür, Andreas Stolcke, Dilek Hakkani-Tür, Elizabeth Shriberg: Integrating Prosodic and Lexical Cues for Automatic Topic Segmentation, Computational Linguistics, Vol. 27, No. 1, pp. 31-57, 2001

● [33] Masao Utiyama, Hitoshi Isahara: A Statistical Model for Domain-Independent Text Segmentation, In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics, 2001

Thank you