Tell us: what is the first step you would take to comprehend the passage below?
Slumdog Millionaire, the latest megahit film, tells the rags-to-riches story of a slum dweller. The movie, an adaptation of a novel, is based on the popular Indian version of the American game show "Who Wants to Be a Millionaire", which was well received by the masses.
Freida Pinto is the heroine of the movie. She hails from Mumbai. Even though it was her debut movie, her exemplary performance has brought her offers for many Hollywood movies.
Slumdog received numerous accolades from all over the world. Apart from the Oscars, some notable ones were the Toronto International Film Festival, Cannes, etc.
CS 626 - Group 1 Dept of CSE -IIT Bombay 1
Discourse Segmentation
CS 626 Course Seminar, Dept of CSE, IIT Bombay
Group 1: Sriraj (08305034), Dipak (08305901), Balamurali (08405401)
The way we go….
• Introduction
• Motivation
• TextTiling
• Context Vectors and Segmentation
• Lexical Chains and Segmentation
• Segmentation with LSA
• Conclusion
• References
INTRODUCTION
Discourse comes from the Latin word 'discursus'.
Discourse: "A continuous stretch of (especially spoken) language larger than a sentence, often constituting a coherent unit such as a sermon, argument, joke, or narrative" (Crystal 1992).
Discourse ranges from novels down to short conversations or even groans (cries).
Beaugrande's definition of discourse (standards of textuality)
• Cohesion - the grammatical relationships between parts of a text that are essential for its interpretation;
• Coherence - the statements relate to one another in sense, forming a meaningful whole;
• Intentionality - the message has to be conveyed deliberately and consciously;
• Acceptability - the communicative product needs to be satisfactory in that the audience approves it;
• Informativeness - some new information has to be included in the discourse;
• Situationality - the circumstances in which the remark is made are important;
• Intertextuality - reference to the world outside the text or to the interpreters' schemata.
DISCOURSE STRUCTURE - SALIENT FEATURES
• Existence of a hierarchy.
• Segmentation at the semantic level.
• Domain-specific knowledge.
DISCOURSE SEGMENTATION
"Partition of full-length text into coherent multi-paragraph units" - Marti Hearst
MOTIVATION
• Text Summarization
• Question Answering
• Sentiment Analysis
• Topic Detection
TEXTTILING
Uses the TF-IDF concept within a single document.
Analogy - what the document is to the entire corpus in IR, the block is to the entire document here.
A term used more often inside a block weighs more.
Adjacent blocks sharing many related terms is evidence of strong cohesion.
CONTD...
Algorithm -
• Divide the text into blocks (say, k sentences long).
• Compute the cosine similarity between adjacent blocks:

  cos(b1, b2) = Σ_t w(t,b1) · w(t,b2) / √( Σ_t w(t,b1)² · Σ_t w(t,b2)² )

  where w(t,b) is the weight of term t in block b.
• Plot the smoothed, interpolated similarity against the sentence gap number.
• The lowermost portions of the valleys are the boundaries.
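The steps above can be sketched in a few lines of Python. The function names and the simple valley test are illustrative assumptions; Hearst's actual implementation additionally smooths the similarity curve and ranks gaps by depth score.

```python
import math
import re
from collections import Counter

def cosine(b1, b2):
    # cos(b1,b2) = sum_t w(t,b1)*w(t,b2) / sqrt(sum_t w(t,b1)^2 * sum_t w(t,b2)^2)
    num = sum(b1[t] * b2[t] for t in b1)
    den = math.sqrt(sum(v * v for v in b1.values()) *
                    sum(v * v for v in b2.values()))
    return num / den if den else 0.0

def text_tiling(sentences, k=2):
    """Return block-gap indices whose similarity is a local minimum (valley)."""
    # each block is a raw term-frequency vector over k consecutive sentences
    blocks = [Counter(w for s in sentences[i:i + k]
                      for w in re.findall(r"\w+", s.lower()))
              for i in range(0, len(sentences), k)]
    sims = [cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]
    # a gap scoring below both neighbours marks a candidate boundary
    return [i for i in range(1, len(sims) - 1)
            if sims[i] < sims[i - 1] and sims[i] < sims[i + 1]]
```

On a toy text whose vocabulary shifts mid-way (e.g. cat sentences followed by car sentences), the similarity dips at the topic change and the gap is reported as a boundary.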
TEXTTILING - WHAT WENT WRONG??
The same word need not be repeated - but a similar word could be.
WSD was not performed - polysemy issues.
Contextual information was not considered.
CONTEXT VECTORS & SEGMENTATION
Capture contextual information in different blocks.
Steps:
• Encoding of contextual information - context vector creation.
• Creation of block vectors.
• Measurement of similarity - instead of TF-IDF weights, use context vectors:

  cos(v, w) = Σ_t v_t · w_t / √( Σ_t v_t² · Σ_t w_t² )
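The first two steps can be sketched as follows. The window size and the helper names `build_context_vectors` / `block_vector` are illustrative assumptions, not Kaufmann's exact procedure; the point is that words used in similar contexts (e.g. synonyms) end up with similar vectors, so blocks can match even without repeated words.

```python
from collections import Counter, defaultdict

def build_context_vectors(tokens, window=2):
    """Map each word type to a co-occurrence Counter over a +/- window."""
    vecs = defaultdict(Counter)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                vecs[w][tokens[j]] += 1
    return vecs

def block_vector(tokens, vecs):
    """A block's vector is the sum of the context vectors of its tokens."""
    v = Counter()
    for w in tokens:
        v.update(vecs[w])
    return v
```

In the toy text "the car engine roared the automobile engine roared", the words "car" and "automobile" never co-occur, yet they receive identical context vectors because they appear in the same neighbourhoods.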
DID IT DO THE TRICK?
Yes!
• Precision increased from 32% to 52%.
• Recall increased from 40% to 51%.
Let's try to improve a bit more!
LEXICAL CHAINS
• A technique for computing lexical cohesion.
• A sequence of related words in the text.
• Independent of the grammatical structure.
• Provides a context for disambiguation.
• Enables identification of the concept.
Different forms of Lexical Cohesion
• Repetition
• Repetition through synonymy
  – police, officers
• Word association through
  – Specialization/Generalization
    • murder weapon, knife
  – Part-whole/whole-part relationships
    • committee, members
• Statistical association between words
  – Osama Bin Laden and the World Trade Center
How
• Uses an auxiliary resource (WordNet) to cluster words into sets of related concepts.
• Areas of low cohesive strength are good indicators of topic boundaries.
• Process:
  – Tokeniser
  – Lexical chainer
  – Boundary detector
Process
• Tokenizer
  – POS tagging is done.
  – Morphological analysis is done.
• Lexical chainer
  – Finds relations between tokens.
  – Single-pass clustering.
  – The first token starts the first chain.
  – Each token is added to the most recently updated chain with which it shares the strongest relationship.
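The single-pass chaining step can be sketched as below. A toy relatedness table stands in for WordNet, and ties are broken by recency rather than by relation strength, so this is a simplification of the chainer described above.

```python
# Toy stand-in for WordNet-derived relations (illustrative assumption).
RELATED = {("police", "officer"), ("committee", "member")}

def related(a, b):
    return a == b or (a, b) in RELATED or (b, a) in RELATED

def chain_tokens(tokens):
    """Single-pass clustering: each token joins the most recently updated
    chain it relates to, otherwise it starts a new chain."""
    chains = []  # the most recently updated chain is kept at the end
    for tok in tokens:
        for i in range(len(chains) - 1, -1, -1):  # scan newest chain first
            if any(related(tok, w) for w in chains[i]):
                chain = chains.pop(i)
                chain.append(tok)
                chains.append(chain)  # move to the "most recent" position
                break
        else:
            chains.append([tok])
    return chains
```

For the token stream ["police", "officer", "committee", "member", "police"], two chains emerge: one for the police concept and one for the committee concept, with the later "police" rejoining its earlier chain.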
Process Contd...
• Boundary detection
  – A high concentration of chains beginning and ending between two adjacent textual units signals a boundary.
  – Boundary strength: w(n, n+1) = E * S
    • E = number of lexical chains whose span ends at sentence n
    • S = number of chains that begin their span at sentence n+1
  – Take the mean of all non-zero scores.
  – This mean acts as the minimum allowable boundary strength.
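The scoring rule above can be sketched directly; representing each chain as a (start_sentence, end_sentence) span is an assumption of this sketch.

```python
def boundary_strengths(chains, num_sentences):
    """Score each gap (n, n+1) as w = E * S; gaps at or above the mean of
    the non-zero scores are proposed as topic boundaries."""
    scores = {}
    for n in range(num_sentences - 1):
        E = sum(1 for start, end in chains if end == n)        # chains ending at n
        S = sum(1 for start, end in chains if start == n + 1)  # chains starting at n+1
        scores[n] = E * S
    nonzero = [s for s in scores.values() if s > 0]
    threshold = sum(nonzero) / len(nonzero) if nonzero else 0
    return [n for n, s in scores.items() if s > 0 and s >= threshold]
```

With chains spanning sentences (0,2), (1,2), (3,5), and (3,4), two chains end at sentence 2 and two begin at sentence 3, so the gap between sentences 2 and 3 is the proposed boundary.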
And the Improvement is …
• Evaluation metrics: Precision and Recall

                        Precision   Recall
  SeLeCT                   36.6      62.7
  JTextTile                13.3      19.7
  Random Segmentation       7.1       7.1
Problems with Frequency Vector Based Similarity
Short passages:
• The similarity estimate is inaccurate for short passages.
• An additional occurrence of a common word (reflected in the numerator) causes a disproportionate increase in sim(x, y) unless the denominator is large.

  sim(x, y) = Σ_j f(x,j) · f(y,j) / √( Σ_j f(x,j)² · Σ_j f(y,j)² )

  where f(x,j) is the frequency of term j in passage x.
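A tiny demo of the instability described above: on passages only two or three words long, one extra shared word moves the score by a large amount. The passages are made up for illustration.

```python
import math

def sim(x, y):
    # sim(x,y) = sum_j f(x,j)*f(y,j) / sqrt(sum_j f(x,j)^2 * sum_j f(y,j)^2)
    vocab = set(x) | set(y)
    num = sum(x.count(w) * y.count(w) for w in vocab)
    den = math.sqrt(sum(x.count(w) ** 2 for w in set(x)) *
                    sum(y.count(w) ** 2 for w in set(y)))
    return num / den if den else 0.0

short_a = sim(["the", "cat"], ["the", "dog"])                 # one shared word
short_b = sim(["the", "cat", "ran"], ["the", "dog", "ran"])   # one more shared word
```

Here `short_a` is 0.5 and `short_b` is about 0.67: a single additional matching word shifts the similarity by roughly a sixth of the whole scale, because the short passages keep the denominator small.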
Problems with Frequency Vector Based similarity..cont’d(2)
Term Matching Problem
• Car; Automobile
• Car; Petrol
• Similar/related but distinct words are considered negative evidence.
• Solutions:
  – Stemming
  – Thesaurus/WordNet-based similarity measures
  – Latent Semantic Analysis
Introduction to LSA
• LSA stems from work in IR
• Represents word and passage meaning as high-dimensional vectors in the semantic space
• Does not use humanly constructed dictionaries, knowledge bases, semantic networks, etc.
• Meaning of a word: the average of the meanings of all passages in which it appears.
• Meaning of a passage: the average of the meanings of all the words it contains.
• The cell values are scaled according to a general form of inverse document frequency.
Dimensionality reduction using SVD
Training LSA ...cont'd (2)

  B ≈ U_k Σ_k V_k^T

where B is the m × n word-by-document matrix, U is m × r, Σ is the r × r diagonal matrix of singular values, and V^T is r × n. Keeping only the top k singular values gives the k-dimensional LSA space; row i of U_k Σ_k is the LSA feature vector for word w_i.

Benefits of applying SVD:
• Concise representation - storage and complexity of the similarity matrix are reduced.
• Captures major structural associations between words and documents.
• Noise is removed simply by omitting the less salient dimensions in U.
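The truncation step can be sketched with a tiny word-by-document count matrix. The matrix values are made up for illustration, and the IDF-style scaling mentioned earlier is omitted.

```python
import numpy as np

def lsa_word_vectors(B, k):
    """B (m x n, words x documents) ~ U_k S_k V_k^T; row i of U_k * S_k is
    the k-dimensional LSA feature vector for word i."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U[:, :k] * s[:k]  # broadcasting scales each retained dimension

# "car" and "automobile" never share a document, but both co-occur
# with "engine", so the reduced space can still relate them.
B = np.array([[2.0, 0.0, 1.0],   # car
              [0.0, 2.0, 1.0],   # automobile
              [1.0, 1.0, 2.0]])  # engine
W = lsa_word_vectors(B, k=2)
```

With k equal to the full rank, `U * s @ Vt` reconstructs B exactly; choosing a smaller k discards the weakest dimensions, which is where the noise removal described above comes from.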
Applying LSA
• A sentence s_i is represented by its term-frequency vector f_i, where f_i(j) is the frequency of term j in s_i.
• Meaning of s_i: the frequency-weighted sum of its word vectors,

  s_i = Σ_j f_i(j) · w_j

  where w_j is the LSA feature vector of term j.
• Inter-sentence similarity:

  M_ij = cos(s_i, s_j) = Σ_k s_ik · s_jk / √( Σ_k s_ik² · Σ_k s_jk² )
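These two formulas can be sketched as below; the function names are assumptions, and the word vectors would in practice come from the SVD step described earlier.

```python
import numpy as np

def sentence_vector(freqs, word_vecs):
    """freqs: length-m term-frequency vector f_i; word_vecs: m x k matrix
    of LSA word vectors. Returns sum_j f_i(j) * word_vecs[j]."""
    return freqs @ word_vecs

def similarity_matrix(sent_vecs):
    """M[i, j] = cos(s_i, s_j) for row-stacked sentence vectors."""
    unit = sent_vecs / np.linalg.norm(sent_vecs, axis=1, keepdims=True)
    return unit @ unit.T
```

With identity word vectors and sentences whose term frequencies are [1,0], [0,1], and [1,1], the matrix M gives 0 for the disjoint pair and 1/√2 for each overlapping pair, matching the cosine formula above.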
Significance of k
• Finding the optimal dimensionality is an important step in LSA.
• Hypothetically, the optimal space for the reconstruction has the same dimensionality as the source that generates the discourse.
• The source generates passages by choosing words from a k-dimensional space in such a way that words in the same paragraph tend to be selected from nearby locations.
LSA results
• LSA is twice as accurate as the word-similarity-based co-occurrence vector (error reduced from 22% to 11%).
• LSA values become less accurate as more dimensions are incorporated into the feature vectors.
Conclusion
• TextTiling, context-vector-based similarity, lexical chaining, and LSA are all bag-of-words approaches.
• Bag-of-words approaches are sufficient... to some extent. "LSA makes no use of word order, thus of syntactic relations or logic, or of morphology. Remarkably, it manages to extract reflections of passage and word meanings quite well without these aids, but it must still be suspected of resulting incompleteness or likely error on some occasions" - excerpt from [5].
Contd..
• LSA is purely statistical, whereas the other approaches use some form of external knowledge base in addition to statistical techniques.
• The role of external knowledge matters.
• To move to the next level we need some linguistics.
• We need the right mix of statistical and linguistic approaches to move forward.
Reference
[1]. Hearst, M. A. 1993 Texttiling: a Quantitative Approach to Discourse segmentation. Technical Report. UMI Order Number: S2K-93-24., University of California at Berkeley.
[2]. Kaufmann, S. 1999. Cohesion and collocation: using context vectors in text segmentation. In Proceedings of the 37th Annual Meeting of the Association For Computational Linguistics on Computational Linguistics , Pages 99-107
[3]. Landauer, T. K., Foltz, P. W., & Laham, D. 1998. Introduction to Latent Semantic Analysis. Discourse Processes, 25, pages 259-284.
[4]. Barzilay, Regina and Michael Elhadad. 1997. Using lexical chains for text summarization. In Proceedings of the Intelligent Scalable Text Summarization Workshop (ISTS-97), Madrid, Spain.
[7]. Freddy Y. Y. Choi. 2000. Advances in domain independent linear text segmentation. In Proceedings of NAACL, pages 26-33.
[5]. Freddy Y. Y. Choi, Peter Wiemer-hastings, Johanna Moore. 2001. Latent semantic analysis for text segmentation. In Proceedings of EMNLP, pages 109-117
[6]. Stokes, N., Carthy, J., Smeaton, A.F. 2002. Segmenting Broadcast News Streams Using Lexical Chains. in Proceedings of 1st Starting AI Researchers Symposium (STAIRS 2002), volume 1, pp.145-154.