Experiments with Annotating Discourse Relations in the Hindi Discourse Relation Bank (HDRB)
Umangi Oza†, Rashmi Prasad‡, Sudheer Kolachina†, Suman Meena§, Dipti Misra Sharma†, Aravind Joshi‡
† Language Technologies Research Center, International Institute of Information Technology, Hyderabad, India
§ Center for Language, Literature and Cultural Studies, Jawaharlal Nehru University, New Delhi, India
‡ Institute for Research in Cognitive Science / Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA

  • Slide 1
  • Experiments with Annotating Discourse Relations in the Hindi Discourse Relation Bank (HDRB)
    Umangi Oza, Rashmi Prasad, Sudheer Kolachina, Suman Meena, Dipti Misra Sharma, Aravind Joshi
    Language Technologies Research Center, International Institute of Information Technology, Hyderabad, India
    Center for Language, Literature and Cultural Studies, Jawaharlal Nehru University, New Delhi, India
    Institute for Research in Cognitive Science / Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA
  • Slide 2
  • Introduction: Why Discourse?
    For many NLP applications, such as question answering, text summarization, and language generation, sentence-level analysis derived from an annotated corpus is insufficient, e.g.,
      Penn Treebank (PTB) (Marcus et al., 1993)
      Propbank (Palmer et al., 2005)
    Need for discourse-level information
      Penn Discourse Treebank (Prasad et al., 2008)
    Annotation over the same WSJ raw corpus as PTB and Propbank has resulted in an enriched annotated resource
    The browser allows the viewing of both annotations
    December 2009, ICON, HDRB, Umangi et al.
  • Slide 3
  • Penn Discourse TreeBank (PDTB) (Prasad et al., 2008)
    Large-scale corpus of lexically grounded annotations of discourse relations between abstract objects (AOs)
      Discourse relations: cause, contrast, elaboration, etc.
      Abstract objects: eventualities and propositions (Asher, 1993)
    Discourse relation triggers:
      Explicit connectives: closed-class expressions from well-defined grammatical classes
      Alternative Lexicalizations (AltLex): expressions not definable as explicit connectives
      Implicit connectives: inferred relations for which connectives are inserted
  • Slide 4
  • PDTB
    When no discourse relation can be inferred:
      EntRel (an entity-based coherence relation)
      NoRel (no discourse relation)
    Abstract object arguments of a discourse relation are called Arg1 and Arg2
      Arg2 goes with the clause/AO in which the connective occurs
    Minimality Principle: select only as much of the argument text span as is minimally necessary to interpret the relation
    Sense annotation: each relation is assigned a sense label based on a hierarchical sense classification scheme
  • Slide 5
  • PDTB examples
    Convention: Arg1 in [] and Arg2 in {}
    Explicit connective: [By most measures, the nation's industrial sector is now growing very slowly]. Factory payrolls fell in September. So did the Federal Reserve Board's industrial production index. Yet, {many economists aren't predicting that the economy is about to slip into recession.} (sense: Concession)
    AltLex: [Under a post-1987 crash reform, the Chicago Mercantile Exchange wouldn't permit the December S&P futures to fall further than 12 points for an hour.] {That caused a brief period of panic selling of stocks on the Big Board.} (sense: Result)
    Implicit connective: [The voters, as well as numerous Latin American and East European countries that hope to adopt the Spanish model, are supporting the direction Spain is taking.] IMPLICIT=SO {It would be sad for Mr. Gonzalez to abandon them to appease his foes.} (sense: Result)
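The annotation elements introduced so far (relation type, connective, Arg1, Arg2, sense) can be pictured as one record per relation. The following is only a minimal Python sketch: the class name `Relation`, its field names, and the `RELATION_TYPES` set are illustrative assumptions, not the PDTB or HDRB file format.

```python
from dataclasses import dataclass
from typing import Optional

# Relation types named on the slides. Representing them as a Python set is
# an assumption made for this sketch.
RELATION_TYPES = {"Explicit", "AltLex", "Implicit", "EntRel", "NoRel"}


@dataclass
class Relation:
    """One annotated relation (hypothetical record layout)."""
    rel_type: str                     # one of RELATION_TYPES
    arg1: str                         # text span shown in [] on the slides
    arg2: str                         # text span shown in {} on the slides
    connective: Optional[str] = None  # explicit, AltLex, or inserted implicit connective
    sense: Optional[str] = None       # label from the sense hierarchy, if any

    def __post_init__(self) -> None:
        # Reject relation types that the slides do not mention.
        if self.rel_type not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {self.rel_type!r}")


# The explicit-connective example from the slide above.
explicit = Relation(
    rel_type="Explicit",
    arg1="By most measures, the nation's industrial sector is now growing very slowly",
    arg2="many economists aren't predicting that the economy is about to slip into recession.",
    connective="Yet",
    sense="Concession",
)
print(explicit.rel_type, explicit.connective, explicit.sense)
```

Keeping the connective and sense optional lets the same record hold EntRel and NoRel cases, where no connective is annotated.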
  • Slide 6
  • Hindi Discourse Relation Bank (HDRB)
    HDRB aims at creating a large-scale annotated corpus of discourse relations in Hindi texts, following the PDTB approach.
    Corpus:
      A 200K corpus drawn from the 400K corpus on which Hindi syntactic dependency annotation is being conducted independently (Begum et al., 2008)
      Multi-domain newspaper corpus
    Other cross-linguistic discourse annotation projects:
      Chinese (Xue, 2005)
      Czech (Mladova et al., 2008)
      Turkish (Zeyrek and Webber, 2008)
  • Slide 7
  • Syntactic Classes of Explicit Connectives
    Explicit connectives are closed-class expressions drawn from a set of well-defined grammatical classes:
      Subordinating conjunctions
      Coordinating conjunctions
      Adverbials
      Pied-piped sentential relativizers
      Subordinators
      Particles
  • Slide 8
  • Subordinating Conjunctions
    Lexical items conjoining finite adverbial clauses to their matrix clause
    Typically occur clause-initially
    Both single forms (e.g., the connective glossed 'because') and paired forms (e.g., 'if ... then')
    Example (Cause), English translation: [Today the lamp has been lit] because {it is my birthday}.
    Example (Conditional), English translation: If [one were to ask you to quit taking salt] then {even you would not quit}.
  • Slide 9
  • Coordinating Conjunctions
    Lexical items conjoining clauses or phrases of the same syntactic status
    Occur clause-initially, e.g., the connectives glossed 'and' and 'but'
    Single as well as paired forms, e.g., 'not only ... but also'
    Example (Concession), English translation: [There are many groups in the Sangh] but {there is just one ideology.}
  • Slide 10
  • Adverbials
    Adverbial and prepositional phrases claimed to function as anaphoric discourse connectives (Webber et al., 2003)
    Examples include the forms glossed 'so', 'then', 'otherwise', 'in fact', 'just then', and 'in addition to this'
    Example (Expansion), English translation: [The coastal vegetation on the west coast of the Andaman has been completely destroyed due to wild waves]. In addition, {the coral reefs have also been damaged}.
  • Slide 11
  • Pied-piped Sentential Relativizers
    Pied-piped relative phrases that conjoin a relative clause with the predication of its matrix clause (rather than some NP)
    Examples are the forms glossed 'so that' and 'because of which'
    Example (Cause), English translation: [Dropping all his work, he picked up the bird and ran towards the dispensary] so that {it could be given proper treatment}.
    The relative pronoun modifies the event expressed in the matrix clause
  • Slide 12
  • Subordinators
    Postpositions, verbal participles, and suffixes that introduce non-finite clauses
    Example (Succession), English translation: After {Baa left} [he called the boy to him].
    Example (Synchronous), English translation: ...while [playing] {he forgets that if his friend too didn't let him touch his toy, then he would feel very bad too}.
  • Slide 13
  • Particles
    Particles (e.g., the one translated as 'also' below) can function as discourse connectives in Hindi
    Example (Conjunction), English translation: [People see this as a consequence of the improving relation between the two countries]. {The Kashmiris are} also {learning a political lesson from this}.
    Only instances in which the particle indicates the inclusion of verbs are taken as discourse connectives
    Compare (English translation): He didn't eat anything.
  • Slide 14
  • Arguments of Discourse Relations
    In PDTB, Arg2 is the argument syntactically associated with the connective, and Arg1 is the other argument.
    In HDRB, argument naming is based on the sense of the relation.
      Each relation definition specifies its own convention for argument naming.
      E.g., in the cause relation, one argument is the cause and the other is the effect. HDRB convention: Arg1 = effect; Arg2 = cause
    Advantages of the semantic naming scheme: it is more meaningful, and it simplifies the sense classification hierarchy.
  • Slide 15
  • Arguments of Discourse Relations
    Cause after effect, hence Arg1-Arg2
      Example (Cause), English translation: After the competition, Sonal said that [when her name was announced as the winner, she could not believe herself for some time], because {she was thinking that the competition was fixed}.
    Cause before effect, hence Arg2-Arg1
      Example (Cause), English translation: Fashion designers say that the most prevalent thefts or copies are of monopoly designs. {Designers know this fact very well} so [it does not matter to them many times].
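The semantic naming convention can be sketched as a lookup from sense to the semantic role that fills each argument slot. This is a hedged illustration, not HDRB tooling: the function `name_arguments`, the `ARG_ROLES` table, and the role name `enabling_situation` are assumptions; only the Cause convention (Arg1 = effect, Arg2 = cause) and the Goal convention (Arg2 = goal) come from the slides.

```python
# Semantic role that fills (Arg1, Arg2) for each sense.
# Cause: Arg1 = effect, Arg2 = cause (stated on the slides).
# Goal: Arg2 = goal, Arg1 = the other argument; the role name
# "enabling_situation" is a hypothetical label for that other argument.
ARG_ROLES = {
    "Cause": ("effect", "cause"),
    "Goal": ("enabling_situation", "goal"),
}


def name_arguments(sense, spans):
    """Assign Arg1/Arg2 by semantic role, independent of linear order.

    spans: list of (role, text) pairs in the order they occur in the text.
    Returns a dict with the Arg1 text, the Arg2 text, and the linear order.
    """
    arg1_role, arg2_role = ARG_ROLES[sense]
    by_role = dict(spans)
    order = "-".join("Arg1" if role == arg1_role else "Arg2" for role, _ in spans)
    return {"Arg1": by_role[arg1_role], "Arg2": by_role[arg2_role], "order": order}


# Cause after effect (the 'because' example): linear order Arg1-Arg2.
because_example = name_arguments("Cause", [
    ("effect", "she could not believe herself for some time"),
    ("cause", "she was thinking that the competition was fixed"),
])

# Cause before effect (the 'so' example): linear order Arg2-Arg1.
so_example = name_arguments("Cause", [
    ("cause", "Designers know this fact very well"),
    ("effect", "it does not matter to them many times"),
])

print(because_example["order"], so_example["order"])
```

In both calls the effect ends up as Arg1 and the cause as Arg2; only the recorded linear order differs.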
  • Slide 16
  • Implicit Discourse Relations
    For adjacent sentences not related by an explicit connective, four possibilities are considered in order:
    (1) Infer a discourse relation and insert an implicit connective between them
      Example (Causal), English translation: {All the players in this game are greater than even Sachin Tendulkar} IMPLICIT = so [it is not possible for anyone to get them clean bowled.]
    (2) If a relation is inferred but insertion of a connective leads to redundancy, find and annotate an alternative lexicalization (AltLex) of the relation
      Example (AltLex), English translation: {Bangladesh's judiciary has seen an improvement}. That is why [India has decided to participate in the conference.]
  • Slide 17
  • Other Relations
    (3) If no discourse relation is inferred but coherence results from an entity-based relation, annotate the relation as EntRel.
      Example (EntRel), English translation: [Prakash Jha's latest film Apaharan will be premiered at the film festival.] {This is Jha's second film on a different subject after Gangajal.}
    (4) If no discourse relation or EntRel is perceived, annotate the relation as NoRel
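The four possibilities for adjacent sentences without an explicit connective form an ordered decision procedure. A minimal Python sketch of that ordering follows; the function name and its boolean parameters are assumptions, and the judgments themselves (whether a relation is inferred, whether inserting a connective is redundant, whether there is entity-based coherence) remain the annotator's and are simply passed in.

```python
def classify_adjacent_pair(relation_inferred, insertion_redundant=False,
                           entity_based=False):
    """Return the relation type for two adjacent sentences that are not
    linked by an explicit connective, following the order on the slides."""
    if relation_inferred:
        # (1) A connective can be inserted: Implicit.
        # (2) Insertion would be redundant: annotate an AltLex instead.
        return "AltLex" if insertion_redundant else "Implicit"
    if entity_based:
        # (3) No discourse relation, but entity-based coherence.
        return "EntRel"
    # (4) Neither a discourse relation nor an EntRel.
    return "NoRel"


print(classify_adjacent_pair(True))                            # Implicit
print(classify_adjacent_pair(True, insertion_redundant=True))  # AltLex
print(classify_adjacent_pair(False, entity_based=True))        # EntRel
print(classify_adjacent_pair(False))                           # NoRel
```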
  • Slide 18
  • HDRB Sense Classification
    Adapted from PDTB
  • Slide 19
  • Results
    Annotated 35 texts from the HDRB corpus
    Total of 602 relations annotated (both explicit and implicit)
    Overall distribution of relation types
    Comparison with PDTB distributions (Prasad et al., 2008)
  • Slide 20
  • Types and Tokens of Discourse Relations
    Lexical strategies are employed as often as morphological marking
    A design difference between the two projects (implicit relations are annotated between all adjacent sentences in HDRB, unlike PDTB) is probably the reason for the different relative proportions of explicit and implicit relations across the two corpora
    The higher proportion of AltLex compared to PDTB suggests that Hindi makes greater use of cohesive strategies to link with the prior discourse
  • Slide 21
  • Senses of Discourse Relations
    Sense distributions are similar cross-linguistically
    Expansion and Contingency relations are less likely to be explicit than Comparison and Temporal relations
  • Slide 22
  • Additional Exploration of Discourse Adverbials
    Discourse adverbials are argued to be anaphoric (Webber et al., 2003), so their arguments may be harder to identify than those of other types of connectives
    Investigated the distributions of two discourse adverbials to explore the extent of difficulty in resolving their arguments
      Contrastive adverbial (glossed 'nevertheless')
      Conjunctive adverbial (glossed 'in addition')
    Observations
      One of the two adverbials had non-adjacent LHArgs in 16% of cases
      The other always took adjacent arguments
    Thus, despite their anaphoric properties, some discourse adverbials seem to be more constrained than others, and therefore easier to resolve
  • Slide 23
  • Summary
    Adapting the PDTB scheme to Hindi discourse annotation led to:
      Identification of syntactic categories for explicit connectives that appear to be more frequent than in English (particles, sentential relatives)
      A more meaningful and simplified sense classification hierarchy
    Some of our observations from the initial annotations:
      The correlation between the use of cohesive strategies and the morphological richness of a language is not completely settled by our annotations so far. Perhaps the study of languages with morphology richer than Hindi may shed further light on this issue
      Sense distributions in PDTB and HDRB were similar, in line with the expectation that senses do not differ cross-linguistically
      Annotation of discourse adverbials shows further evidence of the locality of arguments, which can significantly benefit anaphora resolution for connectives
  • Slide 24
  • Thank you
  • Slide 25
  • Questions?
  • Slide 26
  • Back-up slides
  • Slide 27
  • Subordinators
    (11) English translation: Upon {hearing Baa's words}, [Gandhiji felt very ashamed].
    Some instances of subordinators are not discourse connectives, such as when they denote the manner of an action (Ex. 12)
    (12) [Lit.] He caught Baa's hand and took her to the door by dragging her.
    Preliminary annotation experiments suggest that distinguishing the discourse and non-discourse usage of subordinators is a difficult task
      They will be annotated in a later phase of the project
  • Slide 28
  • Arguments of Discourse Relations
    In PDTB, the assignment of Arg1/Arg2 labels is syntactically driven
    In HDRB, it is semantically driven, i.e., based on the sense of the relation to which the arguments belong
    In examples 15 and 16, both relations have the sense cause.
      Cause sense definition: one cause and one effect
    In 15, the cause comes after the effect, hence Arg1-Arg2
    (15) English translation: After the competition, Sonal said that [when her name was announced as the winner, she could not believe herself for some time], because {she was thinking that the competition was fixed}.
  • Slide 29
  • Arguments of Discourse Relations
    In 16, the cause comes before the effect (Arg2-Arg1)
    (16) English translation: Fashion designers say that the most prevalent thefts or copies are of monopoly designs. {Designers know this fact very well} so [it does not matter to them many times].
    According to the PDTB convention, both would have been Arg1-Arg2 (syntactic argument order)
    The semantics-based convention has the added advantage of simplifying the sense classification scheme
  • Slide 30
  • Implicit Discourse Relations
    If a relation can be inferred between sentences, an implicit connective is inserted
    Insertable connectives are drawn primarily from the list of explicit connectives, but can include others too
    (17) English translation: {All the players in this game are greater than even Sachin Tendulkar} IMPLICIT = so [it is not possible for anyone to get them clean bowled.]
    In this example, an implicit connective expressing a causal relation is inserted
  • Slide 31
  • Implicit Discourse Relations
    If a discourse relation can be inferred but insertion of a connective leads to redundancy in the expression of the relation, this suggests that the second sentence of the pair contains an alternatively lexicalized non-connective expression: AltLex
    AltLex is not a closed-class element
    (18) English translation: {Bangladesh's judiciary has seen an improvement}. That is why [India has decided to participate in the conference.] (AltLex)
  • Slide 32
  • Implicit Discourse Relations
    If no discourse relation can be inferred, identify an entity-based relation (EntRel) across the two sentences
      The second sentence provides further description about an entity (or entities) from the previous sentence
    (19) English translation: [Prakash Jha's latest film Apaharan will be premiered at the film festival.] {This is Jha's second film on a different subject after Gangajal.} (EntRel)
      The only purpose of the second sentence is to provide additional information about Jha's second film
    If there is neither a discourse relation nor an EntRel, then NoRel (no relation)
  • Slide 33
  • Points of Departure from PDTB Sense Scheme
    Elimination of senses due to syntactic argument-naming conventions (as per the HDRB semantic naming convention for arguments)
    Restricted back-offs in the sense hierarchy (PDTB allowed back-offs up to the top level, but the HDRB view is that top-level senses are too coarse-grained to be useful; back-offs are therefore allowed only to the second level)
    Uniform treatment of pragmatic relations into more refined senses (epistemic, speech-act, propositional (Sweetser, 1990))
    Addition of the Goal sense (it was included as a causal relation in PDTB)
  • Slide 34
  • Elimination of argument-specific labels
    While some of the subtype distinctions do represent the arguments' relative semantic roles, others continue to be refinements of the relation's semantics
    Furthermore, it was observed that the distinctions expressed by the subtype labels were related to the variation in the linear order of the arguments
      E.g., the reason and result subtypes under Contingency.cause: reason (cause after effect) and result (cause before effect)
    In HDRB, the assignment of argument labels is semantically driven, and therefore these subtype labels are eliminated to avoid inconsistencies
  • Slide 35
  • Restricted back-offs
    In PDTB, annotators were allowed to back off to higher levels in the hierarchy when they found it difficult to identify the more refined senses at the lower levels
      For example, Comparison at the class level instead of Comparison.Contrast or Comparison.Concession in case of ambiguity between Contrast and Concession
    In HDRB, such back-offs are restricted to the type level
      Class-level senses are too coarse-grained to be useful
      The guideline is consistent with the fact that argument ordering specifications are provided at the type level
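The back-off restriction can be expressed as a minimum depth on dotted sense labels such as Comparison.Contrast. The sketch below is an assumption-laden illustration: the names `MIN_LEVELS` and `is_allowed_label` are hypothetical, and the check only counts levels; it does not validate labels against the actual sense hierarchy.

```python
# Minimum number of levels a sense label must specify in each scheme:
# PDTB lets annotators back off to the class level (1 level),
# HDRB only to the type level (2 levels).
MIN_LEVELS = {"PDTB": 1, "HDRB": 2}


def is_allowed_label(label, scheme):
    """Check that a dotted sense label is specific enough for the scheme."""
    depth = len(label.split("."))
    return depth >= MIN_LEVELS[scheme]


# Backing off to the class level is acceptable in PDTB but not in HDRB.
print(is_allowed_label("Comparison", "PDTB"))           # True
print(is_allowed_label("Comparison", "HDRB"))           # False
print(is_allowed_label("Comparison.Contrast", "HDRB"))  # True
```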
  • Slide 36
  • Uniform treatment of pragmatic relations
    Pragmatic relations in HDRB are based broadly on the distinction made in PDTB between semantic and pragmatic relations (Sanders et al., 1992)
    Discourse relations are viewed as:
      Semantic when they relate the propositional content of the arguments
      Pragmatic when their relations have to be inferred from the propositional content of the arguments
    In HDRB, the PDTB pragmatic senses are replaced with a uniform three-way classification: epistemic, speech-act, and propositional
  • Slide 37
  • Uniform treatment of pragmatic relations
    Epistemic and speech-act inferences are based on Sweetser's (1990) analysis of polysemous connectives in terms of conceptual domains
    An epistemic interpretation is obtained when the relation involves a conclusion (expressed in one argument) based on some observation (expressed in the other argument)
      John loved Mary, because he came back.
    A speech-act interpretation is obtained when the relation is between a speech act and the speaker's justification for performing it
      What are you doing tonight, because there's a good movie on.
    In both, the relation is a pragmatic one, since it involves the inference of a modality, either epistemic (e.g., conclude(speaker, X)) or speech-act (e.g., ask(speaker, X)), that takes scope over the propositional content of one of the arguments (X)
  • Slide 38
  • Uniform treatment of pragmatic relations
    Propositional inference involves the inference of a complete proposition
    The relation is taken to hold between this inferred proposition and the propositional content of one of the arguments
    The example below illustrates pragmatic concession of the propositional subtype
    (20) English translation: [One of the drivers denied his involvement in the issue in spite of his knowledge about the weapons]. But {the court said that had he informed the police on time, the blast could have been prevented}.
  • Slide 39
  • The Goal sense
    Under the Contingency class, a new type, Goal, has been added
    It applies to relations where the situation described in one of the arguments is the goal of the situation described in the other argument (which enables the achievement of the goal)
    The argument describing the goal is marked as Arg2, and the other argument is marked as Arg1 (Ex. 21)
    (21) English translation: Subhash has alleged that [the RJD chief wants to give a ticket to Rana] so that {he does not become a government witness in the fodder scam trial}.
    In PDTB, goal is subsumed by the result subtype
    Distinguishing between cause and goal has important consequences, for example, in the way questions are formulated over the relation
  • Slide 40
  • Additional Exploration of Discourse Adverbials
    (22) English translation: [The coastal vegetation on the west coast of the Andaman has been completely destroyed due to wild waves]. In addition, {the coral reefs have also been damaged}.
    (23) English translation: [Raha was avoiding the formalities from the beginning itself.] Of all the oil PSUs, ONGC is the only company which has not even signed the agreement on the profit, loss, etc. of this fiscal year. In addition, {Raha had also refused to participate in the PSU review meeting}.
    Examples 22 and 23 illustrate adjacent and non-adjacent arguments of isalAvA, respectively
  • Slide 41
  • References
    Florian Wolf and Edward Gibson. 2005. Representing Discourse Coherence: A Corpus-Based Study. Computational Linguistics, 31(2):249-288.
    Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. 2001. Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory. In Proceedings of the 2nd SIGDIAL Workshop on Discourse and Dialogue, Eurospeech 2001, Denmark, September 2001.
    Nicholas Asher. 1993. Reference to Abstract Objects in Discourse. Kluwer, Dordrecht.
    Rafiya Begum, Samar Husain, Arun Dhwaj, Dipti Mishra Sharma, Lakshmi Bai, and Rajeev Sangal. 2008. Dependency annotation scheme for Indian languages. In Proceedings of IJCNLP-2008.
    Alistair Knott. 1996. A Data-driven Methodology for Motivating a Set of Coherence Relations. Ph.D. thesis, Department of Artificial Intelligence, University of Edinburgh.
  • Slide 42
  • References
    Yamuna Kachru. 2006. Hindi. John Benjamins Publishing Co., Amsterdam.
    James R. Martin. 1992. English text: System and structure. Benjamins, Amsterdam.
    Lucie Mladova, Sarka Zikanova, and Eva Hajicova. 2008. From sentence to discourse: Building an annotation scheme for discourse based on Prague dependency treebank. In Proceedings of LREC-2008.
    Rashmi Prasad, Nikhil Dinesh, Alan Lee, Aravind Joshi, and Bonnie Webber. 2007. Attribution and its annotation in the Penn discourse treebank. Traitement Automatique des Langues, Special Issue on Computational Approaches to Document and Discourse, 47(2).
    Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse Treebank 2.0. In Proceedings of LREC-2008.
    Chaturbhuj Sahay. 2007. Hindi Padvigyaan. Kumar Prakashan, Agra.
  • Slide 43
  • References
    Ted J. M. Sanders, Wilbert P. M. Spooren, and Leo G. M. Noordman. 1992. Toward a taxonomy of coherence relations. Discourse Processes, 15:1-35.
    Eve Sweetser. 1990. From etymology to pragmatics: Metaphorical and cultural aspects of semantic structure. Cambridge University Press.
    Bonnie Webber and Aravind Joshi. 1998. Anchoring a lexicalized tree-adjoining grammar for discourse. In Proceedings of the ACL/COLING Workshop on Discourse Relations and Discourse Markers.
    Bonnie Webber, Aravind Joshi, Matthew Stone, and Alistair Knott. 2003. Anaphora and discourse structure. Computational Linguistics, 29(4):545-587.
    Nianwen Xue. 2005. Annotating discourse connectives in the Chinese treebank. In Proceedings of the ACL Workshop on Frontiers in Corpus Annotation II: Pie in the Sky.
    Deniz Zeyrek and Bonnie Webber. 2008. A discourse resource for Turkish: Annotating discourse connectives in the METU corpus. In Proceedings of IJCNLP-2008.