Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 2804–2819, November 16–20, 2020. ©2020 Association for Computational Linguistics


QADiscourse - Discourse Relations as QA Pairs: Representation, Crowdsourcing and Baselines

Valentina Pyatkin1 Ayal Klein1 Reut Tsarfaty1,2 Ido Dagan1

1 Computer Science Department, Bar Ilan University
2 Allen Institute for Artificial Intelligence

{valpyatkin,ayal.s.klein,ido.k.dagan,reut.tsarfaty}@gmail.com

Abstract

Discourse relations describe how two propositions relate to one another, and identifying them automatically is an integral part of natural language understanding. However, annotating discourse relations typically requires expert annotators. Recently, different semantic aspects of a sentence have been represented and crowd-sourced via question-and-answer (QA) pairs. This paper proposes a novel representation of discourse relations as QA pairs, which in turn allows us to crowd-source wide-coverage data annotated with discourse relations, via an intuitively appealing interface for composing such questions and answers. Based on our proposed representation, we collect a novel and wide-coverage QADiscourse dataset, and present baseline algorithms for predicting QADiscourse relations.

1 Introduction

Relations between propositions are commonly termed Discourse Relations, and their importance to the automatic understanding of the content and structure of narratives has been extensively studied (Grosz and Sidner, 1986; Asher et al., 2003; Webber et al., 2012). The automatic parsing of such relations is thus relevant to multiple areas of NLP research, from extractive tasks such as document summarization to automatic analyses of narratives and event chains (Li et al., 2016; Lee and Goldwasser, 2019).

So far, discourse annotation has been mainly conducted by experts, relying on carefully crafted linguistic schemes. Two cases in point are PDTB (Prasad et al., 2008; Webber et al., 2019) and RST (Mann and Thompson, 1988; Carlson et al., 2002). Such annotation, however, is slow and costly. Crowd-sourcing discourse relations, instead of using experts, can be very useful for obtaining larger-scale training data for discourse parsers.

The executions were spurred by lawmakers requesting action to curb rising crime rates.
  What is the reason lawmakers requested action? - to curb rising crime rates
  What is the result of lawmakers requesting action to curb rising crime rates? - the executions were spurred

I decided to do a press conference [...], and I did that going into it knowing there would be consequences.
  Despite what did I decide to do a press conference? - knowing there would be consequences

Table 1: Sentences with their corresponding Question-and-Answer pairs. The bottom example shows how implicit relations are captured as QAs.

One plausible way for acquiring linguistically meaningful annotations from laymen is using the relatively recent QA methodology, that is, converting a set of linguistic concepts to intuitively simple Question-and-Answer (QA) pairs. Indeed, casting the semantic annotations of individual propositions as narrating a QA pair gained increasing attention in recent years, ranging from QA driven Semantic Role Labeling (QASRL) (He et al., 2015; FitzGerald et al., 2018; Roit et al., 2020) to covering all semantic relations as in QAMR (Michael et al., 2018). These representations were also shown to improve downstream tasks, for example by providing indirect supervision for recent MLMs (He et al., 2020).

In this work we address the challenge of crowd-sourcing information of higher complexity, that of discourse relations, using QA pairs. We present the QADiscourse approach to representing intra-sentential Discourse Relations as QA pairs, and we show that with an appropriate crowd-sourcing setup, complex relations between clauses can be effectively recognized by non-experts. This layman annotation could also easily be ported to other languages and domains.

Specifically, we define the QADiscourse task to be the detection of the two discourse units, and the labeling of the relation sense between them. The two units are represented in the question body and the answer, respectively, while the question type, as expressed by its prefix, represents the discourse relation sense between them. This representation is illustrated in Table 1 and the types of questions we focus on are detailed in Table 3. This scheme allows us to ask about both explicit and implicit relations. To our knowledge, there has been no work on collecting such question types in a systematic way.

The contribution of this paper is thus manifold. (i) We propose a novel QA-based representation for discourse relations reflecting a subset of the sense taxonomy of PDTB 3.0 (Webber et al., 2019). (ii) We propose an annotation methodology to crowd-source such discourse-relation QA pairs. And, (iii) given this representation and annotation setup, we collected QADiscourse annotations for about 9000 sentences, resulting in more than 16600 QA pairs, which we will openly release. Finally, (iv) we implement a QADiscourse parser, serving as a baseline for predicting discourse questions and respective answers, capturing multiple discourse-based propositions in a sentence.

2 Background

Discourse Relations Discourse Relations in the Penn Discourse Treebank (PDTB) (Prasad et al., 2008; Webber et al., 2019), as seen in ex. (1), are represented by two arguments, labeled either Arg1 or Arg2, the discourse connective (in case of an explicit relation) and finally the relation sense(s) between the two, in this case both Temporal.Asynchronous.Succession and Contingency.Cause.Reason.

(1) BankAmerica climbed 1 3/4 to 30 (Arg1) after PaineWebber boosted its investment opinion on the stock to its highest rating (Arg2).

These relations are called shallow discourse relations since, contrary to the Rhetorical Structure Theory (RST) (Mann and Thompson, 1988; Carlson et al., 2002), they do not recursively build a tree. The PDTB organizes its sense taxonomy, of which examples can be seen in Table 3, into three levels, with the last one denoting the direction of the relation. The PDTB annotation scheme has additionally been shown to be portable to other languages (Zeyrek et al., 2018; Long et al., 2020).

QASRL: Back in Warsaw that year, Chopin heard Niccolo Paganini play the violin, and composed a set of variations, Souvenir de Paganini.
  What did someone hear? - Niccolo Paganini play the violin
  When did someone compose something? - that year

QAMR: Actor and television host Gary Collins died yesterday at age 74.
  What kind of host was Collins? - television
  How old was Gary Collins? - 74

Table 2: Examples of QASRL and QAMR annotations.

Semantic QA Approaches Using QA structures to represent semantic propositions has been proposed as a way to generate "soft" annotations, where the resulting representation is formulated using natural language, which is shown to be more intuitive for untrained annotators (He et al., 2015). This allows much quicker, more large-scale annotation processes (FitzGerald et al., 2018) and, when used in a more controlled crowd-sourcing setup, can produce high-coverage quality annotations (Roit et al., 2020).

As displayed in Table 2, both QASRL and QAMR collect a set of QA pairs for a sentence, each pair representing a single proposition. In QASRL the main target is a predicate, which is emphasized by replacing all content words in the question besides the predicate with a placeholder. The answer constitutes a span of the sentence. The annotation process itself for QASRL is very controlled, with questions suggested by a finite-state automaton. QAMR, on the other hand, allows annotators to freely ask all kinds of questions about all types of content words in a sentence.

QA Approaches for Discourse The relation between discourse structures and questioning has been pointed out by Van Kuppevelt (1995), who claims that the discourse is driven by explicit and implicit questions: a writer carries a topic forward by answering anticipated questions given the preceding context. Roberts (2012) introduces the term Question Under Discussion (QUD), which stands for a question that interlocutors accept in discourse and engage in finding its answer. More recently, there has been work on annotating QUDs, by asking workers to identify questions raised by the text and checking whether or not these raised questions get answered in the following discourse (Westera and Rohde, 2019; Westera et al., 2020). These QUD annotations are conceptually related to QADiscourse by representing discourse information through QAs, solicited from laymen speakers.


The main difference lies in the propositions captured: we collect questions that have an answer in the sentence, targeting specific relation types. In the QUD annotations (Westera et al., 2020) any type of question can be asked that might or might not be answered in the following discourse.

Previous Discourse Parsing Efforts Most of the recent work on models for (shallow) discourse parsing focuses on specific subtasks, for example on argument identification (Knaebel et al., 2019), or discourse sense classification (Dai and Huang, 2019; Shi and Demberg, 2019; Van Ngo et al., 2019). Full (shallow) discourse parsers tend to use a pipeline approach, for example by having separate classifiers for implicit and explicit relations (Lin et al., 2014), or by building different models for intra- vs. inter-sentence relations (Biran and McKeown, 2015). We also adopt the pipeline approach for our baseline model (Section 6), which performs both relation classification and argument identification, since our QA pairs jointly represent arguments and relations.

Previous Discourse Crowdsourcing Efforts There has been research on how to crowd-source discourse relation annotations. Kawahara et al. (2014) crowd-source Japanese discourse relations and simplify the task by reducing the tagset and extracting the argument spans automatically. A follow-up paper found that the data quality of these Japanese annotations was lacking compared to expert annotations (Kishimoto et al., 2018). Furthermore, Yung et al. (2019) even posit that it is impossible to crowdsource high quality discourse sense annotations and suggest re-formulating the task as a discourse connective insertion problem. This approach has previously also been used in other configurations (Rohde et al., 2016; Scholman and Demberg, 2017). Similarly to our QADiscourse approach, inserting connectives also works with soft natural language annotations, as we propose, but it simplifies the task greatly by only annotating the connective, rather than retrieving complete discourse relations.

3 Representing Discourse Relations as QA Pairs

In this section we discuss how to represent shallow discourse relations through QA pairs. For an overview, consider the second sentence in Table 1, and the two predicates 'decided' and 'knowing', each being part of a discourse unit. The sense of the discourse relation between these two units can be characterized by the question prefix "Despite what ...?" (see Table 3). Accordingly, the full QA pair represents the proposition asserted by this discourse relation, with the question and answer corresponding to the two discourse units. A complete QADiscourse representation for a text would thus consist of a set of such QAs, representing all propositions asserted through discourse relations.

The Discourse Relation Sense We want our questions to denote relation senses. To define the set of discourse relations covered by our approach, we derived a set of question templates that cover most discourse relations in the PDTB 3.0 (Webber et al., 2019; Prasad et al., 2008), as shown in Table 3. Each question template starts with a question prefix, which specifies the relation sense. The placeholder X is completed to capture the discourse unit referred to by the question, as in Table 1.

A few PDTB senses are not covered by our question prefixes. First, senses with pragmatic specifications like Belief and SpeechAct were collapsed into their general sense. Second, three Expansion senses were not included because they usually do not assert a new "informational" proposition, about which a question could be asked, but rather capture structural properties of the text. One of those is Expansion.Conjunction, which is one of the most frequently occurring senses in the PDTB, especially in intra-sentential VP conjunctions, where it makes up about 70% of the sense instances (Webber et al., 2016). Ex. (2) displays a discourse relation with two senses, one of which is Expansion.Conjunction. While it is natural to come up with a question targeting the causal sense, the conjunction relation does not seem to assert any proposition about which an informational question may be asked.

(2) "Digital Equipment announced its first mainframe computers, targeting IBM's largest market and heating up the industry's biggest rivalry." (explicit: Expansion.Conjunction, implicit: Contingency.Cause.Result)

Finally, we removed Purpose as a (somewhat subtle) individual sense and subsumed it under our two causal questions.

Our desiderata for the question templates are as follows. First, we want the question prefixes to be applicable to as many scenarios as possible in which discourse relations can occur, while at the same time ideally adhering to a one-to-one mapping of sense to question. Similarly, we avoid question templates that are too general. The WHEN-Question in QASRL (Table 2), for instance, can be used for either Temporal or Conditional relations. Here we employ more specific question prefixes to remove this ambiguity. Finally, as multiple relation senses can hold between two discourse units (Rohde et al., 2018), we similarly allow multiple QA pairs for the same two discourse units.

PDTB Sense                   Question Template
Expansion.Substitution       Instead of what X?
Expansion.Disjunction        What is an alternative to X?
Expansion.Exception          Except when X?
Comparison.Concession        Despite what X?
Comparison.Contrast          What is contrasted with X?
Expansion.Level-of-detail    What is an example of X?
Comparison.Similarity        What is similar to X?
Temporal.Asynchronous        After/Before what X?  Until/Since when X?
Temporal.Synchronous         While what X?
Contingency.Condition        In what case X?
Contingency.Neg.-cond.       Unless what X?
Expansion.Manner             In what manner X?
Contingency.Cause            What is the result of X?  What is the reason X?

Table 3: Informational PDTB senses mapped to our question templates.
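For convenience, the mapping of Table 3 can also be kept in a small lookup structure. The following Python rendering is our own (not code released with the paper); slashed template variants are expanded into separate entries, with X standing for the material copied from the sentence:

SENSE_TO_TEMPLATES = {
    "Expansion.Substitution": ["Instead of what X?"],
    "Expansion.Disjunction": ["What is an alternative to X?"],
    "Expansion.Exception": ["Except when X?"],
    "Comparison.Concession": ["Despite what X?"],
    "Comparison.Contrast": ["What is contrasted with X?"],
    "Expansion.Level-of-detail": ["What is an example of X?"],
    "Comparison.Similarity": ["What is similar to X?"],
    "Temporal.Asynchronous": ["After what X?", "Before what X?",
                              "Until when X?", "Since when X?"],
    "Temporal.Synchronous": ["While what X?"],
    "Contingency.Condition": ["In what case X?"],
    "Contingency.Neg.-cond.": ["Unless what X?"],
    "Expansion.Manner": ["In what manner X?"],
    "Contingency.Cause": ["What is the result of X?", "What is the reason X?"],
}

# Question prefixes are the templates with the X placeholder stripped.
QUESTION_PREFIXES = {t.replace(" X?", "") for ts in SENSE_TO_TEMPLATES.values() for t in ts}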

The Discourse Units The two sentence fragments, typically clauses, that we relate with a question are the discourse units. In determining what makes a discourse unit, we include verbal predicates, noun phrases and adverbial phrases as potential targets. This, for example, would also cover such instances: "Because of the rain ..." or "..., albeit warily". We call the corresponding verb, noun or adverb heading a discourse unit a target.

A question is created by choosing a question prefix, an auxiliary, if necessary, and copying words from the sentence. It can then be manually adjusted to be made grammatical. Similarly, the discourse unit making up the answer consists of words copied from the sentence and can be modified to be made grammatical. Our question and answer format thus deviates considerably from the QASRL representation. By not introducing placeholders, questions sound more natural and are easier to answer compared to QASRL, while still being more controlled than the completely free-form questions of QAMR. In addition, allowing for small grammatical adjustments introduces valuable flexibility which contributes to the intuitiveness of the annotations.

1. And I also feel like in a capitalistic society, checks and balances happen when there is competition.
   In what case do checks and balances happen? - when there is competition in a capitalistic society
2. Whilst staying in the hotel, the Wikimedian group met two MEPs who chose it in-preference to dramatically more-expensive Strasbourg accommodation.
   What is contrasted with the hotel? - dramatically more-expensive Strasbourg accommodation
3. There were no fare hikes announced as both passenger and freight fares had been increased last month.
   What is the reason there were no fare hikes announced? - as both passenger and freight fares had been increased last month
   What is the result of both passenger and freight fares having been increased last month? - There were no fare hikes announced

Table 4: Example sentences with their corresponding Question-and-Answer pairs.

Relation Directionality Discourse relations are often directional. Our QA format introduces directionality by placing discourse units into either the question or the answer. For some question prefixes, a single order is dictated by the question. As seen in ex. 1 of Table 4, because the question asks for the condition, the condition itself will always be in the answer. Another ordering pattern occurs for symmetric relations, meaning that the relation's assertion remains the same no matter how the arguments are placed into the question and answer, as in ex. 2 in Table 4. Finally, certain pairs of relation senses are considered reversed, such as the causal (reason vs. result) and some of the temporal (before vs. after) question prefixes. In this case, two QA pairs with different question prefixes can denote the same assertion when the target discourse units are reversed, as shown in ex. 3 in Table 4. These patterns of directionality impact annotation and evaluation, as will be described later on.

4 The Crowdsourcing Process

Pool of Annotators To find a suitable group of annotators we followed the Controlled Crowdsourcing Methodology of Roit et al. (2020). We first released two trial tasks, after which we selected the best performing workers. These workers then underwent two short training cycles, estimated to take about an hour each, which involved reading the task guidelines, consisting of 42 slides,1 completing 30 HITs per round and reading personal feedback after each round (preparing these feedbacks consumed about 4 author work days). 11 workers successfully completed the training.

1 https://github.com/ValentinaPy/QADiscourse

For collecting production annotations of the Dev and Test Sets, each sentence was annotated by 2 workers independently, followed by a third worker who adjudicated their QA pairs to produce the final set. For the Train Set, sentences were annotated by a single worker, without adjudication.

Preprocessing In preprocessing, question targets are extracted automatically using heuristics and POS tags: a sentence is segmented using punctuation and discourse connectives (from a list of connectives from the PDTB). For each segment, we treat the last verb in a consecutive span of verbs as a separate target. In case a segment does not contain a verb, but does start with a discourse connective, we choose one of the nouns (or adverbs) as target. The following illustrates our target extraction: [Despite labor-shortage warnings,] [80% aim for first-year wage increases of under 4%;] [and 77% say they'd try to replace workers,] [if struck,] [or would consider it.]
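As a rough illustration of these heuristics (not the authors' released code), the following Python sketch segments a sentence on punctuation and connectives and then picks one target per segment; spaCy and the short connective list are our own stand-ins for the POS tagger and the PDTB connective list actually used:

import spacy

nlp = spacy.load("en_core_web_sm")
# Illustrative subset of the PDTB connective list (the paper filters out ambiguous connectives).
CONNECTIVES = {"because", "although", "if", "while", "despite", "after", "before", "unless"}

def extract_targets(sentence):
    doc = nlp(sentence)
    # Segment on punctuation and on discourse connectives.
    segments, current = [], []
    for tok in doc:
        if tok.text in {",", ";", ":"} or (tok.lower_ in CONNECTIVES and current):
            if current:
                segments.append(current)
            current = [] if tok.text in {",", ";", ":"} else [tok]
        else:
            current.append(tok)
    if current:
        segments.append(current)

    targets = []
    for seg in segments:
        verbs = [t for t in seg if t.pos_ in {"VERB", "AUX"}]
        if verbs:
            # Simplification: take the last verb of the segment (the paper takes the
            # last verb of each consecutive verb run).
            targets.append(verbs[-1].text)
        elif seg[0].lower_ in CONNECTIVES:
            # Verbless segment headed by a connective: fall back to a noun (or adverb).
            nouns = [t for t in seg if t.pos_ in {"NOUN", "PROPN", "ADV"}]
            if nouns:
                targets.append(nouns[0].text)
    return targets

print(extract_targets("Because of the rain, the game was cancelled."))
# -> ['rain', 'cancelled'] with the assumed connective list and tagger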

Annotation Tool and Procedure Using Amazon Mechanical Turk, we implemented two interfaces,2 one for the QA generation and one for the QA adjudication step.

2 Please refer to the appendix for all UI screenshots.

In the QA generation interface, workers are shown a sentence with all target words in bold. Workers are instructed to generate one or more questions that relate two of these target words. The question is generated by first choosing a question prefix, then, if applicable, an auxiliary, then selecting one or more spans from the sentence to form the complete question, and lastly, changing it to make it grammatical. Given the generated question, the next step involves answering that question by selecting span(s) from the sentence. Again, the answer can also be amended to be made grammatical.

The QA adjudication interface displays a sentence and all the QA pairs produced for that sentence by two annotators. For each QA pair the adjudicator is asked to label it as either correct, not correct, or correct, but not grammatical. Duplicates and nonsensical QA pairs labeled as not correct are omitted from the final dataset. As a last step, the first author manually corrected all the not grammatical instances.

Data and Cost We sampled sentences from the Wikinews and Wikipedia sections of Large Scale QASRL (FitzGerald et al., 2018) while following their Train, Dev & Test splits. Descriptive statistics for the final dataset are shown in Tables 5 and 6.

Dataset    Split  Sentences  Questions
Wikinews   Train  3098       4760
Wikinews   Dev    669        1108
Wikinews   Test   658        1498
Wikipedia  Train  3277       6225
Wikipedia  Dev    667        1524
Wikipedia  Test   678        1498
Overall           9047       16613

Table 5: Dataset Statistics: Number of sentences containing at least 1 QA annotation and the total number of QAs collected.

QA Prefix                     Count  Proportion
In what manner X?             4225   25%
What is the reason X?         3238   19%
What is the result of X?      2735   16%
What is an example of X?      1757   11%
After what X?                 1099   7%
While what X?                 1060   6%
In what case X?               509    3%
Despite what X?               477    3%
What is contrasted with X?    317    2%
Before what X?                299    2%
Since when X?                 279    2%
What is similar to X?         218    1%
Until when X?                 155    1%
Instead of what X?            105    1%
What is an alternative to X?  92     ≤ 1%
Except when X?                27     ≤ 1%
Unless what X?                21     ≤ 1%

Table 6: Counts of collected question types.

Annotating a sentence of Dev and Test yielded 2.11 QA pairs with a cost of 50.3¢ on average. For Train, the average cost was 37.1¢ for 1.72 QAs per sentence.

5 Dataset Evaluation

5.1 Evaluation Metrics

We aim to evaluate QA pairs, as the output of both the annotation process and the question generation and answering model, which are not the same as discourse relation triplets. There are multiple difficulties that arise when evaluating the QADiscourse setup. We allow multiple labels per proposition pair and thus need evaluation measures suitable for multi-label classification. Annotators are generating the questions and answers, which, contrary to a pure categorical labelling task, implies that we have to take into consideration question and answer paraphrasing and natural language generation inconsistencies. This requires us to use metrics that create alignments between sets of QAs, which means that existing discourse relation evaluation methods, such as from CoNLL-2015 (Xue et al., 2015), are not applicable. The following metrics, which we apply for both the quality analysis of the dataset and the parser evaluation, are closely inspired by previous work on collecting semantic annotations with QA pairs (Roit et al., 2020; FitzGerald et al., 2018).

Unlabeled Question and Answer Span Detection (UQA) (F1) This metric is inspired by the question alignment metric for QASRL, which takes into account that there are many ways to phrase a question and therefore an exact match metric will be too harsh. Given a sentence and two sets of QA pairs produced for that sentence, such as gold and predicted sets, we want to match the QAs from the two sets for comparison. A QA pair is aligned with another QA pair that has the maximal intersection over union (IOU) ≥ 0.5 on a token level, or else remains unaligned.3 Since we allow multiple QA pairs for two targets, we also allow one-to-many and many-to-many alignments. As we are evaluating unlabeled relations at this point, we do not consider relation direction and therefore do not differentiate between question and answer spans.

3 The average length for tokenized questions and answers is 12.22 and 10.27 respectively.
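A minimal sketch of this alignment, assuming whitespace tokenization and a simple greedy threshold match (the actual evaluation additionally builds one-to-many and many-to-many alignments; this is our own simplification):

def qa_tokens(qa):
    """Pool question and answer tokens, since UQA ignores which span is which."""
    question, answer = qa
    return set(question.lower().split()) | set(answer.lower().split())

def iou(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def uqa_f1(predicted, gold, threshold=0.5):
    """predicted, gold: lists of (question, answer) string pairs for one sentence."""
    matched_pred = sum(
        any(iou(qa_tokens(p), qa_tokens(g)) >= threshold for g in gold) for p in predicted)
    matched_gold = sum(
        any(iou(qa_tokens(g), qa_tokens(p)) >= threshold for p in predicted) for g in gold)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0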

Labeled Question and Answer Span Detection (LQA) (Accuracy) Given the previously produced alignments from UQA, we check for the exact match of aligned question prefixes. For many-to-many and many-to-one alignments we count the alignment as correct if there is overlap of at least one question prefix. Reversed and symmetric question prefixes are converted to a more general label for fair comparison.
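The labeled check can then be sketched as follows; the normalization map is our own reading of which prefixes are reversed counterparts (Section 3), not the released evaluation code:

# Reversed prefix pairs collapse to one general label before exact-match comparison.
GENERAL_LABEL = {
    "What is the reason": "CAUSAL", "What is the result of": "CAUSAL",
    "After what": "TEMPORAL-ASYNC", "Before what": "TEMPORAL-ASYNC",
    "Since when": "TEMPORAL-ASYNC", "Until when": "TEMPORAL-ASYNC",
}

def normalize(prefix):
    return GENERAL_LABEL.get(prefix, prefix)

def lqa_accuracy(aligned_prefix_pairs):
    """aligned_prefix_pairs: (predicted_prefix, gold_prefix) tuples from the UQA alignment."""
    if not aligned_prefix_pairs:
        return 0.0
    correct = sum(normalize(p) == normalize(g) for p, g in aligned_prefix_pairs)
    return correct / len(aligned_prefix_pairs)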

5.2 Dataset Quality

Inter-Annotator Agreement (IAA) To calculate the agreement between individual annotators we use the above metrics (UQA and LQA) for different worker-vs-worker configurations. The setup is the following: a set of 4 workers annotates the same sentences (around 60), from which we then calculate the agreement between all the possible pairs of workers. We repeat this process 3 times and show the average agreement scores in Table 7. The scores after adjudication, pertaining to the actual dataset agreement level, are produced by comparing the resulting annotation of two worker triplets, each consisting of two annotators and a separate adjudicator on the same data, averaged over 3 samples of 60 sentences each. These results show that adjudication notably improves agreement.

                UQA    LQA
Before Adjud.   76.87  56.64
After Adjud.    87.44  65.46

Table 7: IAA scores before and after adjudication.

5.3 Agreement with Expert Set

Our Expert Set consists of 25 sentences annotated with QA pairs by the first author of the paper. Comparing the adjudicated crowdsourced annotations with the Expert Set yields a UQA (LQA) of 93.9 (80), indicating a high quality of our collected annotations. The main issue in disagreement arises from sentences that do not contain overt propositional discourse relations, where workers attempt to ask questions anyway, resulting in sometimes unnatural or overly implicit questions.

5.4 Comparison with PDTB

We crowdsourced QA annotations of 60 sentences from section 20 of the PDTB (commonly used as Train) with our QA annotation protocol. The PDTB arguments are aligned with the QA pairs using the UQA metric, by considering Arg1 and Arg2 as the text spans to be aligned with the question and answer text,4 yielding 83.2 Precision, 87.5 Recall and an F1 of 85.3.

4 Such alignment is usually straightforward, since annotators do not add content words when producing QAs.

A manual comparison of the PDTB labels with the Question Prefixes reveals that in most of the cases the senses overlap in meaning, with some exceptions on both sides. 60% of aligned annotations correspond exactly in the discourse relation sense they express. The remaining 40% of the QA pairs belong to one of the following categories:

(1) Discourse relations that we deemed to be non-informational at the propositional level were many times still annotated with our QA pairs. Take this sentence: [...], a Soviet-controlled regime remains in Kabul, the refugees sit in their camps, and the restoration of Afghan freedom seems as far off as ever. The PDTB posits an Exp.Conjunction relation between the two cursive arguments, which is a relation type that we do not cover in the QA framework, yet our annotators saw an implied causal relation which they expressed with the following (sensible) QA pair: What is the reason the restoration of Afghan freedom seems as far off as ever? - a Soviet-controlled regime remains in Kabul.

Frank Carlucci III was named to this telecommunications company's board, filling the vacancy created by the death of William Sobey last May. (Contingency.Cause.Result)
  After what was Frank Carlucci III named to this telecommunications company's board? - the death of William Sobey last May
  What is the reason Frank Carlucci III was named to this telecommunications company's board? - filling the vacancy created by the death of William Sobey last May

Table 8: Example of a QADiscourse relation which was not captured in the PDTB.

(2) Interestingly, we observe that some annotation decision difficulties described in the PDTB (Webber et al., 2019) are also mirrored in our collected data. One of those arising ambiguities is the difference between Comparison.Contrast and Comparison.Concession, in our case Despite what and What is contrasted with. In the manually analyzed data sample, 3 such confusions were found between the QADiscourse and the PDTB annotations.

(3) There were 15 instances of PDTB relation senses that were erroneously not annotated with an appropriate QA pair, even though a suitable Question Prefix exists, corresponding to some of the 12.5% recall misses in the comparison.

(4) On the contrary, there were 36 QA instances that capture appropriate propositions which were completely missed in the PDTB.5 For example, in Table 8, the PDTB only mentions the causal relation, while QADiscourse found both the causal and the temporal sense.

5 The full list of these instances can be found in the appendix.

Additionally, we noticed that annotators tend to ask "What is similar to..?" questions about conjunctions, indicating that conjoined clauses seem to imply a similarity between them, while the similarity relation in the PDTB is rather used in more explicit comparison contexts. The "In what case..?" questions were sometimes used for adjuncts specifying a time or place. Overall, these comparisons show that agreement with the PDTB is good, with QADiscourse even finding additional valid relations, indicating that it is feasible to crowdsource high-quality discourse relations via QADiscourse.

5.5 Comparison with QAMR and QASRL

While commonly treated as two distinct levels of textual annotation, there are nevertheless some commonalities between shallow discourse relations and semantic roles. This interplay of discourse and semantics has also been noted by Prasad et al. (2015), who made use of clausal adjunct annotations in PropBank to enrich intra-sentential discourse annotations and vice versa. Similarly, we found that there are questions in QASRL, QAMR and QADiscourse which express kindred relations: Manner, Condition, Causal and Temporal relations could all be asked about using QASRL-like WH-questions. But then the point of question ambiguity arises: if "When" can be used to ask about conditional relations, it is more often also used to denote temporal relations. This under-specification becomes problematic when attempting to map between QAs and labels from resources such as PropBank. Therefore, despite the propositional overlap of some of the question types, QADiscourse additionally enriches and refines QASRL annotations.

Question Prefix               Count
What is the reason/result of  23/20
In what manner                19
While/After/Before what       19
What is an example of         10
Since/Until when              5
In what case                  4
Despite what                  1

Table 9: Count of QADiscourse Question Prefixes of questions that could be aligned to QAMR.

Since QAMR does not restrict itself to predicate-argument relations only, we performed an analysis of whether annotators tend to ask about QADiscourse-type relations in a general QA setting. 965 sentences contain both QAMR and QADiscourse annotations, with 1505 QADiscourse pairs, of which we could align 101 (7%) to QAMR annotations, using the UQA-alignment algorithm. We conclude that QAMR and QADiscourse target mostly different propositions and relation types.

Within the 101 QADiscourse QAs that were aligned with QAMR questions (Table 9), causal and temporal relations are very common, usually expressed, as expected, by "Why" or "When" questions in QAMR. In other cases, the aligned questions express different relation senses. Notably, the QADiscourse In what manner relation aligns with a "How" QAMR question only once out of 19 cases. Often, it seems that QADiscourse annotators were tempted to ask a somewhat inappropriate manner question while the relation between the predicate and the answer corresponded to a direct semantic role (like location) rather than to a discourse relation (first example in Table 10). The second example in Table 10 corresponds to a case where the predicate-answer relation has two senses, a discourse sense captured by QADiscourse (What is an example of), as well as a semantic role ("theme"), captured by a "What" question in QAMR. These observations suggest interesting future research on integrating QADiscourse annotations with semantic role QA annotations, like QASRL and QAMR.

She said he "had friends in every political party ..."
  QADISC.: In what manner did he have friends?
  QAMR: Where does he have friends?
  ANSWER: in every political party

... your internet access provider can still keep track of what websites you visit, websites can collect information about you and so on.
  QADISC.: What is an example of something your internet access provider can still keep track of?
  QAMR: Your provider can keep track of what?
  ANSWER: what websites you visit

Table 10: Examples of interesting aligned cases between QAMR and QADiscourse.

6 Baseline Model for QADiscourse

In this section we aim to devise a baseline discourse parser based on our proposed representation, which accepts a sentence as input and outputs QA pairs for all discourse relations in that sentence, to be trained on our collected data. Similarly to previous work on discourse parsing (Section 2), our proposed parser is a pipeline consisting of three phases: (i) question prefix prediction, (ii) question generation, and (iii) answer generation.

Formally, given a sentence X = x_0, ..., x_n with a set of indices I which mark target words (based on the target extraction heuristics in Section 4), we aim to produce a set of QA pairs (Q_j, A_j) using the following pipeline:

1. Question Prefix Prediction: Let Ψ be the set of all Question Prefixes, each reflecting a relation sense from the list shown in Table 3. For each target word x_i, such that i ∈ I, we predict a set of possible question prefixes P_{x_i} ⊆ Ψ. The set P = ⋃_{i ∈ I} P_{x_i} is now defined to be the set of all prefixes for all targets in the sentence.

2. Question Generation: For every question prefix p ∈ P and all its relevant target words P_p = {x_i | p ∈ P_{x_i}}, predict question bodies for one or more of the targets, Q^1_p, ..., Q^m_p.

3. Answer Generation: Let a full question FQ^j_p be defined by the concatenation of the question prefix and the corresponding generated question body, FQ^j_p = ⟨p, Q^j_p⟩. Given a sentence X and the question FQ^j_p, we aim to generate an answer A^j_p.

All in all, we can generate up to |I| × |Ψ| different QAs per sentence.
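The pipeline can be read as the following high-level Python sketch, where predict_prefixes, generate_question_bodies and generate_answer are placeholders for the three models of Sections 6.1-6.3:

def qadiscourse_pipeline(sentence_tokens, target_indices,
                         predict_prefixes, generate_question_bodies, generate_answer):
    # Step 1: P = union over i in I of P_{x_i}, remembering which targets proposed each prefix.
    prefix_to_targets = {}
    for i in target_indices:
        for prefix in predict_prefixes(sentence_tokens, i):
            prefix_to_targets.setdefault(prefix, set()).add(i)

    qa_pairs = []
    for prefix, targets in prefix_to_targets.items():
        # Step 2: one or more question bodies Q^1_p, ..., Q^m_p per prefix.
        for body in generate_question_bodies(sentence_tokens, prefix, targets):
            full_question = f"{prefix} {body}"          # FQ^j_p = <p, Q^j_p>
            # Step 3: answer A^j_p for the full question.
            qa_pairs.append((full_question, generate_answer(sentence_tokens, full_question)))
    return qa_pairs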

6.1 Question Prefix Prediction

In the first step of our pipeline we are given a sentence and a marked target, and we aim to predict a set of possible prefixes reflecting potential discourse senses for the relation to be predicted. We frame this task of predicting a set of prefixes as a multi-label classification task.

To represent I, the input sentence X = x_0, ..., x_n is concatenated with a binary target indicator, and special tokens are placed before and after the target x_i. The output of the system is a set of question prefixes P_{x_i}.

We implement the model using BERT (Devlin et al., 2019) in its standard fine-tuning setting, except that the Softmax layer is replaced by a Sigmoid activation function to support multi-label classification. The predicted question prefixes are obtained by choosing those labels that have a logit ≥ τ = 0.3, which was tuned on Dev to maximize UQA F1. Since the label distribution is skewed, we add weights to the positive examples for the binary cross-entropy loss.
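A minimal sketch of such a classifier with HuggingFace Transformers and PyTorch; the [TGT] markers, the base model size and the threshold handling are our own stand-ins for the paper's exact target-indicator encoding:

import torch
from torch import nn
from transformers import BertModel, BertTokenizerFast

NUM_PREFIXES = 17  # one output unit per question prefix (see Table 6)

class PrefixClassifier(nn.Module):
    def __init__(self, pos_weight):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(self.bert.config.hidden_size, NUM_PREFIXES)
        # Weighted binary cross-entropy for the skewed multi-label distribution.
        self.loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

    def forward(self, input_ids, attention_mask, labels=None):
        cls = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        logits = self.head(cls)
        loss = self.loss_fn(logits, labels) if labels is not None else None
        return logits, loss

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# "[TGT]" markers around the target word stand in for the target-indicator encoding.
enc = tokenizer("The executions were spurred by lawmakers [TGT] requesting [TGT] action.",
                return_tensors="pt")
model = PrefixClassifier(pos_weight=torch.ones(NUM_PREFIXES))
logits, _ = model(enc["input_ids"], enc["attention_mask"])
predicted_prefix_ids = (torch.sigmoid(logits)[0] >= 0.3).nonzero(as_tuple=True)[0]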

6.2 Question Generation

Next in our pipeline, given the sentence, a question prefix and its relevant targets in the sentence, we aim to generate question bodies for one or more of the targets. To this end, we employ a Pointer Generator model (Jia and Liang, 2016) such that the input to the model is encoded as follows: [CLS] x_1, x_2, ..., x_n [SEP] p [SEP], with p ∈ P being the question prefix. Additionally, we concatenate a target indicator for all relevant targets P_p. The output is one or more question bodies Q_p separated by a delimiter token: Q^1_p [SEP] Q^2_p [SEP] ... Q^m_p. The model then chooses whether to copy a word from the input, or to predict a word during decoding. We use the AllenNLP (Gardner et al., 2018) implementation of CopyNet (Gu et al., 2016) and adapt it to work with BERT encoding of the input.
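To make the encoding concrete, here is a small sketch of the input/output layout only (the actual generator is the BERT-encoded CopyNet described above; the <t> target markers are our own stand-in for the concatenated target indicator):

def build_qgen_input(tokens, prefix, target_indices):
    # Mark the relevant targets and append the question prefix after a separator.
    marked = [f"<t> {tok} </t>" if i in target_indices else tok
              for i, tok in enumerate(tokens)]
    return "[CLS] " + " ".join(marked) + " [SEP] " + prefix + " [SEP]"

tokens = "I decided to do a press conference knowing there would be consequences".split()
print(build_qgen_input(tokens, "Despite what", {1}))
# The decoder then emits one or more question bodies separated by [SEP], e.g.
# "did I decide to do a press conference"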

6.3 Answer Generation

To predict the answer given a full question, we use BERT fine-tuned on SQuAD (Rajpurkar et al., 2016).6 We additionally fine-tune the model on a subset of our training data (all 5004 instances where we could align the answer to a consecutive span in the sentence). Instead of predicting or copying words from the sentence, this model predicts start and end indices in the sentence.

6 https://huggingface.co/transformers/pretrained_models.html
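A minimal sketch of this step with an off-the-shelf SQuAD-fine-tuned BERT from the transformers library (the model name is a stand-in, and the additional fine-tuning on the 5004 aligned QADiscourse instances is omitted here):

from transformers import pipeline

# Extractive QA: the model scores start and end indices over the sentence.
answerer = pipeline("question-answering",
                    model="bert-large-uncased-whole-word-masking-finetuned-squad")

sentence = "The executions were spurred by lawmakers requesting action to curb rising crime rates."
question = "What is the reason lawmakers requested action?"
print(answerer(question=question, context=sentence)["answer"])
# e.g. "to curb rising crime rates"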

Successful predictions:
1. This process, [...], rather than maintaining it as a network of unequal principalities, would ultimately be completed by Caesar's successor [...]
   Instead of what would this process [...] be completed by Caesar's successor? - rather than maintaining it as a network of unequal principalities
2. Most decked vessels were mechanized, but two-thirds of the open vessels were traditional craft propelled by sails and oars.
   What is contrasted with most decked vessels appearing mechanized? - two-thirds of the open vessels were traditional craft propelled by sails and oars
3. It could hit Hawaii if it stays on its predicted path.
   In what case could it hit Hawaii? - if it stays on its predicted path

Wrong predictions:
4. A writer since his teens, Pratchett first came to prominence with the Discworld novel [...]
   Since when did Pratchett a writer? - since his teens
5. Each segment of the search could last for several weeks before resupply in Western Australia.
   What is the reason each segment of the search could last for several weeks? - before resupply in Western Australia
6. For Cook Island Maori, it was 29% compared to 23%; for Tongans, 37% to 29% [...].
   What is contrasted with it For Cook Island Maori? - 23 %

Table 11: Examples of the QA output of the full pipeline: successful predictions (1-3) and wrong predictions (4: not grammatical but sensible, 5: non-sensical but grammatical, 6: neither).

                 Dev    Test
UQA Precision    81.1   80.79
UQA Recall       84.94  86.8
UQA F1           82.98  83.69
LQA Accuracy     67.49  66.59
Prefix Accuracy  51.3   49.94

Table 12: Full pipeline performance for the QA-Model evaluated with labeled and unlabeled QA-alignment.


7 Results and Discussion

After running the full pipeline, we evaluate the predicted set of QA pairs against the gold set using the UQA and LQA metrics described in Section 5.1. Table 12 shows the results. Note that the LQA is dependent on the UQA, as it calculates the labeled accuracy only for QA pairs that could be aligned with UQA. The Prefix Accuracy measure complements LQA by evaluating the overall accuracy of predicting a correct question prefix. For this baseline model it shows that generally only half of the generated questions have a question prefix equivalent to gold, leaving room for future models to improve upon. While not comparable, Biran and McKeown (2015), for example, mention an F1 of 56.91 for predicting intra-sentential relations.

The scores in Table 13 show the results for the subsequent individual steps, given gold input, evaluated using a matching criterion of intersection over union ≥ 0.5 with the respective gold span.

                     Dev   Test
Question Prediction  71.9  65.9
Answer Prediction    68.9  72.3

Table 13: Accuracy of questions and answers predicted by the question and answer prediction models, given Gold input, compared to the Gold spans.

We randomly selected a sample of 50 predicted QAs for a qualitative analysis. 22 instances from this sample were judged as correct, and 2 instances were correct despite not being mentioned in Gold. Examples of good predictions are shown in the left part of Table 11. The model is often able to learn when to do the auxiliary flip from clause to question format and when to change the verb form of the target. Interestingly, whenever the model was not familiar with a specific verb, it chose a similar verb in the correct form, for example 'appearing' in ex. 2. The model is also able to form a question using non-adjacent spans of the sentence (ex. 1). Some predictions do not appear in the dataset, but make sense nonetheless. The analysis showed 8 non-grammatical but sensible QAs (e.g. ex. 4, where the sense of the relation is still captured), 8 non-sensical but grammatical QAs (ex. 5) and 7 QAs that were neither (ex. 6). Lastly, we found 3 QAs with good questions and wrong answers.

8 Conclusion

In this work, we show that discourse relations can be represented as QA pairs. This intuitive representation enables scalable, high-quality annotation via crowdsourcing, which paves the way for learning robust parsers of informational discourse QA pairs. In future work, we plan to extend the annotation process to also cover inter-sentential relations.


Acknowledgments

We would like to thank Amit Moryossef for his help with the implementation of the frontend, and Julian Michael, Gabriel Stanovsky and the anonymous reviewers for their feedback and suggestions. This work was supported in part by grants from Intel Labs, Facebook, the Israel Science Foundation grant 1951/17 and the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1) and by an ERC-StG grant #677352 and an ISF grant #1739/26.

References

Nicholas Asher, Nicholas Michael Asher, and Alex Lascarides. 2003. Logics of conversation. Cambridge University Press.

Or Biran and Kathleen McKeown. 2015. PDTB discourse parsing as a tagging task: The two taggers approach. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 96–104.

Lynn Carlson, Mary Ellen Okurowski, and Daniel Marcu. 2002. RST Discourse Treebank. Linguistic Data Consortium, University of Pennsylvania.

Zeyu Dai and Ruihong Huang. 2019. A regularization approach for incorporating event knowledge and coreference relations into neural discourse parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2967–2978.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Nicholas FitzGerald, Julian Michael, Luheng He, and Luke Zettlemoyer. 2018. Large-scale QA-SRL parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2051–2060.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6.

Barbara J. Grosz and Candace L. Sidner. 1986. Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175–204.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640.

Hangfeng He, Qiang Ning, and Dan Roth. 2020. QuASE: Question-answer driven sentence encoding. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

Luheng He, Mike Lewis, and Luke Zettlemoyer. 2015. Question-answer driven semantic role labeling: Using natural language to annotate natural language. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 643–653.

R. Jia and P. Liang. 2016. Data recombination for neural semantic parsing. In Association for Computational Linguistics (ACL).

Daisuke Kawahara, Yuichiro Machida, Tomohide Shibata, Sadao Kurohashi, Hayato Kobayashi, and Manabu Sassano. 2014. Rapid development of a corpus with discourse annotations using two-stage crowdsourcing. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 269–278.

Yudai Kishimoto, Shinnosuke Sawada, Yugo Murawaki, Daisuke Kawahara, and Sadao Kurohashi. 2018. Improving crowdsourcing-based annotation of Japanese discourse relations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Rene Knaebel, Manfred Stede, and Sebastian Stober. 2019. Window-based neural tagging for shallow discourse argument labeling. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 768–777.

I-Ta Lee and Dan Goldwasser. 2019. Multi-relational script learning for discourse relations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4214–4226.

Junyi Jessy Li, Kapil Thadani, and Amanda Stent. 2016. The role of discourse units in near-extractive summarization. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 137–147.

Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan. 2014. A PDTB-styled end-to-end discourse parser. Natural Language Engineering, 20(2):151–184.

Wanqiu Long, Xinyi Cai, James Reid, Bonnie Webber, and Deyi Xiong. 2020. Shallow discourse annotation for Chinese TED talks. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 1025–1032.


William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text - Interdisciplinary Journal for the Study of Discourse, 8(3):243–281.

Julian Michael, Gabriel Stanovsky, Luheng He, Ido Dagan, and Luke Zettlemoyer. 2018. Crowdsourcing question-answer meaning representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 560–568.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K. Joshi, and Bonnie L. Webber. 2008. The Penn Discourse TreeBank 2.0. In LREC. Citeseer.

Rashmi Prasad, Bonnie Webber, Alan Lee, Sameer Pradhan, and Aravind Joshi. 2015. Bridging sentential and discourse-level semantics through clausal adjuncts. In Proceedings of the First Workshop on Linking Computational Models of Lexical, Sentential and Discourse-level Semantics, pages 64–69.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Craige Roberts. 2012. Information structure: Towards an integrated formal theory of pragmatics. Semantics and Pragmatics, 5(6):1–69.

Hannah Rohde, Anna Dickinson, Nathan Schneider, Christopher N.L. Clark, Annie Louis, and Bonnie Webber. 2016. Filling in the blanks in understanding discourse adverbials: Consistency, conflict, and context-dependence in a crowdsourced elicitation task. In Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016), pages 49–58.

Hannah Rohde, Alexander Johnson, Nathan Schneider, and Bonnie Webber. 2018. Discourse coherence: Concurrent explicit and implicit relations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2257–2267.

Paul Roit, Ayal Klein, Daniela Stepanov, Jonathan Mamou, Julian Michael, Gabriel Stanovsky, Luke Zettlemoyer, and Ido Dagan. 2020. Crowdsourcing a high-quality gold standard for QA-SRL. In ACL 2020 Proceedings, forthcoming. Association for Computational Linguistics.

Merel Scholman and Vera Demberg. 2017. Crowdsourcing discourse interpretations: On the influence of context and the reliability of a connective insertion task. In Proceedings of the 11th Linguistic Annotation Workshop, pages 24–33.

Wei Shi and Vera Demberg. 2019. Next sentence prediction helps implicit discourse relation classification within and across domains. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5794–5800.

Jan Van Kuppevelt. 1995. Discourse structure, topicality and questioning. Journal of Linguistics, pages 109–147.

Linh Van Ngo, Khoat Than, Thien Huu Nguyen, et al. 2019. Employing the correspondence of relations and connectives to identify implicit discourse relations via label embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4201–4207.

Bonnie Webber, Markus Egg, and Valia Kordoni. 2012. Discourse structure and language technology. Natural Language Engineering, 18(4):437–490.

Bonnie Webber, Rashmi Prasad, Alan Lee, and Aravind Joshi. 2016. A discourse-annotated corpus of conjoined VPs. In Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016), pages 22–31.

Bonnie Webber, Rashmi Prasad, Alan Lee, and Aravind Joshi. 2019. The Penn Discourse Treebank 3.0 annotation manual.

Matthijs Westera, Laia Mayol, and Hannah Rohde. 2020. TED-Q: TED talks and the questions they evoke. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 1118–1127.

Matthijs Westera and Hannah Rohde. 2019. Asking between the lines: Elicitation of evoked questions in text. In Proceedings of the Amsterdam Colloquium.

Nianwen Xue, Hwee Tou Ng, Sameer Pradhan, Rashmi Prasad, Christopher Bryant, and Attapol Rutherford. 2015. The CoNLL-2015 shared task on shallow discourse parsing. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning - Shared Task, pages 1–16.

Frances Yung, Merel C.J. Scholman, and Vera Demberg. 2019. Crowdsourcing discourse relation annotations by a two-step connective insertion task. In The 13th Linguistic Annotation Workshop, page 16.

Deniz Zeyrek, Amalia Mendes, and Murathan Kurfalı. 2018. Multilingual extension of PDTB-style annotation: The case of TED Multilingual Discourse Bank. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018).

A Appendices

A.1 Reproducibility information

The code to reproduce the QADiscourse model can be found at https://github.com/ValentinaPy/QADiscourse.


Calculating the weights for the Loss of the Prefix Classifier Each question prefix label is weighted by subtracting the label count from the total count of training instances and dividing it by the label count: weight_x = (total_instances − count_x) / (count_x + 1e−5).
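A small sketch of this computation, using two of the label counts from Table 6 for illustration only:

def prefix_weights(label_counts):
    total = sum(label_counts.values())
    return {label: (total - count) / (count + 1e-5)
            for label, count in label_counts.items()}

# Only two label counts from Table 6 are used here for brevity; in training the
# total is taken over all prefixes.
print(prefix_weights({"In what manner": 4225, "Unless what": 21}))
# Rare prefixes such as "Unless what" receive a much larger positive weight.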

A.2 Annotation Details

The number of examples, the details of the splits, and the data collection process are described in the main body of the paper. Here we add a more detailed description of the Target Extraction Algorithm and screenshots of the annotation interfaces.

Target Extraction Algorithm In order to extract targets we use the following heuristics. First, we split the sentence on the punctuation marks ",", ";" and ":". This provides an initial, incomplete segmentation into clauses and subordinate clauses; we then try to find at least one target in each segment.

Second, we split the resulting text spans further using a set of discourse connectives. We removed the most ambiguous connectives from this list, i.e., those whose tokens can also serve other syntactic functions, such as "so", "as", "to" and "about".

Third, we check the POS tags of the resulting spans and treat each consecutive run of verbs as one target, with the last verb of the run being the target token. So as not to treat cases such as "is also studying" as separate targets, we also treat "V ADV V" as a single run. If a span contains no verb, we choose one of its nouns as the target, but only if the span starts with a discourse connective; this condition avoids selecting nouns that are merely part of enumerations, while still capturing eventive nouns (see b) for an example). To improve precision (by 0.02), we also exclude the verbs "said", "according" and "spoke".

With these heuristics we achieve a recall of 98.4 and a precision of 57.4 against the discourse relations in Section 22 of the PDTB.
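A rough sketch of these heuristics is given below. It is an illustrative simplification rather than the released implementation: spaCy is assumed as the POS tagger, the connective list is a small hand-picked sample, and the helper names (verb_runs, extract_targets) are ours.

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed POS tagger

# Illustrative, non-exhaustive connective list; highly ambiguous items
# such as "so", "as", "to", "about" are deliberately excluded.
CONNECTIVES = {"because", "although", "while", "after", "before",
               "since", "unless", "until", "despite", "instead"}
EXCLUDED_VERBS = {"said", "according", "spoke"}

def verb_runs(tokens):
    """Group consecutive VERB/AUX tokens into runs; a single adverb
    between verbs ("is also studying") keeps the run open."""
    runs, current = [], []
    for i, tok in enumerate(tokens):
        if tok.pos_ in ("VERB", "AUX"):
            current.append(tok)
        elif (tok.pos_ == "ADV" and current and i + 1 < len(tokens)
              and tokens[i + 1].pos_ in ("VERB", "AUX")):
            continue  # bridge the "V ADV V" pattern
        else:
            if current:
                runs.append(current)
                current = []
    if current:
        runs.append(current)
    return runs

def extract_targets(sentence):
    """Return candidate target words, roughly one per clause-like span."""
    targets = []
    for segment in re.split(r"[,;:]", sentence):            # step 1: punctuation
        words, spans, current = segment.split(), [], []
        for w in words:                                      # step 2: connectives
            if w.lower() in CONNECTIVES and current:
                spans.append(current)
                current = [w]
            else:
                current.append(w)
        if current:
            spans.append(current)
        for span in spans:                                   # step 3: POS heuristics
            doc = nlp(" ".join(span))
            runs = verb_runs(list(doc))
            if runs:
                for run in runs:
                    last = run[-1]                           # last verb of the run
                    if last.text.lower() not in EXCLUDED_VERBS:
                        targets.append(last.text)
            elif span[0].lower() in CONNECTIVES:             # verbless span
                nouns = [t for t in doc if t.pos_ in ("NOUN", "PROPN")]
                if nouns:
                    targets.append(nouns[0].text)
    return targets
```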

Cost Details The basic cost for each sentence was 18¢, with a bonus of 3¢ for creating a second QA pair and a bonus of 4¢ for every additional QA pair after the first two. Adjudication was rewarded with 10¢ per sentence. On average, 50.3¢ were spent per sentence of Dev and Test, with an average of 2.11 QA pairs per sentence. For Train, the average cost per sentence is about 37.1¢, with an average of 1.72 QA pairs.
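As a sanity check of the pay scheme, the writer payment for a single sentence can be reproduced with the small helper below; this is a sketch of the bonus arithmetic as described above, and the adjudication reward (10¢ per sentence) is a separate payment not included here.

```python
def writer_payment_cents(num_qa_pairs):
    """Payment for the QA-writing step of one sentence, in cents:
    18 base, +3 for a second QA pair, +4 for each pair after the first two."""
    payment = 18
    if num_qa_pairs >= 2:
        payment += 3
    if num_qa_pairs > 2:
        payment += 4 * (num_qa_pairs - 2)
    return payment

# e.g. 1 pair -> 18, 2 pairs -> 21, 4 pairs -> 29
```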

Annotation Interfaces The following screenshots (Figures 1–3) display the Data Collection and Adjudication interfaces.

Figure 1: Interface for the Question Generation step.

Figure 2: Interface for the Answer Generation step.

Figure 3: Interface for the Adjudication step.

A.3 Data Examples

Sentence: An inquest found he'd committed suicide, but some dispute this and believe it was an accident.
Question: Instead of what do some believe it was an accident?
Answer: suicide

Sentence: On Sunday, in a video posted on YouTube, Anonymous announced their intentions saying, "From the time you have received this message, our attack protocol has past been executed and your downfall is underway."
Question: Since when has our attack protocol past been executed and your downfall is underway?
Answer: From the time you have received this message

Sentence: It is unclear why this diet works.
Question: Despite what does this diet work?
Answer: Being unclear why

Sentence: It's a downgraded budget from a downgraded Chancellor [...] Debt is higher in every year of this Parliament than he forecast at the last Budget.
Question: What is similar to it's a downgraded budget?
Answer: It's a downgraded Chancellor

Sentence: According to Pakistani Rangers, the firing from India was unprovoked in both Sunday and Wednesday incidents; Punjab Rangers in the first incident, and Chenab Rangers in the second incident, retaliated with intention to stop the firing.
Question: What is the reason punjab Rangers and Chenab Rangers retaliated?
Answer: with intention to stop the firing

Sentence: The vessel split in two and is leaking fuel oil.
Question: After what did the vessel leak fuel oil?
Answer: The vessel split in two

Sentence: In contrast to the predictions of the Met Office, the Environment Agency have said that floods could remain in some areas of England until March, and that up to 3,000 homes in the Thames Valley could be flooded over the weekend.
Question: What is contrasted with the predictions of the Met Office?
Answer: the Environment Agency have said that floods could remain in some areas of England until March, and that up to 3,000 homes in the Thames Valley could be flooded over the weekend

Table 14: Examples for QA pairs that were annotated in the dataset.

Sentence: Standard addition can be applied to most analytical techniques and is used instead of a calibration curve to solve the matrix effect problem.
Question: Instead of what is standard addition used?
Answer: a calibration curve

Sentence: State officials therefore share the same interests as owners of capital and are linked to them through a wide array of social, economic, and political ties.
Question: What is similar to state officials share the same interests as owners of capital?
Answer: are linked to them through a wide array of social, economic, and political ties

Sentence: Recently, this field is rapidly progressing because of the rapid development of the computer and camera industries.
Question: What is the reason this field is rapidly progressing?
Answer: Because of the rapid development of the computer and camera industries

Sentence: Civilization was the product of the Agricultural Neolithic Revolution; as H. G. Wells put it, "civilization was the agricultural surplus."
Question: In what manner was civilization the product of the Agricultural Neolithic Revolution?
Answer: civilization was the agricultural surplus

Sentence: The portrait shows such ruthlessness in Innocent's expression that some in the Vatican feared that Velazquez would meet with the Pope's displeasure, but Innocent was well pleased with the work, hanging it in his official visitor's waiting room.
Question: Despite what was Innocent well pleased with The work?
Answer: The portrait shows such ruthlessness in Innocent's expression that some in the Vatican feared that Velazquez would meet with the Pope's displeasure

Sentence: All tropical cyclones lose strength once they make landfall.
Question: After what do tropical cyclones lose strength?
Answer: once they make landfall

Sentence: The investigation, led by former Dutch General Patrick Cammaert, is separate from the investigation led by the UN's Human Rights Council.
Question: What is contrasted with the investigation, led by former Dutch General Patrick Cammaert?
Answer: the investigation led by the UN's Human Rights Council

Table 15: Examples for QA pairs that were predicted with the full pipeline.

Sentence: It was "the Soviets' Vietnam." The Kabul regime would fall.
Question: After what would the Kabul regime fall?
Answer: after the Soviets' Vietnam
PDTB senses: Expansion.Conjunction

Sentence: Eight months after Gen. Boris Gromov walked across the bridge into the U.S.S.R., a Soviet-controlled regime remains in Kabul, the refugees sit in their camps, and the restoration of Afghan freedom seems as far off as ever.
Question: What is the reason the restoration of Afghan freedom seems as far off as ever?
Answer: a Soviet-controlled regime remains in Kabul
PDTB senses: Temporal.Asynchronous.Succession, Expansion.Conjunction

Sentence: Soviet leaders said they would support their Kabul clients by all means necessary–and did.
Question: In what manner would soviet leaders support their Kabul clients?
Answer: soviet leaders said they would support their kabul clients by all means necessary
PDTB senses: Expansion.Conjunction

Sentence: Soviet leaders said they would support their Kabul clients by all means necessary–and did.
Question: After what did Soviet leaders support their Kabul clients by all means necessary?
Answer: after soviet leaders said they would
PDTB senses: Expansion.Conjunction

Sentence: With the February 1987 U.N. accords "relating to Afghanistan," the Soviet Union got everything it needed to consolidate permanent control.
Question: In what manner did the Soviet Union get everything it needed to consolidate permanent control?
Answer: with the February 1987 u.n. accords "relating to Afghanistan,"
PDTB senses: Contingency.Cause.Reason, Contingency.Purpose.Arg2-as-goal

Sentence: The terms of the Geneva accords leave Moscow free to provide its clients in Kabul with assistance of any kind–including the return of Soviet ground forces–while requiring the U.S. and Pakistan to cut off aid.
Question: What is the result of the terms of the Geneva accords?
Answer: leaving Moscow free to provide its clients in Kabul with assistance of any kind while requiring the U.S. and Pakistan to cut off aid
PDTB senses: Temporal.Synchronous, Comparison.Contrast

Sentence: The only fly in the Soviet ointment was the last-minute addition of a unilateral American caveat, that U.S. aid to the resistance would continue as long as Soviet aid to Kabul did.
Question: What is the reason for the only fly in the Soviet ointment?
Answer: the last-minute addition of a unilateral American caveat, that U.S. aid to the resistance would continue as long as Soviet aid to Kabul did
PDTB senses: Expansion.Level-of-detail.Arg2-as-detail, Contingency.Condition.Arg2-as-cond, Temporal.Synchronous

Sentence: But as soon as the accords were signed, American officials sharply reduced aid.
Question: In what manner did American officials reduce aid?
Answer: American officials sharply reduced aid
PDTB senses: Temporal.Asynchronous.Succession

Sentence: Moscow claims this is all needed to protect the Kabul regime against the guerrilla resistance.
Question: What is the reason Moscow claims this is all needed?
Answer: to protect the Kabul regime against the guerrilla resistance
PDTB senses: Contingency.Condition.Arg2-as-cond

Sentence: But this is not the entire Afghan army, and it is no longer Kabul's only military force.
Question: What is similar to it not being the entire Afghan army?
Answer: is no longer Kabul's only military force.
PDTB senses: Expansion.Conjunction

Sentence: The deal fell through, and Kandahar remains a major regime base.
Question: After what did Kandahar remain a major regime base?
Answer: after the deal fell through
PDTB senses: Contingency.Cause.Result, Expansion.Conjunction

Sentence: The deal fell through, and Kandahar remains a major regime base.
Question: Since when does Kandahar remain a major regime base?
Answer: since the deal fell through
PDTB senses: Contingency.Cause.Result, Expansion.Conjunction

Sentence: The wonder is not that the resistance has failed to topple the Kabul regime, but that it continues to exist and fight at all.
Question: Despite what is the wonder that it continues to exist and fight at all?
Answer: despite the resistance failing to topple the kabul regime
PDTB senses: Comparison.Contrast, Expansion.Substitution.Arg2-as-subst

Sentence: Last summer, in response to congressional criticism, the State Department and the CIA said they had resumed military aid to the resistance months after it was cut off; but it is not clear how much is being sent or when it will arrive.
Question: what is the result of congressional criticism last summer?
Answer: the state department and the CIA said they had resumed military aid to the resistance
PDTB senses: Temporal.Asynchronous.Succession, Comparison.Concession.Arg2-as-denier, Expansion.Conjunction

Table 16: Examples for additional relations expressed through QA pairs that do not appear in the PDTB, Part 1.

Sentence: Beyond removing a competitor, the combination should provide "synergies," said Fred Harlow, Unilab's chief financial officer.
Question: While what should the combination provide synergies?
Answer: removing a competitor.
PDTB senses: Expansion.Conjunction

Sentence: In Los Angeles, for example, Central has had a strong market position while Unilab's presence has been less prominent, according to Mr. Harlow.
Question: In what case has Central had a strong market position while Unilab's presence has been less prominent?
Answer: in Los Angeles
PDTB senses: Comparison.Contrast, Temporal.Synchronous

Sentence: A Daikin executive in charge of exports when the high-purity halogenated hydrocarbon was sold to the Soviets in 1986 received a suspended 10-month jail sentence.
Question: What is the result of the high-purity halogenated hydrocarbon being sold to the Soviets in 1986?
Answer: a Daikin executive in charge of exports received a suspended 10-month jail sentence
PDTB senses: Temporal.Synchronous

Sentence: In Los Angeles, for example, Central has had a strong market position while Unilab's presence has been less prominent, according to Mr. Harlow.
Question: in what case has central had a strong market position while Unilab's presence has been less prominent?
Answer: in Los Angeles
PDTB senses: Comparison.Contrast, Temporal.Synchronous

Sentence: Mr. Mehl noted that actual rates are almost identical on small and large-denomination CDs, but yields on CDs aimed at the individual investor are boosted by more frequent compounding.
Question: In what manner are yields on CDs aimed at the individual investor boosted?
Answer: by more frequent compounding
PDTB senses: Comparison.Concession.Arg2-as-denier

Sentence: Judge Masaaki Yoneyama told the Osaka District Court Daikin's "responsibility is heavy because illegal exports lowered international trust in Japan." Sale of the solution in concentrated form to Communist countries is prohibited by Japanese law and by international agreement.
Question: Except when is the solution in concentrated form sold?
Answer: except to communist countries
PDTB senses: Contingency.Cause.Reason, EntRel

Sentence: Japan has supported a larger role for the IMF in developing-country debt issues, and is an important financial resource for IMF-guided programs in developing countries.
Question: In what case is Japan an important financial resource for imf-guided programs?
Answer: in developing countries
PDTB senses: Expansion.Conjunction

Sentence: Japan has supported a larger role for the IMF in developing-country debt issues, and is an important financial resource for IMF-guided programs in developing countries.
Question: While what has Japan supported a larger role for the IMF in developing-country debt issues?
Answer: while it is an important financial resource for imf-guided programs in developing countries
PDTB senses: Expansion.Conjunction

Sentence: The last U.S. congressional authorization, in 1983, was a political donnybrook and carried a $6 billion housing program along with it to secure adequate votes.
Question: What is an example of something being a political donnybrook?
Answer: the last u.s. congressional authorization, in 1983
PDTB senses: Contingency.Purpose.Arg2-as-goal, Expansion.Conjunction

Sentence: Instead, the tests will focus heavily on new blends of gasoline, which are still undeveloped but which the petroleum industry has been touting as a solution for automobile pollution that is choking urban areas.
Question: What is the reason tests will focus heavily on new blends of gasoline?
Answer: the petroleum industry has been touting as a solution for automobile pollution
PDTB senses: Comparison.Concession.Arg2-as-denier

Sentence: While major oil companies have been experimenting with cleaner-burning gasoline blends for years, only Atlantic Richfield Co. is now marketing a lower-emission gasoline for older cars currently running on leaded fuel.
Question: While what is Atlantic Richfield co. marketing a lower-emission gasoline for older cars currently running on leaded fuel?
Answer: while major oil companies have been experimenting with cleaner-burning gasoline blends
PDTB senses: Comparison.Contrast

Table 17: Examples for additional relations expressed through QA pairs that do not appear in the PDTB, Part 2.

Sentence: Instead, a House subcommittee adopted a clean-fuels program that specifically mentions reformulated gasoline as an alternative.
Question: What is the result of a house subcommittee adopting a clean-fuels program?
Answer: reformulated gasoline as an alternative.
PDTB senses: Expansion.Level-of-detail.Arg2-as-detail

Sentence: The Bush administration has said it will try to resurrect its plan when the House Energy and Commerce Committee takes up a comprehensive clean-air bill.
Question: In what case will the Bush administration try to resurrect its plan?
Answer: when the house energy and commerce committee takes up a comprehensive clean-air bill
PDTB senses: Temporal.Synchronous

Sentence: That compares with per-share earnings from continuing operations of 69 cents the year earlier; including discontinued operations, per-share was 88 cents a year ago.
Question: In what manner does that compare with per-share earnings from continuing operations of 69 cents the year earlier?
Answer: including discontinued operations, per-share was 88 cents a year ago.
PDTB senses: Comparison.Contrast

Sentence: Analysts estimate Colgate's sales of household products in the U.S. were flat for the quarter, and they estimated operating margins at only 1% to 3%
Question: While what did analysts estimate Colgate's sales of household products in the U.S. were flat for the quarter?
Answer: they estimated operating margins at only 1% to 3%
PDTB senses: Expansion.Conjunction

Sentence: Analysts estimate Colgate's sales of household products in the U.S. were flat for the quarter, and they estimated operating margins at only 1% to 3%
Question: After what did analysts estimate Colgate's sales of household products in the U.S. were flat?
Answer: after the quarter
PDTB senses: Expansion.Conjunction

Sentence: The programs will be written and produced by CNBC, with background and research provided by staff from U.S. News
Question: What is similar to the programs being written by CNBC?
Answer: being produced by CNBC
PDTB senses: Expansion.Conjunction

Sentence: The programs will be written and produced by CNBC, with background and research provided by staff from U.S. News
Question: In what manner will background and research be provided for the programs?
Answer: by staff from U.S. news
PDTB senses: Expansion.Conjunction

Sentence: The programs will be written and produced by CNBC, with background and research provided by staff from U.S. News
Question: In what manner will the programs be written and produced?
Answer: the programs will be written and produced by CNBC, with background and research provided by staff from U.S. news
PDTB senses: Expansion.Conjunction

Sentence: Frank Carlucci III was named to this telecommunications company's board, filling the vacancy created by the death of William Sobey last May.
Question: After what was Frank Carlucci III named to this telecommunications company's board?
Answer: the death of William Sobey last may
PDTB senses: Contingency.Cause.Result

Sentence: Weyerhaeuser's pulp and paper operations were up for the nine months, but full-year performance depends on the balance of operating and maintenance costs, plus pricing of certain products, the company said.
Question: What is contrasted with full-year performance of Weyerhaeuser's pulp and paper operations?
Answer: nine month performance
PDTB senses: Comparison.Concession.Arg2-as-denier

Sentence: Weyerhaeuser's pulp and paper operations were up for the nine months, but full-year performance depends on the balance of operating and maintenance costs, plus pricing of certain products, the company said.
Question: What is the result of Weyerhaeuser's full-year performance?
Answer: depends on the balance of operating and maintenance costs, plus pricing of certain products.
PDTB senses: Comparison.Concession.Arg2-as-denier

Table 18: Examples for additional relations expressed through QA pairs that do not appear in the PDTB, Part 3.