
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 794–805, November 16–20, 2020. ©2020 Association for Computational Linguistics


Back to the Future: Unsupervised Backprop-based Decoding for Counterfactual and Abductive Commonsense Reasoning

Lianhui Qin†‡  Vered Shwartz†‡  Peter West†‡  Chandra Bhagavatula‡  Jena D. Hwang‡  Ronan Le Bras‡  Antoine Bosselut†‡  Yejin Choi†‡

†Paul G. Allen School of Computer Science & Engineering, University of Washington
‡Allen Institute for Artificial Intelligence

{lianhuiq, pawest, antoineb, yejin}@cs.washington.edu
{vered, chandrab, jenah, ronanlb}@allenai.org

    Abstract

Abductive and counterfactual reasoning, core abilities of everyday human cognition, require reasoning about what might have happened at time t, while conditioning on multiple contexts from the relative past and future. However, simultaneous incorporation of past and future contexts using generative language models (LMs) can be challenging, as they are trained either to condition only on the past context or to perform narrowly scoped text-infilling.

In this paper, we propose DELOREAN, a new unsupervised decoding algorithm that can flexibly incorporate both the past and future contexts using only off-the-shelf, left-to-right language models and no supervision. The key intuition of our algorithm is incorporating the future through back-propagation, during which we only update the internal representation of the output while fixing the model parameters. By alternating between forward and backward propagation, DELOREAN can decode the output representation that reflects both the left and right contexts. We demonstrate that our approach is general and applicable to two nonmonotonic reasoning tasks: abductive text generation and counterfactual story revision, where DELOREAN outperforms a range of unsupervised and some supervised methods, based on automatic and human evaluation.1

    1 Introduction

Everyday causal reasoning requires reasoning about the likely explanations to partially observable past and future (abductive reasoning (Peirce, 1960)) and reasoning about the alternative future based on counterfactual past (counterfactual reasoning). Such nonmonotonic reasoning requires inferring plausible but potentially defeasible conclusions from incomplete or hypothetical observations (Reiter, 1988). While humans are remarkably good at this type of causal reasoning, developing AI systems capable of nonmonotonic reasoning for a wide range of situations describable in natural language has been a major open research question.

1 Code is available at https://github.com/qkaren/unsup_gen_for_cms_reasoning

Figure 1: DELOREAN, our proposed method, with generated reasoning results. Top: the goal in abductive reasoning is to generate a hypothesis (Y) of what happened between the observed past (X) and future (Z) contexts. Bottom: In counterfactual reasoning, given a story context altered by a counterfactual condition, X, and the original ending Z, the goal is to generate a new ending Y which is coherent with X while remaining similar to Z. The story from TIMETRAVEL (Qin et al., 2019a) consists of five sentences. Our approach alternates forward (left-to-right) and backward (right-to-left) passes that iteratively refine the generated texts w.r.t. context from each side.

More concretely, with abductive reasoning, the goal is to find the most plausible explanation for incomplete observations (Peirce, 1960). In the top part of Figure 1, given the first observation that Ray is "making his daughter a swing" and the later observation that he "ran to [her] to make sure she was okay," we can hypothesize that she somehow got hurt by the swing.

In contrast, counterfactual reasoning concerns the causal changes to future events given a change in the past condition (i.e., "counterfactual condition"; Goodman, 1947). For example, the bottom part of Figure 1 shows the original five-sentence story (S1, ..., S5) and an alternative counterfactual condition given in S′2—that instead of being a generic "Halloween party", the new counterfactual condition is that it is going to be a "Game of Thrones themed party"! Given these, the problem we want to solve is to update the future events (S′3, ..., S′5), so that instead of "Zeke dressed up as skeleton", we have "Zeke dressed up like a Stark".2

Recently, two tasks and corresponding benchmarks have been introduced to tackle language-based nonmonotonic reasoning: the ART dataset for abductive NLG (Bhagavatula et al., 2019), and the TIMETRAVEL dataset for counterfactual story rewriting (Qin et al., 2019a). Both tasks are framed as conditional generation, with multiple contexts to condition on. The currently dominant paradigm for conditional text generation tasks is fine-tuning pre-trained language models (LMs), such as GPT2 (Radford et al., 2019a), on large-scale training data for supervision. However, despite the large number of training examples, supervised approaches still perform considerably worse than humans and are subject to developing superficial strategies such as repeating the observations as is or memorizing prevalent surface patterns specific to the dataset (Qin et al., 2019a). Furthermore, having to require large-scale training data for each domain and task would be utterly inefficient for broad-coverage nonmonotonic reasoning in language.

In this paper, we investigate an alternative path toward language-based nonmonotonic reasoning using pre-trained language models as is. Intuitively, both the abductive and counterfactual reasoning

2 "Lannister" in S′3 and "Stark" in S′4 and S′5 refer to character names in the TV show "Game of Thrones." All the output text shown in Figure 1 is the actual system output from DELOREAN.

requires learning coherent patterns in narrative, which should be already available in large-scale pretrained language models. However, the key challenge is that most generative language models are trained to condition only on the left context, or to perform narrowly scoped text-infilling.

This paper presents DELOREAN: DEcoding for nonmonotonic LOgical REAsoNing, an unsupervised decoding algorithm that only assumes off-the-shelf left-to-right language models with no supervision. The key intuition of our algorithm is incorporating the future through back-propagation, during which we only update the internal representation of the output while fixing the model parameters. More specifically, DELOREAN alternates between the forward and backward passes, where the forward pass performs left-to-right inference given the left context (roughly maximizing P(Y|X) in Figure 1), while the backward pass instills the right constraint through right-to-left backpropagation with a task-specific loss (roughly maximizing P(Z|XY)). The forward and backward outputs are mixed into a single vector, from which tokens are sampled to generate the desired output. To choose the best output across iterations, we employ an unsupervised ranking step based on BERT's next sentence prediction task to measure coherence (Devlin et al., 2018).

On both tasks, DELOREAN outperforms all other unsupervised methods in terms of both automatic metrics and human evaluation, demonstrating that nonmonotonic reasoning through conditional decoding is a promising research direction. Moreover, outputs produced by our model are judged as more coherent than those from the supervised models. In sum, our study shows that backpropagation-based decoding may enable additional future applications of unsupervised generation and reasoning.

    2 Background

Most NLP benchmarks have focused on reasoning about information that is entailed from the premise. For instance, natural language inference (NLI; Bowman et al., 2015) focuses primarily on whether a hypothesis is entailed from a given premise, which means the information stated in the hypothesis is a subset of the information provided in the premise. However, it has been noted that human reasoning is often the other way, where hypotheses often contain new information that was not available in the premise, but plausibly true (but possibly defeasible with new additional context) (Johnson-Laird, 2006; Mercier and Sperber, 2017). This type of reasoning corresponds to nonmonotonic reasoning (Kraus et al., 1990), as it contradicts the monotonicity property according to which valid arguments cannot be made invalid by adding premises. We study two tasks of that nature: abductive reasoning (§2.1) and counterfactual reasoning (§2.2).

Figure 2: Illustration of the DELOREAN decoding procedure, using abductive reasoning as an example. At initialization (upper-left box), the language model (LM) initializes the logits Ỹ = {ỹ_1, ..., ỹ_N} of the hypothesis by reading the past context X and generating a continuation with regular decoding. At each forward-backward iteration, we compute the task-specific loss L(Ỹ) of the logits based on the future constraint Z (red box). The backward pass then performs back-propagation and produces the backward logits Ỹ^b = {ỹ^b_1, ..., ỹ^b_N}. In the subsequent forward pass, for each step n, we compute the forward logits ỹ^f_n conditioning on the preceding logits ỹ_{n−1}, and then mix it with the respective backward logits ỹ^b_n to produce the new logits ỹ_n at step n.

    2.1 Abductive Reasoning

Abductive reasoning aims at finding the most likely explanation to partial observations (Peirce, 1960). It has a central role in the human ability to "read between the lines," and is crucial for language acquisition (Andersen, 1973), understanding sentences in discourse (Hobbs et al., 1993), and many more. Despite the importance, however, relatively little focus has been given to it in NLP research.

Recently, Bhagavatula et al. (2019) propose the abductive reasoning task. Given two observations, the goal is to determine the most likely explanation of what happened in-between. The dataset introduced for the task, ART, consists of 20k observations derived from the first and last sentence of stories in the ROCStories dataset (Mostafazadeh et al., 2016a). We focus on the abductive NLG setup introduced in the paper, which is framed as a conditional generation task where a plausible explanation to the observations must be generated using language. The authors reported the performance of several pre-trained LM-based baselines and showed promises and limitations of such approaches.

    2.2 Counterfactual Reasoning

Counterfactual reasoning aims at inferring alternative past events that could have happened given a certain change in conditions (Goodman, 1947; Starr, 2019). While counterfactual reasoning plays an important role in AI systems (Isard, 1974; Ginsberg, 1986), it requires causal reasoning abilities, which are arguably absent from current association-based AI (Pearl and Mackenzie, 2018). While there has been work on counterfactual reasoning in NLP, including recognizing counterfactuals in text (Son et al., 2017), and improving the performance of NLP tasks using counterfactual learning (Lawrence et al., 2017; Lawrence and Riezler, 2018), it remains a major research challenge.

Recently, Qin et al. (2019a) introduce the task of counterfactual story generation. Given a 5-sentence original story, and an alternative context in which the second sentence of the story was altered by a counterfactual, the task is to generate a new 3-sentence story ending that addresses the alternative beginning while minimally editing the original ending. The associated TIMETRAVEL dataset is based on fictional narratives from ROCStories, for which counterfactual contexts and alternative endings are crowdsourced, yielding 29,849 problem instances. Qin et al. (2019a) report several baseline performances, and find that models based on pre-trained LMs produce output that recognizes the counterfactual but generates endings that deviate considerably from the original storyline. In contrast, in the supervised setup, models optimize the easier of the two goals and generate endings that are overly similar to the original endings.

    3 The DELOREAN Approach

Humans make inferences based on available information and refine them when new information arrives. Since currently available pre-trained LMs generate text by sequentially predicting the next token from left to right, they are incapable of conditioning on future constraints. Therefore, we propose DELOREAN: an unsupervised backprop-based decoding algorithm, which is summarized in Algorithm 1, illustrated in Figure 2, and detailed below. DELOREAN intermittently refines the predictions to cohere with either the context or the constraints (Section 3.1). The candidate generations are then ranked by coherence (Section 3.2).

    3.1 Decoding Strategy

Given context text X, the goal is to generate continuation text Y = (y_1, ..., y_N), such that Y satisfies certain constraints according to the reasoning tasks, usually defined based on another context Z (see Figure 1; we discuss the task-specific constraints in the respective task sections).

Algorithm 1: DELOREAN Decoding
Input: Pre-trained language model (LM), context X, future constraint Z
    Initialize logits Ỹ^(0)
    Initialize Ys, the list of candidate generations
    for t ← 1 to T do
        // Backward pass
        for n ← N to 1 do
            Compute backward logits ỹ^b_n, Eq. (1)
        end for
        // Forward pass
        for n ← 1 to N do
            Compute forward logits ỹ^f_n, Eq. (2)
            Mix forward and backward logits, Eq. (3)
        end for
        Sample candidate Y from logits Ỹ and add to Ys
    end for
    Rank Ys by coherence
Output: The most coherent generated text Y from Ys

The proposed approach interleaves two procedures, namely forward and backward, that produce and iteratively refine the generation, for a predefined number of iterations T. In particular, the forward pass ensures the generated text is a fluent continuation of the context X, while the backward pass informs the model about the constraint and steers the generation to satisfy it.

As detailed below, the backward pass uses gradient descent to update the generation Y. However, Y is a discrete text that is not differentiable. Instead, throughout the algorithm, we maintain a soft representation of the sequence Ỹ = (ỹ_1, ..., ỹ_N), where ỹ_n ∈ R^V represents the logits of the n-th token and V is the vocabulary size. After the logits are refined over multiple iterations of the forward and backward passes, we generate discrete text at each step by sampling from y_n ∼ softmax(ỹ_n/τ), where τ > 0 is the temperature.

We start by initializing the logits before the first iteration, Ỹ^(0) = (ỹ^(0)_1, ..., ỹ^(0)_N), by feeding the context X into the LM and greedily decoding N continuation tokens.

Backward  The backward pass uses gradient backpropagation to update the generation with respect to the constraint. Specifically, we express the task-specific constraint as a loss function L(X, Ỹ^(t−1), Z) that evaluates how well the generation Y (approximated with the soft representation Ỹ) obeys the constraint (see the subsequent sections for concrete instantiations of the loss). The goal of this pass is thus to minimize the loss w.r.t. the generation. Specifically, at iteration t, for each step n in the generation, we update its logits with:

    ỹ_n^{(t),b} = ỹ_n^{(t−1)} − λ · ∇_{ỹ_n} L(X, Ỹ^{(t−1)}, Z),    (1)

where ∇_{ỹ_n} L(X, Ỹ^{(t−1)}, Z) is the gradient of the constraint-informed loss L w.r.t. the n-th logits, and λ ∈ R is the step size. In practice, we may repeat the gradient updates multiple times in a single pass.

Forward  The forward pass ensures that Y is fluent and coherent with the preceding context X. At iteration t, for a particular step n, we compute the forward logits with the LM:

    ỹ_n^{(t),f} = LM(X, Ỹ^{(t)}_{1:n−1}).    (2)

We then mix the n-th-step forward and backward logits to get the final logits of iteration t:

    ỹ_n^{(t)} = γ · ỹ_n^{(t),f} + (1 − γ) · ỹ_n^{(t),b},    (3)

where 0 < γ < 1 is the mixing weight. The resulting logits ỹ_n^{(t)} are then fed to the LM to compute the forward logits at the (n+1)-th step (Eq. 2). This way, information from the backward pass is integrated into the left-to-right generation process to produce text that is informed by the constraint.

We pre-define the number of tokens N required by the backward pass, but we allow the forward pass to generate more than N tokens if those are needed to obtain complete sentences. In that case, we set the logits of the extra tokens to the forward logits, without mixing: ỹ_n^{(t)} = ỹ_n^{(t),f} for n > N. We then prune any trailing tokens in the sampled text to get complete sentences.
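To make the alternating passes concrete, the sketch below shows one possible PyTorch realization of Eqs. (1)–(3) on top of a Hugging Face GPT-2 model. It is a minimal illustration under our own simplifying assumptions (a single gradient step per backward pass, soft token embeddings obtained by multiplying the softmaxed logits with the embedding matrix, illustrative hyperparameter values), not the authors' released implementation; the `task_loss` argument stands for the task-specific loss L(X, Ỹ, Z) instantiated in Sections 4 and 5.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
embed = model.get_input_embeddings()  # token embedding matrix E

def soft_embed(logits, tau=1.0):
    # Soft input embeddings for a sequence of logits: softmax(logits / tau) @ E.
    return torch.softmax(logits / tau, dim=-1) @ embed.weight

def delorean_decode(x_ids, z_ids, task_loss, T=10, N=20,
                    step_size=1.0, gamma=0.9, tau=1.0):
    """x_ids, z_ids: (1, len) token-id tensors (on the model's device) for X and Z.
    task_loss(x_ids, y_logits, z_ids) must be differentiable w.r.t. y_logits."""
    # Initialization: greedy continuation of X, keeping the LM logits of each step.
    ids, init_logits = x_ids, []
    with torch.no_grad():
        for _ in range(N):
            step = model(ids).logits[:, -1, :]
            init_logits.append(step)
            ids = torch.cat([ids, step.argmax(-1, keepdim=True)], dim=1)
    y_logits = torch.stack(init_logits, dim=1)          # shape (1, N, V)

    candidates = []
    for _ in range(T):
        # Backward pass (Eq. 1): one gradient step on the constraint loss.
        y_logits = y_logits.detach().requires_grad_(True)
        task_loss(x_ids, y_logits, z_ids).backward()
        y_backward = (y_logits - step_size * y_logits.grad).detach()

        # Forward pass (Eqs. 2-3): left-to-right logits, mixed with the backward logits.
        mixed = y_logits.detach().clone()
        x_emb = embed(x_ids)
        with torch.no_grad():
            for n in range(N):
                prefix = torch.cat([x_emb, soft_embed(mixed[:, :n], tau)], dim=1)
                forward_n = model(inputs_embeds=prefix).logits[:, -1, :]
                mixed[:, n] = gamma * forward_n + (1 - gamma) * y_backward[:, n]
        y_logits = mixed

        # Sample one candidate per iteration from the refined logits.
        probs = torch.softmax(y_logits[0] / tau, dim=-1)
        tokens = torch.multinomial(probs, 1).squeeze(-1)
        candidates.append(tokenizer.decode(tokens.tolist()))
    return candidates  # to be ranked by the coherence score of Section 3.2
```

The released implementation additionally handles multiple gradient steps per backward pass and the sentence-boundary pruning described above; the sketch omits those details.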

3.2 Ranking

The output of the decoding step is a list of candidate generations, one per iteration: Ys = {Y^(t) | t = 1, ..., T}. We further use an unsupervised approach to rank and pick the best sample as the final output. Specifically, we take advantage of the BERT model, which was pre-trained with a next-sentence prediction (NSP) objective. Given two sentences A and B, we use NSP to compute the likelihood of B following A as a proxy for coherence:

    c(A, B) = BERT_NSP(A, B),    (4)

where c(·, ·) denotes the coherence score. This score is used to evaluate the quality of a given candidate continuation Y by measuring (1) its compatibility with the subsequent text of the context X, (2) the internal consistency of Y if it consists of multiple sentences, and (3) the compatibility of Y with its right-side text when it is applicable.
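A minimal sketch of the coherence score in Eq. (4), using the Hugging Face `BertForNextSentencePrediction` head, is shown below. Reading the probability of the "B follows A" label as c(A, B) is our interpretation, and `bert-base-uncased` is an assumed choice of checkpoint.

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

nsp_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
nsp_model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased").eval()

def coherence(a: str, b: str) -> float:
    # c(A, B): probability under BERT's NSP head that sentence B follows sentence A.
    enc = nsp_tokenizer(a, b, return_tensors="pt")
    with torch.no_grad():
        logits = nsp_model(**enc).logits          # shape (1, 2)
    # Label 0 of the NSP head means "B is a continuation of A".
    return torch.softmax(logits, dim=-1)[0, 0].item()

# Example usage (sentences from Figure 1):
# coherence("Ray hung a tire on a rope to make his daughter a swing.",
#           "She hit the rope and the tire fell on top of her.")
```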

Model                   BLEU-4   ROUGE-L   BERT
Supervised
  Sup                    32.82     25.60   49.38
  +COMET-Emb             33.97     26.06   49.71
Unsupervised
  Zero-Shot_X            18.30     14.99   39.36
  Zero-Shot_ZX           15.90     14.23   40.03
  Zero-Shot_X-Ranked     19.24     16.76   41.58
  Zero-Shot_ZX-Ranked    20.13     17.25   41.93
  DELOREAN               22.60     18.94   42.86
Human                    53.56     30.40   53.30

Table 1: Automatic evaluation results on the abductive task, using the test set of ART.

    4 Task 1: Abductive Reasoning

Each instance in the ART dataset consists of two observations O1, O2 and a hypothesis H that explains the two observations. These inputs naturally map to X, Z and Y in our framework. Formally, the abductive generation task aims to maximize P(Y|X, Z), i.e., models must consider both left and right contexts (X and Z) jointly.

    4.1 Task Setup

Constraints  We maximize Z given XỸ by defining the loss function as the cross-entropy loss of generating Z given XỸ with the LM:3

    L(X, Ỹ, Z) := −Σ_{n=1}^{N_Z} log P_LM(z_n | X, Ỹ, Z_{1:n−1}),    (5)

where P_LM(a_j | a_{1:j−1}) is the likelihood of generating token a_j given the preceding text a_{1:j−1}.
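Below is one way Eq. (5) could be implemented so that it stays differentiable with respect to the soft hypothesis logits: X and Z enter the LM as ordinary token embeddings, the hypothesis enters as a softmax-weighted mixture of embeddings, and the cross-entropy is taken only over the positions of Z. It reuses `model`, `embed`, and `soft_embed` from the decoding sketch in Section 3.1 and plugs into its `task_loss` argument; this is our own approximation, not the released code.

```python
import torch
import torch.nn.functional as F

def abductive_loss(x_ids, y_logits, z_ids, tau=1.0):
    # Eq. (5): cross-entropy of generating Z after the context X and the soft hypothesis Y.
    x_emb = embed(x_ids)                # (1, N_X, d), hard context tokens
    y_emb = soft_embed(y_logits, tau)   # (1, N,   d), differentiable w.r.t. y_logits
    z_emb = embed(z_ids)                # (1, N_Z, d), hard constraint tokens
    logits = model(inputs_embeds=torch.cat([x_emb, y_emb, z_emb], dim=1)).logits

    n_z = z_ids.size(1)
    # The logits predicting z_1 ... z_{N_Z} sit one position before each z token.
    pred = logits[:, -n_z - 1:-1, :]
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                           z_ids.reshape(-1), reduction="sum")
```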

Ranking  We rank candidates by the overall coherence after inserting Y in between X and Z:

    ranking_score(Y) = c(XY, Z) + c(X, YZ).    (6)

Hyperparameters  We use GPT2-345M (Radford et al., 2019b) as the pre-trained LM for all models. We use the ART development set to select hyperparameters. We use greedy decoding for our method and top-k decoding (Fan et al., 2018) (k = 40, τ = 0.7) for our baselines. Other hyperparameters are outlined in Appendix A.1.
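For reference, the baselines' top-k decoding roughly corresponds to a standard sampling call such as the sketch below; `gpt2-medium` (the Hugging Face checkpoint closest to GPT2-345M), the prompt, and the generation length are assumptions for illustration only.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2-medium")
lm = GPT2LMHeadModel.from_pretrained("gpt2-medium")

prompt = tok("Ray hung a tire on a rope to make his daughter a swing.",
             return_tensors="pt")
out = lm.generate(**prompt, do_sample=True, top_k=40, temperature=0.7,
                  max_new_tokens=30, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```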

    4.2 Experimental Setup

Baselines  We compare our method against baselines from Bhagavatula et al. (2019). The unsupervised baselines use a pre-trained GPT-2 model to generate Y given a prompt text—either the observation X alone (Zero-Shot_X) or Z〈e〉X (Zero-Shot_ZX), where 〈e〉 denotes a special end-of-text token. The supervised method (Sup) follows the same input format as Zero-Shot_ZX, but fine-tunes GPT-2 on the ART training set. Finally, our knowledge-informed baseline (+COMET-Emb) further augments the representation of Sup with knowledge from COMET (Bosselut et al., 2019).

3 Note that this is applied to each prefix of Ỹ, although some of them are not complete sentences.

Figure 3: Examples of generated hypotheses on three abductive reasoning cases. Given observations O1 and O2, DELOREAN generates a hypothesis explaining the observations.

To separately study the contribution of our decoding strategy and ranking component, we also report the performance of ranking the baseline outputs. Specifically, we let each baseline generate 20 candidates and rank them by coherence (Eq. 6).4

    4.3 Results

Automatic Evaluation  We report the same metrics as Bhagavatula et al. (2019): BLEU-4 (Papineni et al., 2002), ROUGE-L (Lin, 2004) and BERTSCORE (Zhang et al., 2019) (with the bert-base-uncased model). The results in Table 1 show that DELOREAN performs best among the unsupervised systems across all metrics. We also note that our ranking step improves both the performance of our model and that of the zero-shot baselines.

Human Evaluation  We conduct two sets of human evaluations on 100 test examples using crowdworkers from Amazon Mechanical Turk. In the scoring setting, presented in Table 2, workers were presented a pair of observations (X and Z) and a generated hypothesis Y, and asked to rate the coherence of the hypothesis with respect to the observation X (X-Y), the observation Z (Y-Z), and both (X-Y-Z), on a 4-point Likert scale. In the pairwise comparison setting, presented in Table 3, workers were presented the outputs from a pair of systems (DELOREAN and baseline) and asked to choose the better output in terms of the same coherence criteria. Each example was labeled by 3 workers.5

4 We tried ablating the ranking component from our method in preliminary experiments, and found that ranking is essential to obtaining good performance. By adding ranking to our baselines, we assess the contribution of our decoding strategy.

Model                   X-Y     Y-Z     X-Y-Z
Supervised
  Sup                   0.510   0.375   0.314
  +COMET-Emb            0.466   0.342   0.286
Unsupervised
  Zero-Shot_ZX          0.233   0.103   0.108
  Zero-Shot_X-Ranked    0.478   0.208   0.195
  Zero-Shot_ZX-Ranked   0.474   0.238   0.236
  DELOREAN              0.522   0.325   0.297
Human                   0.879   0.823   0.783

Table 2: Human calibration results on the test set of ART. All scores are normalized to [0, 1].

Overall - Human Judges Preferred
Our model        Neutral        Comparator
DELOREAN   21%     43%     36%  Sup
DELOREAN   25%     44%     31%  +COMET-Emb
DELOREAN   23%     62%     15%  Zero-Shot_X-Ranked
DELOREAN   27%     50%     23%  Zero-Shot_ZX-Ranked
DELOREAN    3%     11%     86%  Human

Table 3: Human pairwise comparison results on the test set of ART, between DELOREAN and each of the baselines, by jointly considering all 3 criteria from Table 2. "Neutral" means "equally good/bad".

In both evaluation setups, our method substantially outperforms the unsupervised baselines, achieving a relative improvement of 36%–215% with respect to Y-Z coherence. Our method also outperforms the supervised methods with respect to X-Y coherence (Table 2), and achieves competitive performance in the pairwise comparison (Table 3).

5 The average inter-rater agreement measured by Fleiss' κ = 0.44 ("moderate agreement") (Fleiss, 1971).


Model                         BLEU    ROUGE   BERT
Supervised + Discriminative
  Sup+Disc                    75.71   72.72   62.39
Unsupervised + Discriminative
  Recon+CF                    75.92   70.93   62.49
Unsupervised
  FT                           4.06   24.09   62.55
  FT+CF                        4.02   24.35   62.63
Pretrained-only
  Zero-Shot_S1S′2              1.74   21.41   59.31
  Zero-Shot_S1S′2-Ranked       2.26   25.81   60.07
  DELOREAN                    21.35   40.73   63.36
Human                         64.93   67.64   61.87

Table 4: Automatic evaluation results of counterfactual story rewriting, on the test set of TIMETRAVEL.

Coherence - Human Judges Preferred
Our model        Neutral        Comparator
DELOREAN   25%     58%     17%  Sup+Disc
DELOREAN   23%     70%      7%  Recon+CF
DELOREAN   22%     48%     30%  FT
DELOREAN   18%     60%     22%  Zero-Shot_S1S′2
DELOREAN   27%     42%     31%  Zero-Shot_S1S′2-Ranked
DELOREAN   10%     29%     61%  Human

Min-Edits - Human Judges Preferred
Our model        Neutral        Comparator
DELOREAN    4%     17%     79%  Sup+Disc
DELOREAN    1%     14%     85%  Recon+CF
DELOREAN   21%     76%      3%  FT
DELOREAN   28%     71%      1%  Zero-Shot_S1S′2
DELOREAN   37%     56%      7%  Zero-Shot_S1S′2-Ranked
M+Sup       8%     22%     70%  Human

Table 5: Human pairwise comparison results on the counterfactual task, between our best model and each baseline, with respect to coherence and min-edits.

Again, the ranking component contributes to increasing performance for the zero-shot baselines. Finally, the large performance gap between the methods and human-written explanations stresses the difficulty of this reasoning task and warrants future research.

Qualitative Analysis  Figure 3 presents two example outputs produced by DELOREAN. We can see our approach generates reasonable hypotheses by taking into account both the past and future contexts. For instance, in the first example, the future observation (O2) "car was totaled" indicates that Ray had a car accident, which is correctly captured in the generated hypothesis "car is hit by a car".

Figure 4: Human calibration results for counterfactual generation in terms of the weighted harmonic mean of coherence and min-edit, H_β = ((1 + β²) · coherence · min_edit) / (β² · coherence + min_edit), as a function of the scaling factor β. Low β values assign more weight to coherence, and high β values place more emphasis on min-edit.
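For clarity, the weighted harmonic mean from the caption above can be computed with the small helper below (our own restatement of the formula, not code from the paper).

```python
def h_beta(coherence: float, min_edit: float, beta: float) -> float:
    # Weighted harmonic mean of Figure 4: small beta favors coherence, large beta favors min-edit.
    return (1 + beta ** 2) * coherence * min_edit / (beta ** 2 * coherence + min_edit)
```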

5 Task 2: Counterfactual Reasoning

Given an original story ending Z of story context X_ori, and a counterfactual condition X that changes X_ori to invalidate Z (see Fig. 1), the task is to generate a new story ending Y that minimally edits the original ending Z to regain coherence with the counterfactual condition X (Qin et al., 2019a).

    5.1 Task Setup

Constraints  The constraint we enforce is that Y is close to Z (i.e., minimal edits). We impose this constraint by minimizing their KL divergence:

    L(X, Ỹ, Z) := KL(Z ‖ softmax(Ỹ/τ)),    (7)

where, with a slight abuse of notation, Z is the one-hot distribution of the tokens in the original ending. That is, we encourage the generated logits to recover the original ending.
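A sketch of this constraint as a drop-in for the `task_loss` argument of the decoding sketch in Section 3.1 is shown below. Since Z is one-hot, the KL term reduces to the negative log-probability the generation assigns to each original-ending token; the sketch assumes the generation and the original ending are tokenized to the same length N, and is our reading of Eq. (7) rather than the released code.

```python
import torch.nn.functional as F

def counterfactual_loss(x_ids, y_logits, z_ids, tau=1.0):
    # Eq. (7): KL(Z || softmax(Y~/tau)) with Z the one-hot original ending.
    # x_ids is unused here but kept for the shared task_loss signature.
    log_probs = F.log_softmax(y_logits / tau, dim=-1)            # (1, N, V)
    # One-hot target => KL reduces to negative log-likelihood of the original tokens.
    return -log_probs.gather(-1, z_ids.unsqueeze(-1)).sum()
```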

Ranking  We rank the candidates based on both their coherence with the context and the internal coherence between the multiple sentences of each candidate (the rewritten ending consists of 3 sentences). More concretely, given a candidate Y, we compute the aggregated coherence score:

    ranking_score(Y) = c(X, Y) + Σ_{s=1}^{S−1} c(Y[s], Y[s+1]),    (8)

where each candidate has S sentences (here, S = 3) and Y[s] denotes the s-th sentence.
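Reusing the `coherence` helper from the Section 3.2 sketch, Eq. (8) could be computed as below (a sketch under the same assumptions).

```python
def counterfactual_ranking_score(x: str, y_sentences: list) -> float:
    # Eq. (8): coherence of Y with the context X plus adjacent-sentence coherence within Y.
    score = coherence(x, " ".join(y_sentences))
    for s in range(len(y_sentences) - 1):
        score += coherence(y_sentences[s], y_sentences[s + 1])
    return score
```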


Figure 5: Examples of generated story endings on three counterfactual reasoning cases. Given a story context, a counterfactual condition, and an original ending, DELOREAN generates a rewritten ending which is coherent with the counterfactual condition and is similar to the original ending.

Hyperparameters  We largely follow the same setting as in the abductive reasoning task, but tune hyperparameters on the TIMETRAVEL development set. Deviations from these settings are outlined in Appendix A.2.

    5.2 Experimental Setup

Baselines  We compare our method with baselines from Qin et al. (2019a). The zero-shot baseline uses the pre-trained GPT-2 model to generate Y as a continuation to the counterfactual condition X. It is the most apt comparison to our method, which also doesn't require additional supervision. We also experiment with two baselines that fine-tune GPT-2 on the original story X_ori Z to fit the model to the story domain, either with an LM objective (FT) or a tailored conditional objective that encourages minimal edits of Z (Recon+CF).6 Finally, we report the performance of a supervised baseline (Sup), in which GPT-2 is fine-tuned to produce the gold Y from X_ori Z and X.

    5.3 Results

Automatic Evaluation  Following Qin et al. (2019a), we report BERTSCORE (Zhang et al., 2019), which was shown to best correlate with human judges' notion of counterfactual coherence, and BLEU-4 and ROUGE-L, which better measure minimum-edits. We find that the discriminative baselines achieve the highest degree of plot fidelity. Meanwhile, DELOREAN achieves the highest BERTSCORE for counterfactual coherence.

6 See Qin et al. (2019a) for more details.

Human Evaluation  We repeat the human evaluation setup from Section 4.3. Presented with the original story, the counterfactual condition X, and the generated ending Y, workers were asked to judge (1) the coherence of Y with respect to X, and (2) to what extent the generated ending minimally edits the original ending.7 In order to judge both criteria, we report the weighted harmonic mean H_β of these scores across a range of weights β (Figure 4).

Our results show that DELOREAN is the only model that maintains a consistent balance between coherence (1.66) and minimal edits (1.54). While the ranking-augmented zero-shot model produces the most coherent endings (coherence = 1.8), it deviates from the original ending. As β is increased (i.e., increasing importance of minimal edits), its weighted performance drops considerably, indicating it cannot generate new endings that follow the original plot of the story (min-edit = 1.25). Conversely, Recon+CF generates stories that are faithful to the original endings, but are far less coherent with the counterfactual condition (coherence = 1.23). Through human annotation, we found that Recon+CF copies the original ending word-for-word in 84% of cases.

The pairwise comparison results in Table 5 parallel these observations. DELOREAN significantly outperforms the discriminative approaches (Recon+CF and Sup+Disc) in coherence, while falling short of the Zero-shot re-ranked baselines. In minimal edits, this pattern is flipped, with our approach outperforming Zero-shot baselines considerably and losing to the discriminative baselines.

7 Fair inter-rater agreement with Fleiss' κ = 0.34.

Qualitative Analysis  Figure 5 provides two example results for counterfactual story rewriting by DELOREAN. The approach successfully captures the causal relations between events and properly rewrites the endings with minimal edits. For instance, in the first example, given the counterfactual condition that "Tara ordered a shirt online" (as opposed to the original "went to mall"), the rewritten ending is about the shirt being "sent" to Tara (as opposed to the original "browsed from stores"). The last sentence of the original ending, "She looked forward to wearing it," is correctly preserved, as it is coherent with the counterfactual condition.

    6 Related Work

Unsupervised text generation.  Unsupervised approaches are often applied to problems that copy information from a source text into decoded text. Unsupervised paraphrasing requires repeating this information (Miao et al., 2019; Bao et al., 2019), as does translation, but with a bilingual transformation (Artetxe et al., 2017; Lample et al., 2018). In summarization there is an additional task to select a subset of the original text (Baziotis et al., 2019; Schumann et al., 2020; West et al., 2019). In cases where information is mostly copied from the original, auto-encoding objectives can ensure the correct information is captured (Bao et al., 2019; Baziotis et al., 2019; Artetxe et al., 2017). This work tackles problems where generation is more open-ended. Rather than reproducing information from the prompt, generations should agree with and expand on it, making autoencoding less applicable.

Controllable language generation.  Earlier approaches for controllable generation involved preserving the content of text while changing it along discrete dimensions, such as theme, sentiment, or style (Koncel-Kedziorski et al., 2016; Hu et al., 2017; Ficler and Goldberg, 2017; Shen et al., 2017; Lample et al., 2019). Recent works such as Grover (Zellers et al., 2019) and the CTRL model (Keskar et al., 2019) used these ideas to augment transformer language models that can condition on structured metadata such as source, domain, etc. The Plug & Play model (PPLM; Dathathri et al., 2019) controls topic and sentiment in an approach similar to ours that involves forward and backward passes to update token distributions. However, PPLM relies on trained attribute discriminators for supervision, while our method is unsupervised. While these models are restricted to specific dimensions, often with pre-defined values, our model can adjust to any open-ended textual constraint. Perhaps the most similar work in that respect is "text infilling" models, which, however, operate in a narrower setting, filling only a relatively short text span (Devlin et al., 2018; Zhu et al., 2019; Donahue et al., 2020), and are more restrictive due to the reliance on an extra right-to-left language model (Sun et al., 2017) or a pre-specified generation length (Zeldes et al., 2020, which is not publicly available).

Reasoning about narratives.  A prominent resource from recent years is the ROCStories corpus (Mostafazadeh et al., 2016b), consisting of 98K crowdsourced 5-sentence everyday life stories. It was used for the story cloze task, whose goal was to predict the story ending from its first 4 sentences, but gained popularity and became the basis of additional benchmarks (Rashkin et al., 2018). Additional related work includes "script knowledge", i.e., learning about prototypical series of events (Schank and Abelson, 1977; Chambers and Jurafsky, 2008; Pichotta and Mooney, 2014), temporal commonsense (Granroth-Wilding and Clark, 2016; Li et al., 2018), and modeling pre- and post-conditions of events (Roemmele et al., 2011; Sap et al., 2019; Bosselut et al., 2019). Qin et al. (2019b) studied conversation modeling that reads and connects the dots of events in related documents. Finally, a recent line of work explores counterfactual questions in reading comprehension (Huang et al., 2019; Tandon et al., 2019), but instantiates the problem of counterfactual reasoning as a multiple-choice task.

    7 Conclusion

We presented DELOREAN, an unsupervised LM-based approach to generate text conditioned on past context as well as future constraints, through forward and backward passes considering each condition. We demonstrated its effectiveness for abductive and counterfactual reasoning, on which it performed substantially better than unsupervised baselines. Our method is general and can be easily adapted for other generative reasoning tasks.

  • 803

    Acknowledgements

We thank the anonymous reviewers and colleagues at UW NLP and AI2 for many helpful comments. This research was supported in part by DARPA CwC through ARO (W911NF15-1-0543), DARPA MCS program through NIWC Pacific (N66001-19-2-4031), and the Allen Institute for AI.

References

Henning Andersen. 1973. Abductive and deductive change. Language, pages 765–793.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041.

Yu Bao, Hao Zhou, Shujian Huang, Lei Li, Lili Mou, Olga Vechtomova, Xinyu Dai, and Jiajun Chen. 2019. Generating sentences from disentangled syntactic and semantic spaces. arXiv preprint arXiv:1907.05789.

Christos Baziotis, Ion Androutsopoulos, Ioannis Konstas, and Alexandros Potamianos. 2019. Seq3: Differentiable sequence-to-sequence-to-sequence autoencoder for unsupervised abstractive sentence compression. In NAACL-HLT.

Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Wen-tau Yih, and Yejin Choi. 2019. Abductive commonsense reasoning. In International Conference on Learning Representations.

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4762–4779, Florence, Italy. Association for Computational Linguistics.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of ACL-08: HLT, pages 789–797.

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2019. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

C. Donahue, M. Lee, and P. Liang. 2020. Enabling language models to fill in the blanks. In ACL.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In ACL.

Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. In Proceedings of the Workshop on Stylistic Variation, pages 94–104.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.

Matthew L. Ginsberg. 1986. Counterfactuals. Artificial Intelligence, 30(1):35–79.

Nelson Goodman. 1947. The problem of counterfactual conditionals. The Journal of Philosophy, 44(5):113–128.

Mark Granroth-Wilding and Stephen Clark. 2016. What happens next? Event prediction using a compositional neural network model. In Thirtieth AAAI Conference on Artificial Intelligence.

Jerry R. Hobbs, Mark E. Stickel, Douglas E. Appelt, and Paul Martin. 1993. Interpretation as abduction. Artificial Intelligence, 63(1-2):69–142.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In ICML.

Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2391–2401.

Steve D. Isard. 1974. What would you have done if...? Theoretical Linguistics, 1(1-3):233–256.

Philip Nicholas Johnson-Laird. 2006. How We Reason. Oxford University Press, USA.

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.

Rik Koncel-Kedziorski, Ioannis Konstas, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2016. A theme-rewriting approach for generating algebra word problems. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1617–1628.

Sarit Kraus, Daniel Lehmann, and Menachem Magidor. 1990. Nonmonotonic reasoning, preferential models and cumulative logics. Artificial Intelligence, 44(1-2):167–207.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Phrase-based & neural unsupervised machine translation. arXiv preprint arXiv:1804.07755.

Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc'Aurelio Ranzato, and Y-Lan Boureau. 2019. Multiple-attribute text rewriting. In ICLR.

Carolin Lawrence and Stefan Riezler. 2018. Improving a neural semantic parser by counterfactual learning from human bandit feedback. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1820–1830.

Carolin Lawrence, Artem Sokolov, and Stefan Riezler. 2017. Counterfactual learning from bandit feedback under deterministic logging: A case study in statistical machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2566–2576.

Zhongyang Li, Xiao Ding, and Ting Liu. 2018. Constructing narrative event evolutionary graph for script event prediction. In IJCAI.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.

Hugo Mercier and Dan Sperber. 2017. The Enigma of Reason. Harvard University Press.

Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and Lei Li. 2019. CGMH: Constrained sentence generation by Metropolis-Hastings sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6834–6842.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016a. A corpus and evaluation framework for deeper understanding of commonsense stories. arXiv preprint arXiv:1604.01696.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James F. Allen. 2016b. A corpus and cloze evaluation for deeper understanding of commonsense stories. In HLT-NAACL.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL, pages 311–318.

Judea Pearl and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. Basic Books.

Charles Sanders Peirce. 1960. Collected Papers of Charles Sanders Peirce, volume 2. Harvard University Press.

Karl Pichotta and Raymond Mooney. 2014. Statistical script learning with multi-argument events. In EACL, pages 220–229.

Lianhui Qin, Antoine Bosselut, Ari Holtzman, Chandra Bhagavatula, Elizabeth Clark, and Yejin Choi. 2019a. Counterfactual story reasoning and generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5046–5056.

Lianhui Qin, Michel Galley, Chris Brockett, Xiaodong Liu, Xiang Gao, Bill Dolan, Yejin Choi, and Jianfeng Gao. 2019b. Conversing by reading: Contentful neural conversation with on-demand machine reading. In ACL, pages 5427–5436.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019a. Language models are unsupervised multitask learners.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019b. Language models are unsupervised multitask learners. OpenAI Blog, 1:8.

Hannah Rashkin, Antoine Bosselut, Maarten Sap, Kevin Knight, and Yejin Choi. 2018. Modeling naive psychology of characters in simple commonsense stories. arXiv preprint arXiv:1805.06533.

Raymond Reiter. 1988. Nonmonotonic reasoning. In Exploring Artificial Intelligence, pages 439–481. Elsevier.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.

Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. 2019. ATOMIC: An atlas of machine commonsense for if-then reasoning. In AAAI.

Roger C. Schank and Robert P. Abelson. 1977. Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures.

Raphael Schumann, Lili Mou, Yao Lu, Olga Vechtomova, and Katja Markert. 2020. Discrete optimization for unsupervised sentence summarization with word-level extraction. arXiv preprint arXiv:2005.01791.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pages 6830–6841.

Youngseo Son, Anneke Buffone, Joe Raso, Allegra Larche, Anthony Janocko, Kevin Zembroski, H. Andrew Schwartz, and Lyle Ungar. 2017. Recognizing counterfactual thinking in social media texts. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 654–658.

William Starr. 2019. Counterfactuals. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy, fall 2019 edition. Metaphysics Research Lab, Stanford University.

Qing Sun, Stefan Lee, and Dhruv Batra. 2017. Bidirectional beam search: Forward-backward inference in neural sequence models for fill-in-the-blank image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6961–6969.

Niket Tandon, Bhavana Dalvi Mishra, Keisuke Sakaguchi, Antoine Bosselut, and Peter Clark. 2019. WIQA: A dataset for "what if..." reasoning over procedural text. In EMNLP.

Peter West, Ari Holtzman, Jan Buys, and Yejin Choi. 2019. BottleSum: Unsupervised and self-supervised sentence summarization using the information bottleneck principle. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3743–3752.

Yoel Zeldes, Dan Padnos, and Barak Peleg. 2020. HAIM-1.5 - the next generation. https://www.ai21.com/haim-1point5.

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. In Advances in Neural Information Processing Systems, pages 9051–9062.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. CoRR, abs/1904.09675.

Wanrong Zhu, Zhiting Hu, and Eric Xing. 2019. Text infilling. arXiv preprint arXiv:1901.00158.