Event2Mind: Commonsense Inference on Events, Intents, and Reactions (Rashkin, Sap, Allaway, Smith, and Choi, ACL 2018). Slides by Naoya Inoue, Tohoku University / RIKEN AIP.


Page 1

Naoya Inoue, Tohoku University
RIKEN AIP

Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 1–11, Melbourne, Australia, July 15–20, 2018. ©2018 Association for Computational Linguistics


Event2Mind: Commonsense Inference on Events, Intents, and Reactions

Hannah Rashkin†*, Maarten Sap†*, Emily Allaway†, Noah A. Smith†, Yejin Choi†‡

†Paul G. Allen School of Computer Science & Engineering, University of Washington
‡Allen Institute for Artificial Intelligence

{hrashkin,msap,eallaway,nasmith,yejin}@cs.washington.edu

Abstract

We investigate a new commonsense inference task: given an event described in a short free-form text ("X drinks coffee in the morning"), a system reasons about the likely intents ("X wants to stay awake") and reactions ("X feels alert") of the event's participants. To support this study, we construct a new crowdsourced corpus of 25,000 event phrases covering a diverse range of everyday events and situations. We report baseline performance on this task, demonstrating that neural encoder-decoder models can successfully compose embedding representations of previously unseen events and reason about the likely intents and reactions of the event participants. In addition, we demonstrate how commonsense inference on people's intents and reactions can help unveil the implicit gender inequality prevalent in modern movie scripts.

1 Introduction

Understanding a narrative requires commonsense reasoning about the mental states of people in relation to events. For example, if "Alex is dragging his feet at work", pragmatic implications about Alex's intent are that "Alex wants to avoid doing things" (Figure 1). We can also infer that Alex's emotional reaction might be feeling "lazy" or "bored". Furthermore, while not explicitly mentioned, we can infer that people other than Alex are affected by the situation, and these people are likely to feel "frustrated" or "impatient".

This type of pragmatic inference can potentially be useful for a wide range of NLP applications

*These two authors contributed equally.

Figure 1: Examples of commonsense inference on mental states of event participants. In the third example event, common sense tells us that Y is likely to feel betrayed as a result of X reading their diary.

that require accurate anticipation of people's intents and emotional reactions, even when they are not explicitly mentioned. For example, an ideal dialogue system should react in empathetic ways by reasoning about the human user's mental state based on the events the user has experienced, without the user explicitly stating how they are feeling. Similarly, advertisement systems on social media should be able to reason about the emotional reactions of people after events such as mass shootings and remove ads for guns which might increase social distress (Goel and Isaac, 2016). Also, pragmatic inference is a necessary step toward automatic narrative understanding and generation (Tomai and Forbus, 2010; Ding and Riloff, 2016; Ding et al., 2017). However, this type of social commonsense reasoning goes far beyond the widely studied entailment tasks (Bowman et al., 2015; Dagan et al., 2006) and thus falls outside the scope of existing benchmarks.

In this paper, we introduce a new task, corpus,

(ACL 2018)

Some figures are cited from the original paper and poster.

Page 2

Goal
• Commonsense inference
  – Focuses on stereotypical intents and reactions
• Applications
  – More intelligent dialogue systems, advertisement systems
  – Narrative understanding and generation


Figure 1: Examples of commonsense inference on mental states of event participants. In the third example event, common sense tells us that Y is likely to feel betrayed as a result of X reading their diary. For each event, the figure shows X's intent, X's reaction, and Y's reaction:
• PersonX drags PersonX's feet – X's intent: to avoid doing things; X's reaction: lazy, bored; Y's reaction: frustrated, impatient
• PersonX cooks Thanksgiving dinner – X's intent: to impress their family; X's reaction: tired, a sense of belonging; Y's reaction: impressed


Page 3

Contributions
1. A new corpus of stereotypical intents and reactions over 25,000 everyday events and situations, with a task formulation that aims to generate textual descriptions of intents and reactions (cf. sentiment analysis, textual entailment)
2. Baseline performance established with neural encoder-decoder models, plus some modeling tips
3. A use case of the commonsense inference model for movie-script analysis (if time allows)

Page 4

Part 1/2: Corpus construction
• Step 1: Collect event phrases from several corpora
• Step 2: Annotate intents and reactions via crowdsourcing

Page 5

Event phrase extraction
• Extract "everyday" events and situations from:
  – ROC Story training set (Mostafazadeh et al., 2016)
  – Google Syntactic N-grams (Goldberg and Orwant, 2013)
  – Spinn3r corpus (Gordon and Swanson, 2008)
  – Wiktionary entries (idioms only)
• Procedure (a code sketch follows below):
  1. Extract verb phrases from a syntactic parse
  2. Replace entities and pronouns with typed variables
  3. Mask infrequently co-occurring arguments with a blank
     – E.g., "John eats samosa" => "PersonX eats ___"
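To make the procedure concrete, here is a hypothetical Python sketch using spaCy; the paper itself derives events from Stanford syntactic parses (Klein and Manning, 2003) and uses corpus-level frequency thresholds, so the entity detection and the frequency check below are illustrative stand-ins:

```python
# Hypothetical sketch of steps 2-3; not the authors' extraction code.
import spacy

nlp = spacy.load("en_core_web_sm")

def normalize_event(sentence, frequent_args=()):
    """Replace person mentions with typed variables (PersonX, PersonY, ...)
    and mask infrequently co-occurring object arguments with a blank."""
    doc = nlp(sentence)
    person_vars = {}  # surface form -> typed variable
    out = []
    for tok in doc:
        if tok.ent_type_ == "PERSON" or tok.pos_ == "PRON":
            var = person_vars.setdefault(
                tok.text, "Person" + chr(ord("X") + len(person_vars)))
            out.append(var)
        elif tok.dep_ == "dobj" and tok.text.lower() not in frequent_args:
            out.append("___")  # argument seen too rarely with this predicate
        else:
            out.append(tok.text)
    return " ".join(out)

print(normalize_event("John eats samosa"))  # -> "PersonX eats ___"
```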

Page 6

Annotating with intents and reactions
• Via Amazon Mechanical Turk (3 workers per event)

(Reconstructed annotation form:)
Event: PersonX punches PersonY's lights out
1. Does this event make sense enough for you to answer questions 2-5? (Or does it have too many meanings?)
   [ Yes, can answer ] [ No, can't answer or has too many meanings ]
Before the event:
2. Does PersonX willingly cause this event? [ Yes ] [ No ]
   a) Why? (Try to describe without reusing words from the event)
      Because PersonX wants to (be) ... [write a reason] [write another reason - optional] [write another reason - optional]

Figure 2: Intent portion of our annotation task. We allow annotators to label events as invalid if the phrase is unintelligible. The full annotation setup is shown in Figure 8 in the appendix.

(e.g., given the event description PersonX yells at the classroom, we can infer that other people such as "students" in the classroom may be affected by the act of PersonX). For quality control, we periodically removed workers with high disagreement rates, at our discretion.

Coreference among Person variables: With the typed Person variable setup, events involving multiple people can have multiple meanings depending on the coreference interpretation (e.g., PersonX eats PersonY's lunch has very different mental state implications from PersonX eats PersonX's lunch). To prune the set of events that will be annotated for intent and reaction, we ran a preliminary annotation to filter out candidate events that have implausible coreferences. In this preliminary task, annotators were shown a combinatorial list of coreferences for an event (e.g., PersonX punches PersonX's lights out, PersonX punches PersonY's lights out) and were asked to select only the plausible ones (e.g., PersonX punches PersonY's lights out). Each set of coreferences was annotated by 3 workers, yielding an overall agreement of κ = 0.4. This annotation excluded 8,406 events with implausible coreference from our set (out of 17,806 events).

2.3 Mental State Descriptions

Our dataset contains nearly 25,000 event phrases, with annotators rating 91% of our extracted events as "valid" (i.e., the event makes sense). Of those events, annotations for the multiple choice portions of the task (whether or not there exists intent/reaction) agree moderately, with an average Cohen's κ = 0.45 (Table 2). The individual κ scores generally indicate that turkers disagree half as often as if they were randomly selecting answers.

Importantly, this level of agreement is acceptable in our task formulation for two reasons. First, unlike linguistic annotations on syntax or semantics where experts in the corresponding theory would generally agree on a single correct label, pragmatic interpretations may better be defined as distributions over multiple correct labels (e.g., after PersonX takes a test, PersonX might feel relieved and/or stressed; de Marneffe et al., 2012). Second, because we formulate our task as a conditional language modeling problem, where a distribution over the textual descriptions of intents and reactions is conditioned on the event description, this variation in the labels is only as expected.

A majority of our events are annotated as willingly caused by the agent (86%, Cohen's κ = 0.48), and 26% involve other people (κ = 0.41). Most event patterns in our data are fully instantiated, with only 22% containing blanks (___). In our corpus, the intent annotations are slightly longer (3.4 words on average) than the reaction annotations (1.5 words).

3 Models

Given an event phrase, our models aim to generate three entity-specific pragmatic inferences: PersonX's intent, PersonX's reaction, and others' reactions. The general outline of our model architecture is illustrated in Figure 3.

The input to our model is an event pattern described through free-form text with typed variables such as PersonX gives PersonY ___ as a gift. For notation purposes, we describe each event pattern E as a sequence of word embeddings ⟨e1, e2, ..., en⟩ ∈ R^(n×D). This input is encoded as a vector hE ∈ R^H that will be used for predicting output. The output of the model is its hypotheses about PersonX's intent, PersonX's reaction, and others' reactions (vi, vx, and vo, respectively). We experiment with representing the

×3 (for intent, X's reaction, Y's reaction)

Page 7

Results (1/2)
• 24,716 unique events extracted
• 91% of extracted events annotated as "makes sense" (Cohen's κ = 0.45)
  – 22% of events contain blank (___) slots
• 86% of events annotated as "willingly caused" (Cohen's κ = 0.48)
• 57,094 intent/reaction annotations
  – Intents: 3.4 words on average
  – Reactions: 1.5 words on average

Page 8

Results (2/2)

Page 9

Part 2/2: Modeling
• In / Out: event phrase / intents and reactions
• Overall architecture: sequence-to-sequence model

Figure 3: Overview of the model architecture. From an encoded event, our model predicts intents and reactions in a multitask setting. In the diagram, the Event2Mind encoder maps the event E = e1...en ("PersonX punches PersonY's lights out") to hE = f(e1...en); three decoders then produce the pre-condition and post-conditions via softmax(Wi hE + bi), softmax(Wx hE + bx), and softmax(Wo hE + bo): PersonX's intent (vi: start, a, fight), PersonX's reaction (vx: powerful), and others' reaction (vo: defensive).

output in two decoding set-ups: three vectors interpretable as discrete distributions over words and phrases (n-gram reranking) or three sequences of words (sequence decoding).

Encoding events: The input event phrase E is compressed into an H-dimensional embedding hE via an encoding function f : R^(n×D) → R^H:

hE = f(e1, ..., en)

We experiment with several ways of defining f, inspired by standard techniques in sentence and phrase classification (Kim, 2014). First, we experiment with max-pooling and mean-pooling over the word vectors {e_i}, i = 1..n. We also consider a convolutional neural network (ConvNet; LeCun et al., 1998), taking the last layer of the network as the encoded version of the event. Lastly, we encode the event phrase with a bi-directional RNN (specifically, a GRU; Cho et al., 2014), concatenating the final hidden states of the forward and backward cells as the encoding: hE = [→h_n; ←h_1]. For hyperparameters and other details, we refer the reader to Appendix B.

Though the event sequences are typically rather short (4.6 tokens on average), our model still benefits from the ConvNet and BiRNN's ability to compose words.

Pragmatic inference decoding: We use three decoding modules that take the event phrase embedding hE and output distributions over possible PersonX's intents (vi), PersonX's reactions (vx), and others' reactions (vo). We experiment with two different decoder set-ups.

First, we experiment with n-gram re-ranking, considering the |V| most frequent {1, 2, 3}-grams in our annotations. Each decoder projects the event phrase embedding hE into a |V|-dimensional vector, which is then passed through a softmax function. For instance, the distribution over descriptions of PersonX's intent is given by:

vi = softmax(Wi hE + bi)

Second, we experiment with sequence generation, using RNN decoders to generate the textual description. The event phrase embedding hE is set as the initial state h_dec of three decoder RNNs (using GRU cells), which then output the intents/reactions one word at a time (using beam search at test time). For example, an event's intent sequence (vi = vi^(0) vi^(1) ...) is computed as follows:

vi^(t+1) = softmax(Wi RNN(vi^(t), h_{i,dec}^(t)) + bi)

Training objective: We minimize the cross-entropy between the predicted distribution over words and phrases and the one actually observed in our dataset. Further, we employ multi-task learning, simultaneously minimizing the loss for all three decoders at each iteration.

Training details: We fix our input embeddings, using 300-dimensional skip-gram word embeddings trained on Google News (Mikolov et al., 2013). For decoding, we consider a vocabulary of size |V| = 14,034 in the n-gram re-ranking setup. For the sequence decoding setup, we only consider the unigrams in V, yielding an output space of 7,110 at each time step.

We randomly divided our set of 24,716 unique events (57,094 annotations) into a training/dev./test set using an 80/10/10% split. Some annotations have multiple responses (i.e., a crowdworker gave multiple possible intents and reactions), in which case we take each of the combinations of their responses as a separate training example.

4 Empirical Results

Table 3 summarizes the performance of different encoding models on the dev and test sets in terms of cross-entropy and recall at 10 predicted intents and reactions. As expected, we see a moderate improvement in recall and cross-entropy when using the more compositional encoder models (ConvNet and BiRNN; both n-gram and sequence decoding setups).

Page 10

Encoder (hE)
1. Max-pooling or mean-pooling over the word vectors
2. A convolutional neural network (LeCun et al., 1998)
3. A bi-directional GRU (Cho et al., 2014): the final hidden states of the forward and backward cells are concatenated (a sketch follows below)
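A minimal PyTorch sketch of encoder option 3, the bidirectional GRU; this is my own illustration, not the authors' code, and the 100-dimensional hidden size is taken from the "BiRNN 100d" configuration in Table 3:

```python
import torch
import torch.nn as nn

class EventEncoder(nn.Module):
    """BiGRU encoder: hE = [final forward state; final backward state]."""
    def __init__(self, emb_dim=300, hidden_dim=100):
        super().__init__()
        self.gru = nn.GRU(emb_dim, hidden_dim,
                          bidirectional=True, batch_first=True)

    def forward(self, embedded_event):
        # embedded_event: (batch, n_tokens, emb_dim), frozen skip-gram vectors
        _, h_n = self.gru(embedded_event)            # h_n: (2, batch, hidden_dim)
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 2 * hidden_dim)

# e.g., a 5-token event such as "PersonX punches PersonY's lights out":
h_E = EventEncoder()(torch.randn(1, 5, 300))
print(h_E.shape)  # torch.Size([1, 200])
```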

Page 11

Decoder
1. N-gram-based ranker
   – Extract the |V| most frequent {1, 2, 3}-grams
   – Project hE onto a |V|-dimensional vector:
     v_* = softmax(W_* hE + b_*), where * ∈ {i, x, o}
     (i: X's intent, x: X's reaction, o: others' reaction)
2. Sequence generator
   – Generate one word at a time
   – Use GRUs, where the initial hidden state h_{*,dec}^(0) is hE
   – v_*^(t+1) = softmax(W_* RNN(v_*^(t), h_{*,dec}^(t)) + b_*)

A sketch of both decoders follows below.
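Both decoder heads can be sketched in a few lines of PyTorch. The class names and the teacher-forcing details are assumptions of mine; the vocabulary sizes come from the paper's setup (|V| = 14,034 n-grams for the ranker, 7,110 unigrams for the generator):

```python
import torch
import torch.nn as nn

class NGramReranker(nn.Module):
    """n-gram re-ranking head: v = softmax(W hE + b) over the |V| most
    frequent {1, 2, 3}-grams."""
    def __init__(self, enc_dim=200, vocab_size=14034):
        super().__init__()
        self.proj = nn.Linear(enc_dim, vocab_size)

    def forward(self, h_E):
        return torch.softmax(self.proj(h_E), dim=-1)

class SeqDecoder(nn.Module):
    """Sequence generator: a GRU initialized with hE emits one word at a
    time, v^(t+1) = softmax(W RNN(v^(t), h^(t)) + b)."""
    def __init__(self, enc_dim=200, emb_dim=300, vocab_size=7110):
        super().__init__()
        self.cell = nn.GRUCell(emb_dim, enc_dim)
        self.out = nn.Linear(enc_dim, vocab_size)

    def forward(self, h_E, gold_embs):
        # gold_embs: (batch, T, emb_dim), teacher-forced previous words
        h, logits = h_E, []
        for t in range(gold_embs.size(1)):
            h = self.cell(gold_embs[:, t], h)   # one GRU step
            logits.append(self.out(h))          # raw scores; softmax in the loss
        return torch.stack(logits, dim=1)       # (batch, T, vocab_size)
```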

Page 12

Training
• Multi-task training: joint training over the three decoders (a sketch follows below)
• Multiple responses are treated as separate training examples
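A minimal sketch of one multi-task training step, assuming the encoder and three SeqDecoder heads from the sketches above; the batch keys are hypothetical names:

```python
import torch.nn.functional as F

def multitask_step(encoder, decoders, batch, optimizer):
    """decoders: dict with keys "intent", "xreact", "oreact", one SeqDecoder
    each; the batch holds teacher-forced input embeddings and gold word ids."""
    h_E = encoder(batch["event_embs"])
    loss = 0.0
    for name, dec in decoders.items():
        logits = dec(h_E, batch[name + "_in_embs"])  # (B, T, V)
        # Sum one cross-entropy term per decoder; all three are minimized jointly.
        loss = loss + F.cross_entropy(
            logits.flatten(0, 1), batch[name + "_gold_ids"].flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```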


(Slide annotation: the three decoder outputs are supervised against the gold-standard intent, gold-standard reaction, and gold-standard others' reaction.)

Page 13

Evaluation: setup
• Word embeddings: 300-dimensional skip-gram embeddings trained on Google News (Mikolov et al., 2013), fixed during training
• Decoder setup
  – N-gram-based ranker: |V| = 14,034
  – Sequence generator: vocabulary of 7,110 unigrams
• Held-out evaluation: 24,716 unique events (57,094 annotations) divided into training/dev./test sets with an 80/10/10% split

Page 14

Results (1/2)
• N-gram < sequence; pooling < ConvNet < RNN
• The major source of improvement is intent prediction
• Performance degrades when idioms are input


                              Development                       Test
Encoding     Decoder     X-Ent   Intent  XReact  OReact    X-Ent   Intent  XReact  OReact
max-pool     n-gram      5.75    31      35      68        5.14    31      37      67
mean-pool    n-gram      4.82    35      39      69        4.94    34      40      68
ConvNet      n-gram      4.85    36      42      69        4.81    37      44      69
BiRNN 300d   n-gram      4.78    36      42      68        4.74    36      43      69
BiRNN 100d   n-gram      4.76    36      41      68        4.73    37      43      68
mean-pool    sequence    4.59    39      36      67        4.54    40      38      66
ConvNet      sequence    4.44    42      39      68        4.40    43      40      67
BiRNN 100d   sequence    4.25    39      38      67        4.22    40      40      67
(X-Ent = average cross-entropy; Intent/XReact/OReact = recall @10 in %)

Table 3: Average cross-entropy (lower is better) and recall @10 (percentage of times the gold falls within the top 10 decoded; higher is better) on development and test sets for different modeling variations. We show recall values for PersonX's intent, PersonX's reaction, and others' reaction (denoted "Intent", "XReact", and "OReact"). Note that because of the two different decoding setups, cross-entropy between n-gram and sequence decoding is not directly comparable. A small sketch of the recall @10 computation follows below.
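A small sketch of the recall @10 metric defined in the caption, under the assumption that each example pairs one gold annotation with the model's 10-best decoded outputs:

```python
def recall_at_10(golds, top10s):
    """golds: list of gold description strings;
    top10s: parallel list of the model's 10-best decoded strings."""
    hits = sum(gold in set(top10) for gold, top10 in zip(golds, top10s))
    return 100.0 * hits / len(golds)

# Toy usage (3 candidates shown instead of 10, for brevity):
golds = ["to stay awake"]
top10s = [["to relax", "to stay awake", "to wake up"]]
print(recall_at_10(golds, top10s))  # 100.0
```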

Additionally, BiRNN models outperform ConvNets on cross-entropy in both decoding setups. Looking at the recall split across intent vs. reaction labels (the "Intent", "XReact", and "OReact" columns), we see that much of the improvement from using these two models is within the prediction of PersonX's intents. Note that recall for "OReact" is much higher, since a majority of events do not involve other people.

Human evaluation: To further assess the quality of our models, we randomly select 100 events from our test set and ask crowdworkers to rate generated intents and reactions. We present 5 workers with an event's top 10 most likely intents and reactions according to our model and ask them to select all those that make sense to them. We evaluate each model's precision @10 by computing the average number of generated responses that make sense to annotators.

Figure 4 summarizes the results of this evaluation. In most cases, the performance is higher for the sequential decoder than for the corresponding n-gram decoder. The biggest gain from using sequence decoders is in intent prediction, possibly because intent explanations are more likely to be longer. The BiRNN and ConvNet encoders consistently have higher precision than mean-pooling, with the BiRNN-seq setup slightly outperforming the other models. Unless otherwise specified, this is the model we employ in further sections.

Figure 4: Average precision @10 of each model's top ten responses in the human evaluation. We show results for various encoder functions (mean-pool, ConvNet, BiRNN-100d) combined with two decoding setups (n-gram re-ranking, sequence generation).

Error analyses: We test whether certain types of events are easier for predicting commonsense inference. In Figure 6, we show the difference in cross-entropy of the BiRNN 100d model on predicting different portions of the development set, including: Blank events (events containing non-instantiated arguments), 2+ People events (events containing multiple different Person variables), and Idiom events (events coming from the Wiktionary idiom list). Our results show that, while intent prediction performance remains sim-

Page 15

Results (2/2)
• Asked crowdworkers whether the top-10 predictions make sense (i.e., precision@10)
• N-gram < sequence; pooling < RNN ≈ ConvNet

Event2Mind: Commonsense Inference on Events, Intents, and Reactions
Hannah Rashkin*, Maarten Sap*, Emily Allaway, Noah A. Smith, and Yejin Choi
Paul G. Allen School for Computer Science & Engineering, University of Washington, Seattle, USA (*: equal contribution)

UWNLP – uwnlp.github.io/event2mind

Takeaways
In this work, we:
• introduce the new task of reasoning about and generating people's intents and reactions in relation to events
• create a new large knowledge graph of 77K nodes that supports this type of commonsense inference
• provide a novel way to analyze gender bias in movies using commonsense inference

Commonsense Inference
Pragmatic reasoning about the mental states of people in relation to events
• Numerous AI systems need to anticipate people's intents and emotional reactions (e.g., conversational AI, ad ranking, narrative understanding)
• This type of social commonsense reasoning goes far beyond the widely studied entailment tasks (e.g., SNLI).

Collecting Inference Data
Annotated dimensions:
– PersonX's Intent: Why does X cause the event?
– PersonX's Reaction: How does X feel after the event?
– Others' Reaction: How do others feel after the event?
• Events automatically extracted from ROC Stories, Google N-grams, Spinn3r, and Wiktionary idioms (Mostafazadeh et al., 2016; Goldberg & Orwant, 2013; Gordon & Swanson, 2008)
• Annotated by 3 Amazon MTurkers
• Moderate agreement κ = .45

Commonsense Knowledge Graph

(Figure: a graph in which event nodes such as "X punches Y's lights out", "X starts yelling", "X fakes being sick", and "X apologizes profusely" are linked via "because" edges to intent nodes, e.g., "X wants to hurt Y", "X wants to get even with Y", "X wants to express their anger", "X wants to get out of something", "X wants to stay home", "X wants to be forgiven", "X wants to make amends", and via "as a result" edges to reaction nodes, e.g., "Y feels hurt", "Y feels scared", "Y feels sad", "X feels guilty", "X feels angry", "X feels vindicated", "Y feels better".)

Knowledge graph statistics
# unique events      24,716
# unique intents     33,373
# unique reactions   19,083
total # nodes        77,172

Annotation statistics
• Intents are longer than reactions (3.4 vs. 1.5 words)
• 26% of events involve other participants
• 22% contain blanks

Evaluation of Generated Inference

(Figure: bar chart of precision @10, roughly 20–50%, over Intent / XReact / OReact for six systems: mean-pool, ConvNet, and BiRNN encoders, each with n-gram and seq decoders.)

• 5 MTurk raters evaluated precision@10 of generations
• Compositional encoders (ConvNet, BiRNN) > pooling
• Sequence generation is preferred over n-gram re-ranking

Neural Inference Model

(Figure: encoder-decoder diagram. An encoder f ∈ {biRNN, CNN, max-pool, mean-pool} maps the event E = {e1, ..., e5} ("PersonX punches PersonY's lights out") to hE = f(e1, ..., en); from this pre-condition/post-condition pivot, three seq2seq GRU decoders generate PersonX's intent vi = {start, a, fight}, PersonX's reaction vx = {powerful}, and others' reaction vo = {defensive}, each ending with <EOS>.)

• Encoder-decoder setup: given an event, generate intents and reactions in a multi-task way
• Two decoders: n-gram re-ranking and sequence generation

Gender Bias in Movies
• 21K characters from 772 modern English movies
• Parse narratives describing characters into events; generate intents & reactions (with the BiRNN-seq model)
• Extract LIWC features from the generated inferences
• Logistic regression to find gender correlates (a sketch follows below)
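This pipeline can be loosely sketched with scikit-learn; note that plain word counts stand in here for LIWC (a proprietary lexicon), and all names are hypothetical rather than the authors' code:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def gender_correlates(inference_texts, genders, top_k=5):
    """inference_texts: generated intent/reaction strings, one per character;
    genders: parallel 0/1 labels. Returns the strongest gender correlates."""
    vec = CountVectorizer(min_df=2)          # word counts as a LIWC stand-in
    X = vec.fit_transform(inference_texts)
    clf = LogisticRegression(max_iter=1000).fit(X, genders)
    feats = vec.get_feature_names_out()
    # Largest-magnitude coefficients indicate the strongest correlates.
    ranked = sorted(zip(clf.coef_[0], feats), key=lambda t: abs(t[0]), reverse=True)
    return ranked[:top_k]
```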

Page 16

Analysis of embedding space
• Q: Does the embedding space capture small differences between event phrases (e.g., modifying "goes" to "comes")?
• Take a handful of events and decode intents and reactions from interpolations between two event phrases (a sketch follows below)
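A hedged sketch of the interpolation probe, reusing the encoder and SeqDecoder sketches above; greedy decoding stands in for the paper's beam search, and the <start>/next-word embeddings are placeholders:

```python
import torch

@torch.no_grad()
def decode_interpolated(encoder, seq_decoder, embs_a, embs_b, alpha=0.5, steps=5):
    """Decode from a linear blend of two event encodings.
    embs_a / embs_b: (1, n_tokens, 300) embedded event phrases."""
    h = alpha * encoder(embs_a) + (1.0 - alpha) * encoder(embs_b)
    inp = torch.zeros(1, 300)  # placeholder <start> embedding
    word_ids = []
    for _ in range(steps):
        h = seq_decoder.cell(inp, h)                 # one GRU step
        next_id = seq_decoder.out(h).argmax(dim=-1)  # greedy word choice
        word_ids.append(int(next_id))
        inp = torch.zeros(1, 300)  # in practice: embedding lookup of next_id
    return word_ids
```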

Page 17

Bonus: Application to narrative analysis


(Figure: gender-bias examples from the poster.
• Women – Intents: to be helpful, to be felt/seen; Reactions: pro-social, sexual. Example: "Juno laughs and hugs her father, planting a smooch on his cheek." (Juno, 2007), with generated because-inferences such as "show affection", "show love", "loving", "none".
• Men – Intents: to accomplish, violence; Reactions: negative, achievement. Example: "Dignan punches Anthony in the face." (Bottle Rocket, 1996), with generated as-a-result inferences such as "hurt", "angry", "sad", "upset".)

Page 18

Contributions
1. A new corpus of stereotypical intents and reactions over 25,000 everyday events and situations, with a task formulation that aims to generate textual descriptions of intents and reactions (cf. sentiment analysis, textual entailment)
2. Baseline performance established with neural encoder-decoder models, plus some modeling tips
3. A use case of the commonsense inference model for movie-script analysis (if time allows)

Page 19

Backup slides

Page 20


into more explicit statements, helping to identify and explain gender bias that is prevalent in modern literature and media. Specifically, our results indicate that modern movies have the bias to portray female characters as having pro-social attitudes, whereas male characters are portrayed as being competitive or pro-achievement. This is consistent with gender stereotypes that have been studied in movies in both the NLP and psychology literature (Agarwal et al., 2015; Madaan et al., 2017; Prentice and Carranza, 2002; England et al., 2011).

6 Related Work

Prior work has sought formal frameworks for inferring roles and other attributes in relation to events (Baker et al., 1998; Das et al., 2014; Schuler et al., 2009; Hartshorne et al., 2013, inter alia), implicitly connoted by events (Reisinger et al., 2015; White et al., 2016; Greene, 2007; Rashkin et al., 2016), or sentiment polarities of events (Ding and Riloff, 2016; Choi and Wiebe, 2014; Russo et al., 2015; Ding and Riloff, 2018). In addition, recent work has studied the patterns which evoke certain polarities (Reed et al., 2017), the desires which make events affective (Ding et al., 2017), the emotions caused by events (Vu et al., 2014), or, conversely, identifying events or reasoning behind particular emotions (Gui et al., 2017). Compared to this prior literature, our work uniquely learns to model intents and reactions over a diverse set of events, includes inference over event participants not explicitly mentioned in text, and formulates the task as predicting the textual descriptions of the implied commonsense instead of classifying various event attributes.

Previous work in natural language inference has focused on linguistic entailment (Bowman et al., 2015; Bos and Markert, 2005) while ours focuses on commonsense-based inference. There also has been inference or entailment work that is more generation focused: generating, e.g., entailed statements (Zhang et al., 2017; Blouw and Eliasmith, 2018), explanations of causality (Kang et al., 2017), or paraphrases (Dong et al., 2017). Our work also aims at generating inferences from sentences; however, our models infer implicit information about mental states and causality, which has not been studied by most previous systems.

Also related are commonsense knowledge bases (Espinosa and Lieberman, 2005; Speer and Havasi, 2012). Our work complements these existing resources by providing commonsense relations that are relatively less populated in previous work. For instance, ConceptNet contains only 25% of our events, and only 12% have relations that resemble intent and reaction. We present a more detailed comparison with ConceptNet in Appendix C.

7 Conclusion

We introduced a new corpus, task, and model for performing commonsense inference on textually-described everyday events, focusing on stereotypical intents and reactions of people involved in the events. Our corpus supports learning representations over a diverse range of events and reasoning about the likely intents and reactions of previously unseen events. We also demonstrate that such inference can help reveal implicit gender bias in movie scripts.

Acknowledgments

We thank the anonymous reviewers for their insightful comments. We also thank xlab members at the University of Washington, Martha Palmer, Tim O'Gorman, Susan Windisch Brown, and Ghazaleh Kazeminejad, as well as other members at the University of Colorado at Boulder, for many helpful comments on our development of the annotation pipeline. This work was supported in part by the National Science Foundation Graduate Research Fellowship Program under grant DGE-1256082, NSF grant IIS-1714566, and the DARPA CwC program through ARO (W911NF-15-1-0543).

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Apoorv Agarwal, Jiehan Zheng, Shruti Kamath, Sriramkumar Balasubramanian, and Shirin Ann Dey. 2015. Key female characters in film have more to


Page 21


…pragmatic or commonsense interpretation. We scope our study to two distinct types of inference: given a phrase that describes an event, we want to reason about the likely intents and emotional reactions of people who caused or are affected by the event. This complements prior work on more general commonsense inference (Speer and Havasi, 2012; Li et al., 2016; Zhang et al., 2017) by focusing on the causal relations between events and people's mental states, which are not well covered by most existing resources.

We collect a wide range of phrasal event descriptions from stories, blogs, and Wiktionary idioms. Compared to prior work on phrasal embeddings (Wieting et al., 2015; Pavlick et al., 2015), our work generalizes the phrases by introducing (typed) variables. In particular, we replace words that correspond to entity mentions or pronouns with typed variables such as PersonX or PersonY, as shown in the examples in Table 1. More formally, the phrases we extract are a combination of a verb predicate with partially instantiated arguments. We keep specific arguments together with the predicate if they appear frequently enough (e.g., PersonX eats pasta for dinner). Otherwise, the arguments are replaced with an untyped blank (e.g., PersonX eats ___ for dinner). In our work, only person mentions are replaced with typed variables, leaving other types to future research.

Inference types: The first type of pragmatic inference is about intent. We define intent as an explanation of why the agent causes a volitional event to occur (or "none" if the event phrase was unintentional). The intent can be considered a mental pre-condition of an action or an event. For example, if the event phrase is PersonX takes a stab at ___, the annotated intent might be that "PersonX wants to solve a problem".

The second type of pragmatic inference is about emotional reaction. We define reaction as an explanation of how the mental states of the agent and other people involved in the event would change as a result. The reaction can be considered a mental post-condition of an action or an event. For example, if the event phrase is that PersonX gives PersonY ___ as a gift, PersonX might "feel good about themselves" as a result, and PersonY might "feel grateful" or "feel thankful".

Source       # Unique Events   # Unique Verbs   Average κ
ROC Story    13,627            639              0.57
G. N-grams    7,066            789              0.39
Spinn3r       2,130            388              0.41
Idioms        1,916            442              0.42
Total        24,716            1,333            0.45

Table 2: Data and annotation agreement statistics for our new phrasal inference corpus. Each event is annotated by three crowdworkers.

2.1 Event Extraction

We extract phrasal events from three different corpora for broad coverage: the ROC Story training set (Mostafazadeh et al., 2016), the Google Syntactic N-grams (Goldberg and Orwant, 2013), and the Spinn3r corpus (Gordon and Swanson, 2008). We derive events from the set of verb phrases in our corpora, based on syntactic parses (Klein and Manning, 2003). We then replace the predicate subject and other entities with the typed variables (e.g., PersonX, PersonY), and selectively substitute verb arguments with blanks (___). We use frequency thresholds to select events to annotate (for details, see Appendix A.1). Additionally, we supplement the list of events with all 2,000 verb idioms found in Wiktionary, in order to cover events that are less compositional.² Our final annotation corpus contains nearly 25,000 event phrases, spanning over 1,300 unique verb predicates (Table 2).

2.2 Crowdsourcing

We design an Amazon Mechanical Turk task to annotate the mental pre- and post-conditions of event phrases. A snippet of our MTurk HIT design is shown in Figure 2. For each phrase, we ask three annotators whether the agent of the event, PersonX, intentionally causes the event, and if so, to provide up to three possible textual descriptions of their intents. We then ask annotators to provide up to three possible reactions that PersonX might experience as a result. We also ask annotators to provide up to three possible reactions of other people, when applicable. These other people can be either explicitly mentioned (e.g., "PersonY" in PersonX punches PersonY's lights out), or only implied

²We compiled the list of idiomatic verb phrases by cross-referencing between Wiktionary's English idioms category and the Wiktionary English verbs categories.