

What Should I Ask? Using Conversationally Informative Rewards for Goal-Oriented Visual Dialogue

Pushkar Shukla¹, Carlos Elmadjian², Richika Sharan¹, Vivek Kulkarni³, William Yang Wang¹, Matthew Turk¹

¹University of California, Santa Barbara
²University of São Paulo
³Stanford University

¹{pushkarshukla,richikasharan,wangwilliamyang,mturk}@ucsb.edu
²[email protected]  ³[email protected]

Abstract

The ability to engage in goal-oriented conversations has allowed humans to gain knowledge, reduce uncertainty, and perform tasks more efficiently. Artificial agents, however, are still far behind humans in having goal-driven conversations. In this work, we focus on the task of goal-oriented visual dialogue, aiming to automatically generate a series of questions about an image with a single objective. This task is challenging, since these questions must not only be consistent with a strategy to achieve a goal, but also consider the contextual information in the image. We propose an end-to-end goal-oriented visual dialogue system that combines reinforcement learning with regularized information gain. Unlike previous approaches that have been proposed for the task, our work is motivated by the Rational Speech Act framework, which models the process of human inquiry to reach a goal. We test the two versions of our model on the GuessWhat?! dataset, obtaining significant results that outperform the current state-of-the-art models in the task of generating questions to find an undisclosed object in an image.

1 Introduction

Building natural language models that are able to converse towards a specific goal is an active area of research that has attracted a lot of attention in recent years. These models are vital for efficient human-machine collaboration, such as when interacting with personal assistants. In this paper, we focus on the task of goal-oriented visual dialogue, which requires an agent to engage in conversations about an image with a predefined objective. The task presents some unique challenges. Firstly, the conversations should be consistent with the goals of the agent. Secondly, the conversations between two agents must be coherent with the common visual feedback.

Figure 1: An example of goal-oriented visual dialogue for finding an undisclosed object in an image through a series of questions. On the left, we ask a human to guess the unknown object in the image. On the right, we use the baseline model proposed by Strub et al. (2017). While the human is able to narrow down the search space relatively fast, the artificial agent is not able to adopt a clear strategy for guessing the object.

Finally, the agents should come up with a strategy to achieve the objective in the shortest possible way. This is different from a normal dialogue system, where there is no constraint on the length of a conversation.

Inspired by the success of Deep Reinforcement Learning, many recent works have also used it for building models for goal-oriented visual dialogue (Bordes et al., 2017). The choice makes sense, as reinforcement learning is well suited for tasks that require a set of actions to reach a goal. However, the performance of these models has been sub-optimal when compared to the average human performance on the same task.


For example, consider the two conversations shown in Figure 1, which compares questions asked by a human and by the autonomous agent proposed by Strub et al. (2017) to locate an undisclosed object in the image. While humans tend to adopt strategies that narrow down the search space, bringing them closer to the goal, it is not clear whether an artificial agent is capable of learning a similar behavior only by looking at a set of examples. This leads us to pose two questions: What strategies do humans adopt when coming up with a series of questions with respect to a goal? And can these strategies be used to build models that are suited for goal-oriented visual dialogue?

With this challenge in mind, we directed our attention to contemporary works in the fields of cognitive science, linguistics, and psychology on modelling human inquiry (Groenendijk et al., 1984; Nelson, 2005; Van Rooy, 2003). More specifically, our focus lies on how humans come up with a series of questions in order to reach a particular goal. One popular theory suggests that humans try to maximize the expected regularized information gain while asking questions (Hawkins et al., 2015; Coenen et al., 2017). Motivated by that, we evaluate the utility of using information gain for goal-oriented visual question generation within a reinforcement learning paradigm. In this paper, we propose two different approaches for training an end-to-end architecture: first, a novel reward function that is a trade-off between the expected information gain of a question and the cost of asking it; and second, a loss function that uses regularized information gain with a step-based reward function. Our architecture is able to generate goal-oriented questions without using any prior templates. Our experiments are performed on the GuessWhat?! dataset (De Vries et al., 2017), a standard dataset for goal-oriented visual dialogue that focuses on identifying an undisclosed object in the image through a series of questions. Thus, our contribution is threefold:

• An end-to-end architecture for goal-oriented visual dialogue combining Information Gain with Reinforcement Learning.

• A novel reward function for goal-oriented visual question generation to model long-term dependencies in dialogue.

• Both versions of our model outperform the current baselines on the GuessWhat?! dataset for the task of identifying an undisclosed object in an image by asking a series of questions.

2 Related Work

2.1 Models for Human Inquiry

There have been several works in the area of cognitive science that focus on models for question generation. Groenendijk et al. (1984) proposed a theory stating that meaningful questions are propositions conditioned by the quality of their answers. Van Rooy (2003) suggested that the value of a question is proportional to the questioner's interest and the answer that is likely to be provided. Many recent related models take into consideration optimal experimental design (OED) (Nelson, 2005; Gureckis and Markant, 2012), which considers that humans perform intuitive experiments to gain information, while others resort to Bayesian inference. Coenen et al. (2017), for instance, came up with nine important questions about human inquiry, while one recent model, the Rational Speech Act (RSA) (Hawkins et al., 2015), treats questions as a distribution that is proportional to the trade-off between the expected information gain and the cost of asking a question.

2.2 Dialogue Generation and Visual Dialogue

Dialogue generation is an important research topic in NLP, and many approaches have been proposed to address this task. Most earlier works made use of a predefined template (Lemon et al., 2006; Wang and Lemon, 2013) to generate dialogues. More recently, deep neural networks have been used for building end-to-end architectures capable of generating questions (Vinyals and Le, 2015; Sordoni et al., 2015) and also for the task of goal-oriented dialogue generation (Rajendran et al., 2018; Bordes et al., 2017).

Visual dialogue focuses on having a conversation about an image with either one or both of the agents being a machine. Since its inception (Das et al., 2017), different approaches have been proposed to address this problem (Massiceti et al., 2018; Lu et al., 2017; Das et al., 2017). Goal-oriented visual dialogue, on the other hand, is an area that has only been introduced fairly recently. De Vries et al. (2017) proposed the GuessWhat?! dataset for goal-oriented visual dialogue, while Strub et al. (2017) developed a reinforcement learning approach for goal-oriented visual question generation.


Figure 2: A block diagram of our model. The framework is trained on top of three individual models: the questioner (QGen), the guesser, and the oracle. The guesser returns an object distribution given a history of question-answer pairs that are generated by the questioner and the oracle, respectively. These distributions are used for calculating the information gain of the question-answer pair. The information gain and the distribution of probabilities given by the guesser are used either as a reward or optimized as a loss function with global rewards for training the questioner.

More recently, Zhang et al. (2018) used intermediate rewards for training a model on this task.

2.3 Sampling Questions with Information Gain

Information gain has been used before to build question-asking agents, but most of these models resort to it only to sample questions. Rothe et al. (2017) proposed a model that generates questions in a Battleship game scenario; their model uses expected information gain to come up with questions akin to what humans would ask. Lee et al. (2018) used information gain alone to sample goal-oriented questions on the GuessWhat?! task in a non-generative fashion. The work most similar to ours was proposed by Lipton et al. (2017), who used information gain and Q-learning to generate goal-oriented questions for movie recommendations. However, they generated questions with a template-based question generator.

3 The GuessWhat?! Framework

We built our model based on the GuessWhat?! framework (De Vries et al., 2017). GuessWhat?! is a two-player game in which both players are given access to an image containing multiple objects. One of the players – the oracle – chooses an object in the image. The goal of the other player – the questioner – is to identify this object by asking a series of questions to the oracle, who can only give three possible answers: "yes," "no," or "not applicable." Once enough evidence is collected, the questioner has to choose the correct object from a set of possibilities – which, in the case of an artificial agent, are evaluated by a guesser module. If this final guess is correct, the questioner is declared the winner. The GuessWhat?! dataset comprises 155,280 games on 66,537 images from the MS-COCO dataset, with 831,889 question-answer pairs. The dataset has 134,074 unique objects and 4,900 words in the vocabulary.

A game is comprised of an image I with height H and width W, and a dialogue D = {(q_1, a_1), (q_2, a_2), ..., (q_n, a_n)}, where q_j ∈ Q denotes a question from a list of questions and a_j ∈ A denotes an answer from a list of answers, which can either be 〈yes〉, 〈no〉, or 〈N/A〉. The total number of objects in the image is denoted by O and the target is denoted by o*. The term V indicates the vocabulary that comprises all the words employed to train the question generation module (QGen). Each question can be represented as q = {w_i}, where w_i denotes the i-th word in the vocabulary. The set of segmentation masks of objects is denoted by S. These notations are similar to those of Strub et al. (2017). An example of a game can be seen in Figure 1, where the questioner generates a series of questions to guess the undisclosed object.


In the end, the guesser tries to predict the object from the image and the given set of question-answer pairs.

3.1 Learning Environment

We now describe the preliminary models for the questioner, the guesser, and the oracle. Before using them for the GuessWhat?! task, we pre-train all three models in a supervised manner. During the final training on the GuessWhat?! task, our focus is on building a new model for the questioner, and we use the existing pre-trained models for the oracle and the guesser.

3.1.1 The Questioner

The questioner's job is to generate a new question q_{j+1} given the previous j question-answer pairs and the image I. Our model has an architecture similar to the VQG model proposed by Strub et al. (2017). It consists of an LSTM whose inputs are a representation of the image I and, as the input sequence, the previous dialogue history. The image representation is extracted from the fc-8 layer of the VGG16 network (Simonyan and Zisserman, 2014). The output of the LSTM is a probability distribution over all words in the vocabulary. The questioner is trained in a supervised fashion by minimizing the following negative log-likelihood loss function:

$$L_{ques} = -\log p_q(q_{1:J} \mid I, a_{1:J}) = -\sum_{j=1}^{J} \sum_{i=1}^{I_j} \log p_q\!\left(w_i^j \mid w_{1:i-1}^j, (q,a)_{1:j-1}, I\right) \qquad (1)$$

Samples are generated in the following manner during testing: given an initial state s_0 and a new token w_0^j, a word is sampled from the vocabulary. The sampled word, along with the previous state, is given as the input to the next state of the LSTM. The process is repeated until the output of the LSTM is the 〈end〉 token.
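To make this decoding loop concrete, here is a minimal PyTorch sketch of token-by-token sampling from an image-conditioned LSTM. It is an illustration rather than the authors' implementation: the class name, layer sizes, and the use of the fc-8 feature to initialize the hidden state are assumptions, and conditioning on the dialogue history is omitted for brevity.

```python
import torch
import torch.nn as nn

class QuestionerLSTM(nn.Module):
    """Toy question generator: an LSTM over word embeddings, initialized from
    VGG16 fc-8 image features. Dialogue-history conditioning is omitted here."""

    def __init__(self, vocab_size, img_dim=1000, emb_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.img_proj = nn.Linear(img_dim, hidden_dim)   # image feature -> initial hidden state
        self.cell = nn.LSTMCell(emb_dim, hidden_dim)
        self.vocab_proj = nn.Linear(hidden_dim, vocab_size)

    def sample_question(self, img_feat, start_token, end_token, max_len=12):
        # img_feat: (1, img_dim) float tensor with the VGG16 fc-8 feature of the image
        h = torch.tanh(self.img_proj(img_feat))
        c = torch.zeros_like(h)
        word = torch.tensor([start_token])
        question = []
        for _ in range(max_len):
            h, c = self.cell(self.embed(word), (h, c))
            probs = torch.softmax(self.vocab_proj(h), dim=-1)
            word = torch.multinomial(probs, num_samples=1).squeeze(1)  # sample the next word
            if word.item() == end_token:                               # stop at the <end> token
                break
            question.append(word.item())
        return question
```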

3.1.2 The Oracle

The job of the oracle is to come up with an answer to each question that is posed. In our case, the three possible outcomes are 〈yes〉, 〈no〉, or 〈N/A〉. The architecture of the oracle model is similar to the one proposed by De Vries et al. (2017). The input to the oracle is an image, a category vector, and the question encoded using an LSTM. The model then returns a distribution over the possible set of answers.
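A correspondingly small sketch of such an oracle is shown below, assuming the question has already been encoded by an LSTM; the layer sizes and the plain concatenation of the three inputs are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class Oracle(nn.Module):
    """Toy oracle: concatenates the LSTM-encoded question, the image feature,
    and the object-category vector, and predicts one of the three answers."""

    def __init__(self, q_dim=512, img_dim=1000, cat_dim=512, hidden_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(q_dim + img_dim + cat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 3),   # <yes>, <no>, <N/A>
        )

    def forward(self, q_encoding, img_feat, cat_vec):
        x = torch.cat([q_encoding, img_feat, cat_vec], dim=-1)
        return torch.softmax(self.mlp(x), dim=-1)   # distribution over the three answers
```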

3.1.3 The Guesser

The job of the guesser is to return a distribution of probabilities over the set of objects given the input image and the dialogue history. We convert the entire dialogue history into a single encoded vector using an LSTM. All objects are embedded into vectors, and the dot product of these embeddings is taken with the encoded vector containing the dialogue history. The dot product is then passed through an MLP layer that returns the distribution over all objects.
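The following sketch shows one common reading of this description (close to the original GuessWhat?! guesser), in which each object's category and spatial features pass through a small MLP before the dot product; the exact placement of the MLP and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class Guesser(nn.Module):
    """Toy guesser: embeds each candidate object with a small MLP, scores it by
    a dot product with the LSTM-encoded dialogue, and normalizes with a softmax."""

    def __init__(self, obj_feat_dim=8, emb_dim=512, hidden_dim=64):
        super().__init__()
        self.obj_mlp = nn.Sequential(
            nn.Linear(obj_feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, emb_dim),
        )

    def forward(self, dialogue_encoding, obj_feats):
        # dialogue_encoding: (emb_dim,) LSTM encoding of the question-answer history
        # obj_feats:         (num_objects, obj_feat_dim) category/spatial features
        obj_emb = self.obj_mlp(obj_feats)          # (num_objects, emb_dim)
        scores = obj_emb @ dialogue_encoding       # one dot-product score per object
        return torch.softmax(scores, dim=-1)       # distribution over objects
```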

4 Regularized Information Gain

The motivation behind using Regularized Information Gain (RIG) for goal-oriented question asking comes from the Rational Speech Act (RSA) model (Hawkins et al., 2015). RSA tries to mathematically model the process of human questioning and answering. According to this model, when selecting a question from a set of questions, the questioner considers a goal g ∈ G with respect to the world state and returns a probability distribution over questions such that:

$$P(q \mid g) \propto e^{\,D_{KL}(\hat{p}(q \mid g)\,\|\,\tilde{p}(q \mid g)) - C(q)} \qquad (2)$$

where P(q|g) represents the probability of selecting a question q from a set of questions Q. The probability is directly proportional to the trade-off between the cost of asking a question, C(q), and the expected information gain D_KL(p̂(q|g) || p̃(q|g)). The cost may depend on several factors such as the length of the question, the similarity with previously asked questions, or the number of questions that have been asked before. The information gain is defined as the KL divergence between the prior distribution of the world with respect to the goal, p̃(q|g), and the posterior distribution that the questioner would expect after asking the question, p̂(q|g).

Similar to Equation 2, in our model we make use of the trade-off between expected information gain and the cost of asking a question for goal-oriented question generation. Since the cost term regularizes the expected information gain, we denote this trade-off as Regularized Information Gain. For a given question q, the Regularized Information Gain is given as:

$$RIG(q) = \tau(q) - C(q) \qquad (3)$$


where τ(q) is the expected information gain associated with asking the question and C(q) is the cost of asking a question q ∈ Q in a given game. The information gain is measured as the KL divergence between the prior and posterior likelihoods of the scene objects before and after a certain question is asked, weighted by a skewness coefficient β(q) computed over the same posterior.

$$\tau(q) = D_{KL}\!\left(\tilde{p}(q_j \mid I, (q,a)_{1:j-1}) \,\|\, \hat{p}(q_j \mid I, (q,a)_{1:j-1})\right) \beta(q) \qquad (4)$$

The prior distribution before the start of the game is assumed to be 1/N, where N is the total number of objects in the game. After a question is asked, the prior distribution is updated to the output distribution of the guesser:

$$\tilde{p}(q_j \mid I, (q,a)_{1:j-1}) = \begin{cases} p_{guess}(I, (q,a)_{1:j-1}), & \text{if } j \geq 1 \\ \frac{1}{N}, & \text{if } j = 0 \end{cases} \qquad (5)$$

We define the posterior to be the output of the guesser once the answer has been given by the oracle:

$$\hat{p}(q_j \mid I, (q,a)_{1:j-1}) = \sum_{a \in A} p_{guess}(q_j \mid I, (q,a)_{1:j-1}) \qquad (6)$$

The idea behind using skewness is to reward questions that lead to a more skewed distribution at each round. The implication is that a smaller group of objects with higher probabilities lowers the chances of making a wrong guess by the end of the game. Additionally, the measure of skewness also works as a counterweight in certain scenarios where the KL divergence by itself should not reward the outcome of a question, such as when there is a significant information gain from a previous state but the distribution of likely objects, according to the guesser, becomes mostly homogeneous after the question.

Since we assume that initially all objects are equally likely to be the target, the skewness approach is only applied after the first question. We use the posterior distribution provided by the guesser to extract Pearson's second skewness coefficient (i.e., the median skewness) and create the β component. Therefore, assuming a sample mean μ, median m, and standard deviation σ, the skewness coefficient is simply given by:

$$\beta(q) = \frac{3(\mu - m)}{\sigma} \qquad (7)$$

Some questions might have a high information gain, but at a considerable cost. The term C(q) acts as a regularizing component to the information gain and controls what sort of questions should be asked by the questioner. The cost of asking a question can be defined in many ways and may differ from one scenario to another. In our case, we only consider whether a question is being asked more than once, since a repeated question cannot provide any new evidence that will help get closer to the target, despite a high information gain from one state to another during a complete dialogue. The cost for a repeated question is defined as:

$$C(q) = \begin{cases} \tau(q), & \text{if } q_j \in \{q_{j-1}, \dots, q_1\} \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$

For a repeated question, the cost is thus equal to its information gain, which sets the intermediate reward to zero and ensures that the net RIG is zero whenever a question is repeated.
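The sketch below, in NumPy, ties Equations 3 through 8 together for a single question. It is illustrative only: the function names, the small epsilon terms added for numerical stability, and the exact handling of the first question are assumptions.

```python
import numpy as np

def skewness(posterior):
    """Pearson's second skewness coefficient of the posterior (Eq. 7)."""
    mu, med, sigma = posterior.mean(), np.median(posterior), posterior.std()
    return 3.0 * (mu - med) / (sigma + 1e-8)

def regularized_information_gain(prior, posterior, question, previous_questions,
                                 is_first_question=False):
    """RIG of a single question (Eqs. 3, 4 and 8) from the guesser's object
    distributions before (`prior`) and after (`posterior`) the answer."""
    # Expected information gain weighted by the skewness coefficient (Eq. 4);
    # the beta term is only applied after the first question.
    kl = float(np.sum(prior * np.log((prior + 1e-8) / (posterior + 1e-8))))
    beta = 1.0 if is_first_question else skewness(posterior)
    tau = kl * beta
    # A repeated question is fully penalized (Eq. 8), so its net RIG is zero.
    cost = tau if question in previous_questions else 0.0
    return tau - cost   # Eq. 3
```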

5 Our Model

We view the task of generating goal-oriented visual questions as a Markov Decision Process (MDP), and we optimize it using the policy gradient algorithm. In this section, we describe some of the basic terminology employed in our model before moving into its specific aspects.

At any time instance t, the state of the agent can be written as u_t = ((w_1^j, ..., w_m^j), (q,a)_{1:j-1}, I), where I is the image of interest, (q,a)_{1:j-1} is the question-answer history, and (w_1^j, ..., w_m^j) is the previously generated sequence of words for the current question q_j. The action v_t denotes the selection of the next output token from all the tokens in the vocabulary. Each action leads to one of the following outcomes:

1. The selected token is 〈stop〉, marking the end of the dialogue. This indicates that it is now the turn of the guesser to make its guess.

2. The selected token is 〈end〉, marking the end of a question.

3. The selected token is another word from the vocabulary. The word is then appended to the current sequence (w_1^j, ..., w_m^j). This marks the start of the next state.

Our approach models the task of goal-oriented questioning as an optimal stochastic policy π_θ(v|u) over the possible set of state-action pairs.


Algorithm 1 Training the question generator using REINFORCE with the proposed rewards

Require: Pretrained QGen, Oracle, and Guesser
Require: Batch size K

1:  for each update do
2:    for k = 1 to K do
3:      Pick image I_k and the target object o*_k ∈ O_k
4:      N ← |O_k|
5:      p̃(o_{1:N}^k) ← 1/N
6:      for j ← 1 to J_max do
7:        q_j^k ← QGen((q,a)_{1:j-1}^k, I_k)
8:        a_j^k ← Oracle(q_j^k, o*_k, I_k)
9:        p̂(o_{1:N}^k) ← Guesser((q,a)_{1:j}^k, I_k, O_k)
10:       β(q_j^k) ← Skewness(p̂(o_{1:N}^k))
11:       τ(q_j^k) ← D_KL(p̃(o_{1:N}^k) || p̂(o_{1:N}^k)) β(q_j^k)
12:       C(q_j^k) ← τ(q_j^k) if q_j ∈ {q_{j-1}, ..., q_1}, else 0
13:     R ← Σ_{j=1}^{J_max} τ(q_j^k) − C(q_j^k)
14:     p(o_k | ·) ← Guesser((q,a)_{1:j}^k, I_k, O_k)
15:     r(u_t, v_t) ← R if argmax p(o_k | ·) = o*_k, else 0
16:   Define T_h ← ((q,a)_{1:j_k}^k, I_k, r_k)_{1:K}
17:   Evaluate ∇J(θ_h) with Eq. 13 using T_h
18:   SGD update of QGen parameters θ using ∇J(θ_h)
19:   Evaluate ∇L(φ_h) with Eq. 15 using T_h
20:   SGD update of baseline parameters φ using ∇L(φ_h)
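For readers who prefer code, here is an illustrative Python rendering of the reward computation in Algorithm 1, assuming qgen, oracle, guesser, skewness, and kl_divergence are callables defined elsewhere (for instance, the sketches above); the SGD updates of Eqs. 13 and 15 are omitted, and updating the prior to the guesser's output inside the loop is an assumption based on Eq. 5.

```python
import numpy as np

def collect_episode(image, objects, target_idx, qgen, oracle, guesser,
                    skewness, kl_divergence, max_questions=8):
    """Reward computation of Algorithm 1 for a single game; the SGD updates of
    Eqs. 13 and 15 are left out."""
    n = len(objects)
    prior = np.full(n, 1.0 / n)                 # line 5: uniform prior over objects
    history, asked, total_rig = [], [], 0.0
    for _ in range(max_questions):
        q = qgen(history, image)                # line 7: generate the next question
        a = oracle(q, target_idx, image)        # line 8: the oracle answers
        history.append((q, a))
        posterior = guesser(history, image, objects)          # line 9
        beta = skewness(posterior)                            # line 10
        tau = kl_divergence(prior, posterior) * beta          # line 11
        cost = tau if q in asked else 0.0                     # line 12
        asked.append(q)
        total_rig += tau - cost                               # accumulates R (line 13)
        prior = posterior        # assumption: the prior tracks the guesser output (Eq. 5)
    final = guesser(history, image, objects)                  # line 14
    reward = total_rig if int(np.argmax(final)) == target_idx else 0.0   # line 15
    return history, reward
```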

Here θ represents the parameters of our question generation architecture. In this work, we experiment with two different settings to train our model with Regularized Information Gain and policy gradients. In the first setting, we use Regularized Information Gain as an additional term in the loss function of the questioner and train it using policy gradients with a 0-1 reward function. In the second setting, we use Regularized Information Gain to reward our model. Both methods are described below.

5.1 Regularized Information Gain Loss Minimization with 0-1 Rewards

During the training of the GuessWhat?! game we introduce Regularized Information Gain as an additional term in the loss function. The goal is to minimize the negative log-likelihood and maximize the Regularized Information Gain. The loss function for the questioner is given by:

$$\begin{aligned} L(\theta) &= -\log p_q(q_{1:J} \mid I, a_{1:J}) + \tau(q) - C(q) \\ &= -\sum_{j=1}^{J} \sum_{i=1}^{I_j} \log p_q\!\left(w_i^j \mid w_{1:i-1}^j, (q,a)_{1:j-1}, I\right) + D_{KL}\!\left(\hat{p}(q_j \mid I, (q,a)) \,\|\, \tilde{p}(q_j \mid I, (q,a))\right) \beta(q) - C(q) \end{aligned} \qquad (9)$$

We adopt a reinforcement learning paradigm on top of the proposed loss function. We use a zero-one reward function similar to that of Strub et al. (2017) for training our model. The reward function is given as:

$$r(u_t, v_t) = \begin{cases} 1, & \text{if } \arg\max p_{guess} = o^* \\ 0, & \text{otherwise} \end{cases} \qquad (10)$$

Thus, we give a reward of 1 if the guesser is able to guess the right object and 0 otherwise.

5.2 Using Regularized Information Gain as a Reward

Defining a valuable reward function is a crucial aspect of any reinforcement learning problem. Several factors should be considered when designing a good reward function for asking goal-oriented questions. First, the reward function should help the questioner achieve its goal. Second, the reward function should optimize the search space, allowing the questioner to come up with relevant questions. The idea behind using Regularized Information Gain as a reward function is to take into account the long-term dependencies in dialogue. Regularized Information Gain as a reward can help the questioner come up with an efficient strategy to narrow down a large search space. The reward function is given by:

$$r(u_t, v_t) = \begin{cases} \sum_{j=1}^{|Q|} \left(\tau(q_j) - C(q_j)\right), & \text{if } \arg\max p_{guess} = o^* \\ 0, & \text{otherwise} \end{cases} \qquad (11)$$

Thus, the reward is the sum of the trade-offs between the information gain τ(q) and the cost of asking a question C(q) over all questions Q in a given game. The function only rewards the agent if it is able to correctly predict the oracle's initial choice.
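As a compact illustration of how the two reward schemes differ, the sketch below contrasts Eq. 10 and Eq. 11; rig_per_question is a hypothetical list holding tau(q_j) - C(q_j) for each question of a finished game.

```python
def zero_one_reward(guess_correct):
    """Eq. 10: reward 1 only when the guesser picks the target object."""
    return 1.0 if guess_correct else 0.0

def rig_reward(guess_correct, rig_per_question):
    """Eq. 11: sum of tau(q_j) - C(q_j) over the dialogue, gated by correctness."""
    return sum(rig_per_question) if guess_correct else 0.0
```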

5.3 Policy Gradients

Once the reward function is defined, we train our model using the policy gradient algorithm. For a given policy π_θ, the objective function of the policy gradient is given by:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=1}^{T} r(u_t, v_t)\right] \qquad (12)$$

According to Sutton et al. (2000), the gradient of J(θ) can be written as:

$$\nabla J(\theta) \approx \left\langle \sum_{t=1}^{T} \sum_{v_t \in V} \nabla_\theta \log \pi_\theta(u_t, v_t) \left(Q^{\pi_\theta}(u_t, v_t) - b_\phi\right) \right\rangle \qquad (13)$$


Approach                         |        New Image                 |        New Object
                                 | Greedy   Beam    Sampling  Best  | Greedy   Beam    Sampling  Best
Baseline (Strub et al., 2017)    | 46.9%    53.0%   45.0%     53.0% | 53.4%    46.4%   46.44%    53.4%
Strub et al. (2017)              | 58.6%    54.3%   63.2%     63.2% | 57.5%    53.2%   62.0%     62.0%
Zhang et al. (2018)              | 56.1%    54.9%   55.6%     55.6% | 56.51%   56.53%  49.2%     56.53%
TPG (Zhao and Tresp, 2018)¹      | -        -       -         62.6% | -        -       -         -
GDSE-C (Venkatesh et al., 2018)  | -        -       -         60.7% | -        -       -         63.3%
ISM (Abbasnejad et al., 2018)¹   | -        -       -         62.1% | -        -       -         64.2%
RIG as rewards                   | 59.0%    60.21%  64.06%    64.06%| 63.00%   63.08%  65.20%    65.20%
RIG as a loss with 0-1 rewards   | 61.18%   59.79%  65.79%    65.79%| 63.19%   62.57%  67.19%    67.19%

Table 1: A comparison of the recognition accuracy of our model with the state-of-the-art model (Strub et al., 2017) and other concurrent models on the GuessWhat?! task for guessing an object in images from the test set.

where Q^{π_θ}(u_t, v_t) is the state-action value function given by the expected cumulative reward:

$$Q^{\pi_\theta}(u_t, v_t) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t'=t}^{T} r(u_{t'}, v_{t'})\right] \qquad (14)$$

Here b_φ is a baseline function used to reduce the variance. The baseline is a single-layer MLP trained by minimizing the squared error loss:

$$\min_{\phi} L(\phi) = \left\langle \left[b_\phi - r(u_t, v_t)\right]^2 \right\rangle \qquad (15)$$
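A minimal PyTorch sketch of this update is given below, assuming log_probs stacks the log-probabilities of the sampled tokens for one episode, reward is the scalar episode return from Eq. 10 or Eq. 11, state_feats are precomputed (detached) state features, and baseline is a small MLP standing in for b_φ; the optimizer handling is simplified.

```python
import torch

def policy_gradient_step(log_probs, reward, baseline, state_feats,
                         policy_opt, baseline_opt):
    """One REINFORCE update with a learned baseline (Eqs. 13-15)."""
    b = baseline(state_feats).squeeze(-1)          # b_phi(u_t) for each step
    advantage = reward - b.detach()                # return minus baseline, as in Eq. 13
    policy_loss = -(log_probs * advantage).sum()   # ascend the policy-gradient objective
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    baseline_loss = ((b - reward) ** 2).mean()     # squared error of Eq. 15
    baseline_opt.zero_grad()
    baseline_loss.backward()
    baseline_opt.step()
```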

6 Results

The model was trained under the same settings as Strub et al. (2017) in order to obtain a reliable comparison with pre-existing models in terms of accuracy. After supervised training of the question generator, we ran our reinforcement procedure using the policy gradient for 100 epochs with a batch size of 64 and a learning rate of 0.001. The maximum number of questions was 8. The baseline model, the oracle, and the guesser were also trained with the settings described by De Vries et al. (2017), in order to compare the performance of the two reward functions. The errors obtained by the guesser and the oracle were 35.8% and 21.1%, respectively.¹

Table 1 shows our primary results along with the baseline model trained on the standard cross-entropy loss for the task of guessing a new object in the test dataset. We compare our model with the one presented by Strub et al. (2017) and other concurrent approaches. Table 1 also compares our model with others when objects are sampled using a uniform distribution (right column).

¹ In order to have a fair comparison, the results reported for TPG (Zhao and Tresp, 2018) and ISM (Abbasnejad et al., 2018) only take into consideration the performance of the question generator. We do not report the scores that were generated after employing a memory network in the guesser.

6.1 Ablation Study

We performed an ablation analysis over RIG in order to identify its main learning components. The results of the experiments with the reward function based on RIG are presented in Table 2, whereas Table 3 compares the different components of RIG when used as a loss function. The results mentioned under New Images refer to images in the test set, while the results shown under New Objects refer to the analysis made on the training dataset with undisclosed objects different from the ones used during training. For the first set of experiments, we compared the performance of information gain vs. RIG with the skewness coefficient for goal-oriented visual question generation. It is possible to observe that RIG achieves an absolute improvement of 10.57% over information gain when used as a reward function and a maximum absolute improvement of 2.8% when it is optimized in the loss function. Adding the skewness term results in a maximum absolute improvement of 0.9% in the first case and an improvement of 2.3% in the second. Furthermore, we compared the performance of the model when trained using RIG but without policy gradients; the model then achieves an improvement of 10.35% when information gain is used as a loss function.

6.2 Qualitative Analysis

In order to further analyze the performance of our model, we assess it in terms of repetitive questions, since they compromise the framework's efficiency. We compare our model with the one proposed by Strub et al. (2017) and calculate the average number of repeated questions generated per dialogue. The model by Strub et al. achieved a score of 0.82, whereas ours scored 0.36 repeated questions per dialogue, and 0.27 when using RIG as a reward function.


Figure 3: A qualitative comparison of our model with the model proposed by Strub et al. (2017).

Rewards                    | New Images | New Objects
I.G. (greedy)              | 51.6%      | 52.4%
I.G. + skewness (greedy)   | 57.5%      | 62.4%
R.I.G. (greedy)            | 58.8%      | 63.03%

Table 2: An ablation analysis using Regularized Information Gain as a reward on the GuessWhat?! dataset.

Approach                                                      | New Images | New Objects
I.G. as a loss function with no rewards                       | 51.2%      | 52.8%
I.G. as a loss function with 0-1 rewards (greedy)             | 57.3%      | 61.9%
I.G. + skewness as a loss function with 0-1 rewards (greedy)  | 59.47%     | 62.44%
R.I.G. as a loss function with 0-1 rewards (greedy)           | 60.18%     | 63.15%

Table 3: An ablation analysis of using Regularized Information Gain as a loss function with 0-1 rewards. The figures presented in the table indicate the accuracy of the model on the GuessWhat?! dataset.

7 Discussion

Our model was able to achieve an accuracy of 67.19% for the task of asking goal-oriented questions on the GuessWhat?! dataset. This result is the highest obtained so far among existing approaches to this problem, albeit still far from human-level performance on the same task, reported to be 84.4%. Our gains can be explained in part by how RIG with the skewness component for goal-oriented VQG constrains the process of generating relevant questions and, at the same time, allows the agent to reduce the search space significantly, similarly to decision trees and reinforcement learning, but in a very challenging scenario, since the search space in generative models can be significantly large.

Our qualitative results also demonstrate that our approach is able to display certain levels of strategic behavior and mutual consistency between questions in this scenario, as shown in Figure 3. The same cannot be said about previous approaches, as the majority of them fail to avoid redundant or other sorts of expendable questions. We argue that our cost function and the skewness coefficient both play an important role here, as the former penalizes synonymic questions and the latter narrows down the set of optimal questions.

Our ablation analysis showed that information gain alone is not the determinant factor that leads to improved learning, as hypothesized by Lee et al. (2018). However, Regularized Information Gain does have a significant effect, which indicates that a set of constraints, especially regarding the cost of asking a question, cannot be taken lightly in the context of goal-oriented VQG.

8 Conclusion

In this paper we propose a model for goal-oriented visual question generation using two different approaches that leverage information gain with reinforcement learning. Our algorithm achieves improved accuracy and qualitative results in comparison to existing state-of-the-art models on the GuessWhat?! dataset. We also discuss the innovative aspects of our model and how performance could be increased. Our results indicate that RIG is a more promising approach to building better-performing agents capable of displaying strategy and coherence in an end-to-end architecture for Visual Dialogue.

Acknowledgments

We acknowledge partial support of this work by the São Paulo Research Foundation (FAPESP), grant 2015/26802-1.


References

Ehsan Abbasnejad, Qi Wu, Iman Abbasnejad, Javen Shi, and Anton van den Hengel. 2018. An active information seeking model for goal-oriented vision-and-language tasks. arXiv preprint arXiv:1812.06398.

Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. International Conference on Learning Representations.

Anna Coenen, Jonathan D Nelson, and Todd M Gureckis. 2017. Asking the right questions about human inquiry. PsyArXiv.

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jose MF Moura, Devi Parikh, and Dhruv Batra. 2017. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2.

Harm De Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron C Courville. 2017. Guesswhat?! Visual object discovery through multi-modal dialogue. In CVPR, volume 1, page 3.

Jeroen AG Groenendijk, MJB Stokhof, et al. 1984. On the semantics of questions and the pragmatics of answers. Foris, Dordrecht.

Todd M Gureckis and Douglas B Markant. 2012. Self-directed learning: A cognitive and computational perspective. Perspectives on Psychological Science, 7(5):464–481.

Robert XD Hawkins, Andreas Stuhlmuller, Judith Degen, and Noah D Goodman. 2015. Why do you ask? Good questions provoke informative answers. In CogSci.

Sang-Woo Lee, Yu-Jung Heo, and Byoung-Tak Zhang. 2018. Answerer in questioner's mind for goal-oriented visual dialogue. NeurIPS.

Oliver Lemon, Kallirroi Georgila, James Henderson, and Matthew Stuttle. 2006. An ISU dialogue system exhibiting reinforcement learning of dialogue policies: generic slot-filling in the TALK in-car system. In Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations, pages 119–122. Association for Computational Linguistics.

Zachary Lipton, Xiujun Li, Jianfeng Gao, Lihong Li, Faisal Ahmed, and Li Deng. 2017. BBQ-networks: Efficient exploration in deep reinforcement learning for task-oriented dialogue systems. AAAI.

Jiasen Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, and Dhruv Batra. 2017. Best of both worlds: Transferring knowledge from discriminative learning to a generative visual dialog model. In Advances in Neural Information Processing Systems, pages 313–323.

Daniela Massiceti, N Siddharth, Puneet K Dokania, and Philip HS Torr. 2018. FlipDial: A generative model for two-way visual dialogue. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Jonathan D Nelson. 2005. Finding useful questions: On Bayesian diagnosticity, probability, impact, and information gain. Psychological Review, 112(4):979.

Janarthanan Rajendran, Jatin Ganhotra, Satinder Singh, and Lazaros Polymenakos. 2018. Learning end-to-end goal-oriented dialog with multiple answers. EMNLP.

Anselm Rothe, Brenden M Lake, and Todd Gureckis. 2017. Question asking as program generation. In Advances in Neural Information Processing Systems, pages 1046–1055.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. ICLR.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714.

Florian Strub, Harm De Vries, Jeremie Mary, Bilal Piot, Aaron Courville, and Olivier Pietquin. 2017. End-to-end optimization of goal-driven and visually grounded dialogue systems. International Joint Conference on Artificial Intelligence.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063.

Robert Van Rooy. 2003. Questioning to resolve decision problems. Linguistics and Philosophy, 26(6):727–763.

Aashish Venkatesh, Ravi Shekhar, Tim Baumgartner, Elia Bruni, Raffaella Bernardi, and Raquel Fernandez. 2018. Jointly learning to see, ask, and GuessWhat. NAACL.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. ICML Deep Learning Workshop 2015.

Zhuoran Wang and Oliver Lemon. 2013. A simple and generic belief tracking mechanism for the dialog state tracking challenge: On the believability of observed information. In Proceedings of the SIGDIAL 2013 Conference, pages 423–432.

Junjie Zhang, Qi Wu, Chunhua Shen, Jian Zhang, Jianfeng Lu, and Anton Van Den Hengel. 2018. Goal-oriented visual question generation via intermediate rewards. In European Conference on Computer Vision, pages 189–204. Springer.


Rui Zhao and Volker Tresp. 2018. Learning goal-oriented visual dialog via tempered policy gradient. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 868–875. IEEE.