
Resolving Language and Vision Ambiguities Together: Joint Segmentation & Prepositional Attachment Resolution in Captioned Scenes

Gordon Christie1,*, Ankit Laddha2,*, Aishwarya Agrawal1, Stanislaw Antol1, Yash Goyal1, Kevin Kochersberger1, Dhruv Batra3,1

1Virginia Tech  2Carnegie Mellon University  3Georgia Institute of Technology
[email protected]

{gordonac,aish,santol,ygoyal,kbk,dbatra}@vt.edu

Abstract

We present an approach to simultaneously perform semantic segmentation and prepositional phrase attachment resolution for captioned images. Some ambiguities in language cannot be resolved without simultaneously reasoning about an associated image. If we consider the sentence "I shot an elephant in my pajamas", looking at language alone (and not using common sense), it is unclear if it is the person or the elephant wearing the pajamas or both. Our approach produces a diverse set of plausible hypotheses for both semantic segmentation and prepositional phrase attachment resolution that are then jointly reranked to select the most consistent pair. We show that our semantic segmentation and prepositional phrase attachment resolution modules have complementary strengths, and that joint reasoning produces more accurate results than any module operating in isolation. Multiple hypotheses are also shown to be crucial to improved multiple-module reasoning. Our vision and language approach significantly outperforms the Stanford Parser (De Marneffe et al., 2006) by 17.91% (28.69% relative) and 12.83% (25.28% relative) in two different experiments. We also make small improvements over DeepLab-CRF (Chen et al., 2015).

1 Introduction

Perception and intelligence problems are hard.

* Denotes equal contribution

[Figure 1 graphic: an image-caption pair from the PASCAL Sentence Dataset ("A dog is standing next to a woman on a couch"). The NLP module (sentence parsing) outputs M parse-tree hypotheses with the ambiguity "(dog next to woman) on couch" vs. "dog next to (woman on couch)"; the vision module (semantic segmentation) outputs M segmentation hypotheses labeling the person, dog, and couch; a consistent pair of solutions is selected.]

Figure 1: Overview of our approach. We propose a model for simultaneous 2D semantic segmentation and prepositional phrase attachment resolution by reasoning about sentence parses. The language and vision modules each produce M diverse hypotheses, and the goal is to select a pair of consistent hypotheses. In this example, the ambiguity to be resolved from the image caption is whether the dog is standing on or next to the couch. Both modules benefit by selecting a pair of compatible hypotheses.

Whether we are interested in understanding an image or a sentence, our algorithms must operate under tremendous levels of ambiguity. When a human reads the sentence "I eat sushi with tuna", it is clear that the prepositional phrase "with tuna" modifies "sushi" and not the act of eating, but this may be ambiguous to a machine. This problem of determining whether a prepositional phrase ("with tuna") modifies a noun phrase ("sushi") or a verb phrase ("eating") is formally known as Prepositional Phrase Attachment Resolution (PPAR) (Ratnaparkhi et al., 1994). Consider the captioned scene shown in Figure 1.


The caption "A dog is standing next to a woman on a couch" exhibits a PP attachment ambiguity: "(dog next to woman) on couch" vs. "dog next to (woman on couch)". It is clear that having access to image segmentations can help resolve this ambiguity, and having access to the correct PP attachment can help image segmentation.

There are two main roadblocks that keep us from writing a single unified model (say, a graphical model) to perform both tasks: (1) Inaccurate Models – empirical studies (Meltzer et al., 2005, Szeliski et al., 2008, Kappes et al., 2013) have repeatedly found that models are often inaccurate and miscalibrated – their "most-likely" beliefs are placed on solutions far from the ground truth. (2) Search Space Explosion – jointly reasoning about multiple modalities is difficult due to the combinatorial explosion of the search space ({exponentially-many segmentations} × {exponentially-many sentence parses}).

Proposed Approach and Contributions. In this paper, we address the problem of simultaneous object segmentation (also called semantic segmentation) and PPAR in captioned scenes. To the best of our knowledge, this is the first paper to do so.

Our main thesis is that a set of diverse plausible hypotheses can serve as a concise, interpretable summary of uncertainty in vision and language 'modules' (What does the semantic segmentation module see in the world? What does the PPAR module describe?) and form the basis for tractable joint reasoning (How do we reconcile what the semantic segmentation module sees in the world with how the PPAR module describes it?).

Given our two modules with M hypotheses each, how can we integrate beliefs across the segmentation and sentence parse modules to pick the best pair of hypotheses? Our key focus is consistency – correct hypotheses from different modules will be correct in a consistent way, but incorrect hypotheses will be incorrect in incompatible ways. Specifically, we develop a MEDIATOR model that scores pairs for consistency and searches over all M² pairs to pick the highest-scoring one. We demonstrate our approach on three datasets – ABSTRACT-50S (Vedantam et al., 2014), PASCAL-50S, and PASCAL-Context-50S (Mottaghi et al., 2014). We show that our vision+language approach significantly outperforms the Stanford Parser (De Marneffe et al., 2006) by 20.66% (36.42% relative) for ABSTRACT-50S, 17.91% (28.69% relative) for PASCAL-50S, and 12.83% (25.28% relative) for PASCAL-Context-50S. We also make small but consistent improvements over DeepLab-CRF (Chen et al., 2015).

2 Related Work

Most works at the intersection of vision and NLP tend to be 'pipeline' systems, where vision tasks take 1-best inputs from NLP (e.g., sentence parsings) without trying to improve NLP performance, and vice versa. For instance, Fidler et al. (2013) use prepositions to improve object segmentation and scene classification, but only consider the most likely parse of the sentence and do not resolve ambiguities in text. Analogously, Yatskar et al. (2014) investigate the role of object, attribute, and action classification annotations for generating human-like descriptions. While they achieve impressive results at generating descriptions, they assume perfect vision modules to generate sentences. Our work uses current (still imperfect) vision and NLP modules to reason about images and provided captions, and simultaneously improves both the vision and language modules. Similar to our philosophy, an earlier work by Barnard and Johnson (2005) used images to help disambiguate word senses (e.g., piggy banks vs snow banks). In a more recent work, Gella et al. (2016) studied the problem of reasoning about an image and a verb, where they attempt to pick the correct sense of the verb that describes the action depicted in the image. Berzak et al. (2015) resolve linguistic ambiguities in sentences coupled with videos that represent different interpretations of the sentences. Perhaps the work closest to ours is Kong et al. (2014), who leverage information from an RGBD image and its sentential description to improve 3D semantic parsing and resolve ambiguities related to co-reference resolution in the sentences (e.g., what "it" refers to). We focus on a different kind of ambiguity – Prepositional Phrase (PP) attachment resolution. In the classification of parsing ambiguities, co-reference resolution is considered a discourse ambiguity (Poesio and Artstein, 2005) (arising out of two different words across sentences for the same object), while PP attachment is considered a syntactic ambiguity (arising out of multiple valid sentence structures) and is typically considered much more difficult to resolve (Bach, 2016, Davis, 2016).

A number of recent works have studied problems at the intersection of vision and language, such as Visual Question Answering (Antol et al., 2015, Geman et al., 2014, Malinowski et al., 2015), Visual Madlibs (Yu et al., 2015), and image captioning (Vinyals et al., 2015, Fang et al., 2015). Our work falls in this domain, with a key difference that we produce both vision and NLP outputs.

Our work also has similarities with works on 'spatial relation learning' (Malinowski and Fritz, 2014, Lan et al., 2012), i.e., learning a visual representation for noun-preposition-noun triplets ("car on road"). While our approach can certainly utilize such spatial relation classifiers if available, the focus of our work is different. Our goal is to improve semantic segmentation and PPAR by jointly reranking segmentation-parsing solution pairs. Our approach implicitly learns spatial relationships for prepositions ("on", "above"), but these are simply emergent latent representations that help our reranker pick out the most consistent pair of solutions.

Our work utilizes a line of work (Batra et al., 2012, Batra, 2012, Prasad et al., 2014) on producing diverse plausible solutions from probabilistic models, which has been successfully applied to a number of problem domains (Guzman-Rivera et al., 2013, Yadollahpour et al., 2013, Gimpel et al., 2013, Premachandran et al., 2014, Sun et al., 2015, Ahmed et al., 2015).

3 Approach

In order to emphasize the generality of our approach, and to show that our approach is compatible with a wide class of implementations of semantic segmentation and PPAR modules, we present our approach with the modules abstracted as "black boxes" that satisfy a few general requirements and minimal assumptions. In Section 4, we describe each of the modules in detail, making concrete their respective features and other details.

3.1 What is a Module?

The goal of a module is to take input variables x ∈ X (images or sentences) and predict output variables y ∈ Y (semantic segmentation) and z ∈ Z (prepositional attachment expressed in a sentence parse). The two requirements on a module are that it needs to be able to produce scores S(y|x) for potential solutions and a list of plausible hypotheses Y = {y1, y2, . . . , yM}.
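This "black box" abstraction can be made concrete with a minimal interface sketch. This is illustrative only; the class and method names (Module, score, hypotheses) are ours, not from the paper's implementation.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class Module(ABC):
    """Black-box module: scores candidate outputs and returns M plausible hypotheses."""

    @abstractmethod
    def score(self, x: Any, y: Any) -> float:
        """S(y|x): module-internal score/likelihood of output y for input x."""

    @abstractmethod
    def hypotheses(self, x: Any, m: int) -> List[Any]:
        """Return a list of M plausible, mutually diverse hypotheses for input x."""
```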

Multiple Hypotheses. In order to be useful, the set Y of hypotheses must provide an accurate summary of the score landscape. Thus, the hypotheses should be plausible (i.e., high-scoring) and mutually non-redundant (i.e., diverse). Our approach (described next) is applicable to any choice of diverse hypothesis generators. In our experiments, we use the k-best algorithm of Huang and Chiang (2005) for the sentence parsing module and the DivMBest algorithm (Batra et al., 2012) for the semantic segmentation module. Once we instantiate the modules in Section 4, we describe the diverse solution generation in more detail.
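The paper obtains these sets with DivMBest (segmentation) and k-best parsing (PPAR). As a rough stand-in only, the toy sketch below greedily selects a plausible-yet-diverse subset from an already-scored candidate pool by penalizing similarity to previously chosen items; this is not DivMBest, and the candidate pool, scores, and similarity function are assumed to be given.

```python
from typing import Any, Callable, List, Tuple


def diverse_subset(candidates: List[Tuple[Any, float]],
                   similarity: Callable[[Any, Any], float],
                   m: int, lam: float = 1.0) -> List[Any]:
    """Greedily pick m items with high score and low similarity to already-picked items."""
    chosen: List[Any] = []
    pool = list(candidates)
    while pool and len(chosen) < m:
        def adjusted(item):
            cand, score = item
            # Penalize candidates similar to anything already chosen.
            penalty = max((similarity(cand, c) for c in chosen), default=0.0)
            return score - lam * penalty
        best = max(pool, key=adjusted)
        chosen.append(best[0])
        pool.remove(best)
    return chosen
```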

3.2 Joint Reasoning Across Multiple Modules

We now show how to integrate information from both the segmentation and PPAR modules. Recall that our key focus is consistency – correct hypotheses from different modules will be correct in a consistent way, but incorrect hypotheses will be incorrect in incompatible ways. Thus, our goal is to search for a pair (semantic segmentation, sentence parsing) that is mutually consistent.

Let Y = {y1, . . . , yM} denote the M semantic segmentation hypotheses and Z = {z1, . . . , zM} denote the M PPAR hypotheses.

MEDIATOR Model. We develop a "mediator" model that identifies high-scoring hypotheses across modules in agreement with each other. Concretely, we can express the MEDIATOR model as a factor graph where each node corresponds to a module (semantic segmentation and PPAR). Working with such a factor graph is typically completely intractable because each node y, z has exponentially-many states (image segmentations, sentence parsings). As illustrated in Figure 2, in this factor-graph view, the hypothesis sets Y, Z can be considered 'delta-approximations' for reducing the size of the output spaces.

Unary factors S(·) capture the score/likelihood of each hypothesis provided by the corresponding module for the image/sentence at hand. Pairwise factors C(·, ·) represent consistency factors.

[Figure 2 graphic: two nodes, Semantic Segmentation (y) and Sentence Parsing (z), each with a delta-approximated score landscape; unary factors S(yi), S(zj) and a pairwise consistency factor C(yi, zj) connect them.]

Figure 2: Illustrative inter-module factor graph. Each node takes exponentially-many or infinitely-many states, and we use a 'delta approximation' to limit support.

Importantly, since we have restricted each module variable to just M states, we are free to capture arbitrary domain-specific high-order relationships for consistency, without any optimization concerns. In fact, as we describe in our experiments, these consistency factors may be designed to exploit domain knowledge in fairly sophisticated ways.

Consistency Inference. We perform exhaustive inference over all possible tuples:

\[
\operatorname*{argmax}_{i,j \in \{1,\ldots,M\}} \;\Big\{ \mathrm{M}(y_i, z_j) = S(y_i) + S(z_j) + C(y_i, z_j) \Big\}. \tag{1}
\]

Notice that the search space with M hypotheses per module is M². In our experiments, we allow each module to take a different value for M, and typically use around 10 solutions for each module, leading to a mere 100 pairs, which is easily enumerable. We found that even with such a small set, at least one of the solutions in the set tends to be highly accurate, meaning that the hypothesis sets have relatively high recall. This shows the power of using a small set of diverse hypotheses. For a large M, we can exploit a number of standard ideas from the graphical models literature (e.g., dual decomposition or belief propagation). In fact, this is one reason we show the factor graph in Figure 2; there is a natural decomposition of the problem into modules.
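A minimal sketch of this exhaustive pair enumeration follows. The score functions S_y, S_z and the consistency function C are assumed to be given (in our case, the learned linear functions described below), and the function name select_pair is ours.

```python
from typing import Any, Callable, List, Tuple


def select_pair(Y: List[Any], Z: List[Any],
                S_y: Callable[[Any], float],
                S_z: Callable[[Any], float],
                C: Callable[[Any, Any], float]) -> Tuple[Any, Any]:
    """Enumerate all |Y| x |Z| hypothesis pairs and return the highest-scoring one (cf. Eq. 1)."""
    best_pair, best_score = None, float("-inf")
    for y in Y:
        for z in Z:
            score = S_y(y) + S_z(z) + C(y, z)  # M(y, z)
            if score > best_score:
                best_pair, best_score = (y, z), score
    return best_pair
```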

Training MEDIATOR. We can express the MEDIATOR score as M(yi, zj) = wᵀφ(x, yi, zj), i.e., as a linear function of score and consistency features φ(x, yi, zj) = [φS(yi); φS(zj); φC(yi, zj)], where φS(·) are the single-module (semantic segmentation and PPAR module) score features, and φC(·, ·) are the inter-module consistency features. We describe these features in detail in the experiments. We learn these consistency weights w from a dataset annotated with ground truth for the two modules y, z. Let {y∗, z∗} denote the oracle pair, composed of the most accurate solutions in the hypothesis sets. We learn the MEDIATOR parameters in a discriminative learning fashion by solving the following Structured SVM problem:

\[
\min_{w,\,\xi_{ij}} \;\; \frac{1}{2} w^\top w + C \sum_{ij} \xi_{ij} \tag{2a}
\]
\[
\text{s.t.} \quad \underbrace{w^\top \phi(x, y^*, z^*)}_{\text{Score of oracle tuple}} \;-\; \underbrace{w^\top \phi(x, y_i, z_j)}_{\text{Score of any other tuple}} \;\ge\; \underbrace{1}_{\text{Margin}} \;-\; \underbrace{\frac{\xi_{ij}}{L(y_i, z_j)}}_{\text{Slack scaled by loss}} \qquad \forall\, i, j \in \{1, \ldots, M\}. \tag{2b}
\]

Intuitively, we can see that the constraint (2b) tries to maximize the (soft) margin between the score of the oracle pair and all other pairs in the hypothesis sets. Importantly, the slack (or violation in the margin) is scaled by the loss of the tuple. Thus, if there are other good pairs not too much worse than the oracle, the margin for such tuples will not be tightly enforced. On the other hand, the margin between the oracle and bad tuples will be very strictly enforced.
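To make the objective concrete, the sketch below takes a plain subgradient step on the slack-rescaled hinge form of (2a)-(2b). This is only an illustrative approximation, not the cutting-plane or QP solver one would normally use for a Structured SVM; the pairwise feature vectors, the oracle indices, and the loss values are assumed inputs, and the function name is ours.

```python
import numpy as np


def ssvm_subgradient_step(w, phi, oracle_ij, loss, C=1.0, lr=1e-3):
    """One subgradient step on 0.5||w||^2 + C * sum_ij max(0, L_ij * (1 - w.(phi* - phi_ij))).

    phi[i][j]  : numpy feature vector of the pair (y_i, z_j)
    oracle_ij  : (i*, j*) indices of the oracle pair
    loss[i][j] : L(y_i, z_j), zero for the oracle pair
    """
    phi_star = phi[oracle_ij[0]][oracle_ij[1]]
    grad = w.copy()  # gradient of the 0.5||w||^2 regularizer
    for i in range(len(phi)):
        for j in range(len(phi[i])):
            if (i, j) == oracle_ij:
                continue
            margin = w @ (phi_star - phi[i][j])
            if loss[i][j] * (1.0 - margin) > 0.0:  # slack-rescaled constraint violated
                grad += C * loss[i][j] * (phi[i][j] - phi_star)
    return w - lr * grad
```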

This learning procedure requires us to define the loss function L(yi, zj), i.e., the cost of predicting a tuple (semantic segmentation, sentence parsing). We use a weighted average of the individual losses:

\[
L(y_i, z_j) = \alpha\, \ell(y_{gt}, y_i) + (1 - \alpha)\, \ell(z_{gt}, z_j) \tag{3}
\]

The standard measure for evaluating semantic segmentation is the average Jaccard Index (or Intersection-over-Union) (Everingham et al., 2010), while for evaluating sentence parses w.r.t. their prepositional phrase attachment, we use the fraction of prepositions correctly attached. In our experiments, we report results with such a convex combination of module loss functions (for different values of α).
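A sketch of this combined loss, assuming segmentations are given as integer label maps and PP attachments as sets of (object1, preposition, object2) triples (representations we choose for illustration), and taking each ℓ to be one minus the corresponding accuracy, which is one natural choice:

```python
import numpy as np


def jaccard_index(pred, gt, num_classes):
    """Class-averaged Jaccard Index (IoU) between two integer label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0


def ppar_accuracy(pred_attachments, gt_attachments):
    """Fraction of ground-truth preposition attachments recovered by the parse."""
    if not gt_attachments:
        return 1.0
    return len(set(pred_attachments) & set(gt_attachments)) / len(gt_attachments)


def tuple_loss(y_pred, y_gt, z_pred, z_gt, num_classes, alpha=0.5):
    """L(y, z) = alpha * (1 - Jaccard) + (1 - alpha) * (1 - PPAR accuracy)  (cf. Eq. 3)."""
    seg_loss = 1.0 - jaccard_index(y_pred, y_gt, num_classes)
    ppar_loss = 1.0 - ppar_accuracy(z_pred, z_gt)
    return alpha * seg_loss + (1.0 - alpha) * ppar_loss
```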


4 Experiments

We now describe the setup of our experiments, provide implementation details of the modules, and describe the consistency features.

Datasets. Access to richly annotated image + caption datasets is crucial for performing quantitative evaluations. Since this is the first paper to study the problem of joint segmentation and PPAR, no standard datasets for this task exist, so we had to curate our own annotations for PPAR on three image caption datasets – ABSTRACT-50S (Vedantam et al., 2014), PASCAL-50S (Vedantam et al., 2014) (which expands the UIUC PASCAL sentence dataset (Rashtchian et al., 2010) from 5 captions per image to 50), and PASCAL-Context-50S (Mottaghi et al., 2014) (which uses the PASCAL Context image annotations and the same sentences as PASCAL-50S). Our annotations are publicly available on the authors' webpages. To curate the PASCAL-Context-50S PPAR annotations, we first select all sentences that have preposition phrase attachment ambiguities. We then plotted the distribution of prepositions in these sentences. The top 7 prepositions are used, as there is a large drop in the frequencies beyond these. The 7 prepositions are: "on", "with", "next to", "in front of", "by", "near", and "down". We then further sampled sentences to ensure a uniform distribution across prepositions. We perform a similar filtering for PASCAL-50S and ABSTRACT-50S (using the top-6 prepositions for ABSTRACT-50S). Details are in the appendix. We consider a preposition ambiguous if there are at least two parsings where one of the two objects in the preposition dependency is the same across the two parsings while the other object is different (e.g., (dog on couch) and (woman on couch)); a small sketch of this check appears after the dataset summary below. To summarize the statistics of all three datasets:

1. ABSTRACT-50S (Vedantam et al., 2014): 25,000 sentences (50 per image) with 500 images from abstract scenes made from clipart. Filtering for captions containing the top-6 prepositions resulted in 399 sentences describing 201 unique images. These 6 prepositions are: "with", "next to", "on top of", "in front of", "behind", and "under". Overall, there are 502 total prepositions, 406 ambiguous prepositions, an 80.88% ambiguity rate, and 60 sentences with multiple ambiguous prepositions.

2. PASCAL-50S (Vedantam et al., 2014): 50,000 sentences (50 per image) for the images in the UIUC PASCAL sentence dataset (Rashtchian et al., 2010). Filtering for the top-7 prepositions resulted in a total of 30 unique images and 100 image-caption pairs, where ground-truth PPAR were carefully annotated by two vision + NLP graduate students. Overall, there are 213 total prepositions, 147 ambiguous prepositions, a 69.01% ambiguity rate, and 35 sentences with multiple ambiguous prepositions.

3. PASCAL-Context-50S (Mottaghi et al., 2014): We use images and captions from PASCAL-50S, but with PASCAL Context segmentation annotations (60 categories instead of 21). This makes the vision task more challenging. Filtering this dataset for the top-7 prepositions resulted in a total of 966 unique images and 1,822 image-caption pairs. Ground-truth annotations for the PPAR were collected using Amazon Mechanical Turk. Workers were shown an image and a prepositional attachment (extracted from the corresponding parsing of the caption) as a phrase ("woman on couch"), and asked if it was correct. A screenshot of our interface is available in the appendix. Overall, there are 2,540 total prepositions, 2,147 ambiguous prepositions, an 84.53% ambiguity rate, and 283 sentences with multiple ambiguous prepositions.
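As referenced above, a small sketch of the ambiguity check applied during curation, assuming each parse is represented as a set of (preposition, object1, object2) triples (a representation we choose for illustration):

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (preposition, object1, object2)


def is_ambiguous(prep: str, parse_a: Set[Triple], parse_b: Set[Triple]) -> bool:
    """True if two parses attach `prep` such that one object is shared and the other differs,
    e.g. ('on', 'dog', 'couch') vs ('on', 'woman', 'couch')."""
    a = {(o1, o2) for p, o1, o2 in parse_a if p == prep}
    b = {(o1, o2) for p, o1, o2 in parse_b if p == prep}
    for o1a, o2a in a:
        for o1b, o2b in b:
            # Exactly one of the two objects matches across the two parses.
            if (o1a == o1b) != (o2a == o2b):
                return True
    return False
```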

Setup. Single Module: We first show that visual features help PPAR by using the ABSTRACT-50S dataset, which contains clipart scenes where the extent and position of all the objects in the scene are known. This allows us to consider a scenario with a perfect vision system.

Multiple Modules: In this experiment we use imperfect language and vision modules, and show improvements on the PASCAL-50S and PASCAL-Context-50S datasets.

Module 1: Semantic Segmentation (SS) y. We use DeepLab-CRF (Chen et al., 2015) and DivMBest (Batra et al., 2012) to produce M diverse segmentations of the images. To evaluate, we use the image-level class-averaged Jaccard Index.

Module 2: PP Attachment Resolution (PPAR) z. We use a recent version (v3.3.1; released 2014) of the PCFG Stanford parser module (De Marneffe et al., 2006, Huang and Chiang, 2005) to produce M parsings of the sentence. In addition to the parse trees, the module can also output dependencies, which make syntactic relationships more explicit. Dependencies come in the form dependency_type(word1, word2), such as the preposition dependency prep_on(woman-8, couch-11) (the number indicates the word's position in the sentence). To evaluate, we count the percentage of preposition attachments that the parse gets correct.
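For concreteness, a small sketch of extracting preposition attachments from dependency strings of the above form (the regular expression and function name are ours; in practice one would normally consume the Stanford parser output through its API rather than by string matching):

```python
import re

DEP_RE = re.compile(r"prep_(\w+)\((\w+)-\d+,\s*(\w+)-\d+\)")


def extract_pp_attachments(dependencies):
    """Turn strings like 'prep_on(woman-8, couch-11)' into ('on', 'woman', 'couch') triples."""
    attachments = []
    for dep in dependencies:
        match = DEP_RE.match(dep)
        if match:
            prep, head, obj = match.groups()
            attachments.append((prep, head, obj))
    return attachments


# Example:
# extract_pp_attachments(["prep_on(woman-8, couch-11)"]) -> [('on', 'woman', 'couch')]
```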

Baselines:

• INDEP. In our experiments, we compare our proposed approach (MEDIATOR) to the highest-scoring solution predicted independently by each module. For semantic segmentation this is the output of DeepLab-CRF (Chen et al., 2015), and for the PPAR module this is the 1-best output of the Stanford Parser (De Marneffe et al., 2006, Huang and Chiang, 2005). Since our hypothesis lists are generated by greedy M-Best algorithms, this corresponds to predicting the (y1, z1) tuple. This comparison establishes the importance of joint reasoning. To the best of our knowledge, there is no existing (or even natural) joint model to compare to.

• DOMAIN ADAPTATION. We learn a reranker on the parses. Note that domain adaptation is only needed for PPAR, since the Stanford parser is trained on the Penn Treebank (Wall Street Journal text) and not on text about images (such as image captions). Such domain adaptation is not necessary for semantic segmentation. This is a competitive single-module baseline. Specifically, we use the same parse-based features as our approach, and learn a reranker over the Mz parse trees (Mz = 10).

Our approach (MEDIATOR) significantly outperforms both baselines. The improvements over INDEP show that joint reasoning produces more accurate results than any module (vision or language) operating in isolation. The improvements over DOMAIN ADAPTATION establish that the source of the improvements is indeed vision, and not the reranking step. Simply adapting the parser from its original training domain (Wall Street Journal) to our domain (image captions) is not enough.

Ablative Study. Ours-CASCADE: This ablation studies the importance of multiple hypotheses. For each module (say y), we feed the single-best output of the other module z1 as input. Each module learns its own weights w using exactly the same consistency features and learning algorithm as MEDIATOR, and predicts one of the plausible hypotheses yCASCADE = argmax_{y∈Y} wᵀφ(x, y, z1). This ablation of our system is similar to (Heitz et al., 2008) and helps us disentangle the benefits of multiple hypotheses and joint reasoning.

Finally, we note that Ours-CASCADE can be viewed as a special case of MEDIATOR. Let MEDIATOR-(My, Mz) denote our approach run with My hypotheses for the first module and Mz for the second. Then INDEP corresponds to MEDIATOR-(1, 1), and CASCADE corresponds to predicting the y solution from MEDIATOR-(My, 1) and the z solution from MEDIATOR-(1, Mz). To get an upper bound on our approach, we report oracle, the accuracy of the most accurate tuple among the 10 × 10 tuples.
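Using the hypothetical select_pair enumeration sketched earlier, these special cases amount to truncating the hypothesis lists before enumeration (Y, Z, S_y, S_z, C, My, Mz are assumed from that sketch):

```python
# INDEP: MEDIATOR-(1, 1) -- take each module's 1-best.
indep = select_pair(Y[:1], Z[:1], S_y, S_z, C)

# CASCADE: fix one module to its 1-best and rerank the other.
y_cascade, _ = select_pair(Y[:My], Z[:1], S_y, S_z, C)
_, z_cascade = select_pair(Y[:1], Z[:Mz], S_y, S_z, C)

# Full MEDIATOR-(My, Mz): enumerate all My x Mz pairs.
mediator = select_pair(Y[:My], Z[:Mz], S_y, S_z, C)
```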

Our results are presented where MEDIATOR was trained with an equally weighted loss (α = 0.5), but we provide additional results for varying values of α in the appendix.

MEDIATOR and Consistency Features. Recall that we have two types of features – (1) score features φS(yi) and φS(zj), which try to capture how likely solutions yi and zj are, respectively, and (2) consistency features φC(yi, zj), which capture how consistent the PP attachments in zj are with the segmentation in yi. For each (object1, preposition, object2) in zj, we compute 6 features between the object1 and object2 segmentations in yi. Since the humans writing the captions may use multiple synonymous words (e.g., dog, puppy) for the same visual entity, we use word2vec (Mikolov et al., 2013) similarities to map the nouns in the sentences to the corresponding dataset categories.
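A sketch of this noun-to-category mapping, assuming a pre-trained gensim KeyedVectors word2vec model is available (the loading path in the usage comment is a placeholder):

```python
from gensim.models import KeyedVectors


def map_noun_to_category(noun, categories, w2v):
    """Map a caption noun (e.g. 'puppy') to its most similar dataset category (e.g. 'dog')."""
    best_cat, best_sim = None, -1.0
    for cat in categories:
        if noun in w2v and cat in w2v:
            sim = w2v.similarity(noun, cat)
            if sim > best_sim:
                best_cat, best_sim = cat, sim
    return best_cat, best_sim


# Example usage (placeholder path to any pre-trained word2vec binary):
# w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
# map_noun_to_category("puppy", ["dog", "cat", "person", "sofa"], w2v)
```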

• Semantic Segmentation Score Features (φS(yi)) (2-dim): We use ranks and solution scores from DeepLab-CRF (Chen et al., 2015).

• PPAR Score Features (φS(zj)) (9-dim): We use ranks and the log probability of parses from (De Marneffe et al., 2006), and 7 binary indicators for PASCAL (6 for ABSTRACT-50S) denoting which prepositions are present in the parse.

Page 7: arXiv:1604.02125v4 [cs.CV] 26 Sep 2016 · Gordon Christie 1;, Ankit Laddha2, Aishwarya Agrawal , Stanislaw Antol Yash Goyal 1, Kevin Kochersberger , Dhruv Batra3;1 1Virginia Tech

Figure 3: Example on PASCAL-50S ("A dog is standing next to a woman on a couch."). The ambiguity in this sentence is "(dog next to woman) on couch" vs "dog next to (woman on couch)". We calculate the horizontal and vertical distances between the segmentation centers of "person" and "couch" and between the segmentation centers of "dog" and "couch". We see that the "dog" is much further below the couch (53.91) than the woman (2.65). So, if the MEDIATOR model learned that "on" means the first object is above the second object, we would expect it to choose the "person on couch" preposition parsing.

• Inter-Module Consistency Features (56-dim): For each of the 7 prepositions, 8 features are calculated (a sketch of these computations appears after this list):

– One feature is the Euclidean distance between the centers of the segmentation masks of the two objects connected by the preposition. These two objects in the segmentation correspond to the categories with which the soft similarity of the two objects in the sentence is highest among all PASCAL categories.

– Four features capture max{0, normalized-directional-distance}, where directional distance measures the above/below/left/right displacements between the two objects in the segmentation, and normalization involves dividing by height/width.

– One feature is the ratio of sizes between object1 and object2 in the segmentation.

– Two features capture the word2vec similarity of the two objects in the PPAR (say 'puppy' and 'kitty') with their most similar PASCAL category (say 'dog' and 'cat'), where these features are 0 if the categories are not present in the segmentation.

A visual illustration of some of these features for PASCAL can be seen in Figure 3. In the case where an object parsed from zj is not present in the segmentation yi, the distance features are set to 0. The ratio-of-areas feature (area of smaller object / area of larger object) is also set to 0, assuming that the smaller object is missing. In the case where an object has two or more connected components in the segmentation, the distances are computed w.r.t. the centroid of the segmentation and the area is computed as the number of pixels in the union of the instance segmentation masks. We also calculate 20 features for PASCAL-50S and 59 features for PASCAL-Context-50S that capture the consistency between yi and zj in terms of presence/absence of PASCAL categories. For each noun in the PPAR we compute its word2vec similarity with all PASCAL categories. For each of the PASCAL categories, the feature is the sum of similarities (with the PASCAL category) over all nouns if the category is present in the segmentation, and is -1 times the sum of similarities over all nouns otherwise. This feature set was not used for ABSTRACT-50S, since these features were intended to help improve the accuracy of the semantic segmentation module. For ABSTRACT-50S, we only use the 5 distance features, resulting in a 30-dim feature vector.
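A sketch of the geometric part of these features, computed between two binary masks with NumPy (feature ordering and names are ours; the word2vec and category-presence features are omitted here):

```python
import numpy as np


def geometric_features(mask1, mask2):
    """Distance, directional, and area-ratio features between two binary segmentation masks."""
    h, w = mask1.shape
    if mask1.sum() == 0 or mask2.sum() == 0:
        return np.zeros(6)  # missing object: zero out the features, as described above
    c1 = np.array(np.nonzero(mask1)).mean(axis=1)  # (row, col) centroid of object1
    c2 = np.array(np.nonzero(mask2)).mean(axis=1)  # (row, col) centroid of object2
    euclid = np.linalg.norm(c1 - c2)
    dy, dx = c1 - c2                                # displacement of object1 relative to object2
    above = max(0.0, -dy / h)                       # object1 above object2, normalized by height
    below = max(0.0, dy / h)
    left = max(0.0, -dx / w)                        # normalized by width
    right = max(0.0, dx / w)
    area_ratio = min(mask1.sum(), mask2.sum()) / max(mask1.sum(), mask2.sum())
    return np.array([euclid, above, below, left, right, area_ratio])
```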

4.1 Single-Module Results

We performed 10-fold cross-validation on the ABSTRACT-50S dataset to pick M (= 10) and the weight on the hinge loss for MEDIATOR (C). The results are presented in Table 1. Our approach significantly outperforms the 1-best outputs of the Stanford Parser (De Marneffe et al., 2006) by 20.66% (36.42% relative). This shows a need for diverse hypotheses and reasoning about visual features when picking a sentence parse. oracle denotes the best achievable performance using these 10 hypotheses.

Module   Stanford Parser   Domain Adaptation   Ours    oracle
PPAR     56.73             57.23               77.39   97.53

Table 1: Results on our subset of ABSTRACT-50S.

4.2 Multiple-Module Results

We performed 10-fold cross-validation for our results on PASCAL-50S and PASCAL-Context-50S, with 8 train folds, 1 val fold, and 1 test fold, where the val fold was used to pick My, Mz, and C.

Figure 4: (a) Validation accuracies for different values of M on ABSTRACT-50S, (b) for different values of My, Mz on PASCAL-50S, and (c) for different values of My, Mz on PASCAL-Context-50S.

                     PASCAL-50S                                              PASCAL-Context-50S
                     Instance-Level Jaccard Index   PPAR Acc.   Average      Instance-Level Jaccard Index   PPAR Acc.   Average
DeepLab-CRF          66.83                           -           -           43.94                           -           -
Stanford Parser      -                               62.42       -           -                               50.75       -
Average              -                               -           64.63       -                               -           47.345
Domain Adaptation    -                               72.08       -           -                               58.32       -
Ours CASCADE         67.56                           75.00       71.28       43.94                           63.58       53.76
Ours MEDIATOR        67.58                           80.33       73.96       43.94                           63.58       53.76
oracle               69.96                           96.50       83.23       49.21                           75.75       62.48

Table 2: Results on our subset of the PASCAL-50S and PASCAL-Context-50S datasets. We are able to significantly outperform the Stanford Parser and make small improvements over DeepLab-CRF for PASCAL-50S.

Figure 4 shows the average combined accuracy on val, which was found to be maximal at My = 5, Mz = 3 for PASCAL-50S and My = 1, Mz = 10 for PASCAL-Context-50S; these values are used at test time.

We present our results in Table 2. Our approach significantly outperforms the Stanford Parser (De Marneffe et al., 2006) by 17.91% (28.69% relative) for PASCAL-50S and 12.83% (25.28% relative) for PASCAL-Context-50S. We also make small improvements over DeepLab-CRF (Chen et al., 2015) in the case of PASCAL-50S. To measure the statistical significance of our results, we performed paired t-tests between MEDIATOR and INDEP. For both modules (and the average), the null hypothesis (that the accuracies of our approach and the baseline come from the same distribution) can be successfully rejected at a p-value of 0.05. For the sake of completeness, we also compared MEDIATOR with our ablated system (CASCADE) and found statistically significant differences only in PPAR.

These results demonstrate a need for each module to produce a diverse set of plausible hypotheses for our MEDIATOR model to reason about. In the case of PASCAL-Context-50S, MEDIATOR performs identically to CASCADE since My is chosen as 1 (which is the CASCADE setting) in cross-validation. Recall that MEDIATOR is a larger model class than CASCADE (in fact, CASCADE is a special case of MEDIATOR with My = 1). It is interesting to see that the larger model class does not hurt, and MEDIATOR gracefully reduces to a smaller-capacity model (CASCADE) if the amount of data is not enough to warrant the extra capacity. We hypothesize that in the presence of more training data, cross-validation may pick a different setting of My and Mz, resulting in full utilization of the model capacity. Also note that our domain adaptation baseline achieved an accuracy higher than MAP/Stanford-Parser, but significantly lower than our approach, for both PASCAL-50S and PASCAL-Context-50S.

Figure 5: Visualizations for 3 different prepositions ("on", "by", "with"; red = high scores, blue = low scores). We can see that our model has implicitly learned spatial arrangements, unlike other spatial relation learning (SRL) works.

                            PASCAL-50S                                  PASCAL-Context-50S
Feature set                 Instance-Level Jaccard Index   PPAR Acc.    PPAR Acc.
All features                67.58                          80.33        63.58
Drop all consistency        66.96                          66.67        61.47
Drop Euclidean distance     67.27                          77.33        63.77
Drop directional distance   67.12                          78.67        63.63
Drop word2vec               67.58                          78.33        62.72
Drop category presence      67.48                          79.25        61.19

Table 3: Ablation study of different feature combinations. Only PPAR Acc. is shown for PASCAL-Context-50S because My = 1.

We also ran the domain adaptation baseline for our single-module experiment and picked Mz (= 10) with cross-validation, which resulted in an accuracy of 57.23%. Again, this is higher than MAP/Stanford-Parser (56.73%), but significantly lower than our approach (77.39%). Clearly, domain adaptation alone is not sufficient. We also see that the oracle performance is fairly high, suggesting that when there is ambiguity and room for improvement, MEDIATOR is able to rerank effectively.

Ablation Study for Features. Table 3 displays results of an ablation study on PASCAL-50S and PASCAL-Context-50S to show the importance of the different features. In each row, we retain the module score features and drop a single set of consistency features. We can see all consistency features contribute to the performance of MEDIATOR.

Visualizing Prepositions. Figure 5 shows a visualization of what our MEDIATOR model has implicitly learned about 3 prepositions ("on", "by", "with"). These visualizations show the score obtained by taking the dot product of the distance features (Euclidean and directional) between object1 and object2 connected by the preposition with the corresponding learned weights of the model, considering object2 to be at the center of the visualization. Notice that these were learned without explicit training for spatial learning as in spatial relation learning (SRL) works (Malinowski and Fritz, 2014, Lan et al., 2012). They were simply recovered as an intermediate step towards reranking SS + PPAR hypotheses. Also note that SRL cannot handle multiple segmentation hypotheses, which our work shows are important (Table 2, CASCADE). In addition, our approach is more general.
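Such a visualization can be produced with a sketch like the one below, which sweeps object1 over a grid of offsets around a fixed object2 and scores each offset with the learned weights for one preposition. The feature ordering mirrors the hypothetical geometric_features sketch above, and w_prep is assumed to be the slice of the learned w corresponding to that preposition's distance features (Euclidean plus four directional).

```python
import numpy as np


def preposition_heatmap(w_prep, grid=101):
    """Score a grid of object1 positions relative to object2 (fixed at the center)."""
    heat = np.zeros((grid, grid))
    center = np.array([grid // 2, grid // 2])
    for r in range(grid):
        for c in range(grid):
            offset = np.array([r, c]) - center
            euclid = np.linalg.norm(offset) / grid
            above = max(0.0, -offset[0] / grid)
            below = max(0.0, offset[0] / grid)
            left = max(0.0, -offset[1] / grid)
            right = max(0.0, offset[1] / grid)
            feats = np.array([euclid, above, below, left, right])
            heat[r, c] = w_prep @ feats  # dot product with learned distance-feature weights
    return heat
```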

5 Discussions and Conclusion

We presented an approach for simultaneous reasoning about prepositional phrase attachment resolution of captions and semantic segmentation in images that integrates beliefs across the modules to pick the best pair from a diverse set of hypotheses. Our full model (MEDIATOR) significantly improves the accuracy of PPAR over the Stanford Parser by 17.91% for PASCAL-50S and by 12.83% for PASCAL-Context-50S, and achieves a small improvement in semantic segmentation over DeepLab-CRF for PASCAL-50S. These results demonstrate a need for information exchange between the modules, as well as a need for a diverse set of hypotheses to concisely capture the uncertainties of each module. Large gains in PPAR validate our intuition that vision is very helpful for dealing with ambiguity in language. Furthermore, we see that even larger gains are possible from the oracle accuracies.

While we have demonstrated our approach on a task involving simultaneous reasoning about language and vision, our approach is general and can be used for other applications. Overall, we hope our approach will be useful in a number of settings.

Acknowledgements. We thank Larry Zitnick, Mohit Bansal, Kevin Gimpel, and Devi Parikh for helpful discussions, suggestions, and feedback included in this work. A majority of this work was done while AL was an intern at Virginia Tech. This work was partially supported by a National Science Foundation CAREER award, an Army Research Office YIP Award, an Office of Naval Research grant N00014-14-1-0679, and GPU donations by NVIDIA, all awarded to DB. GC was supported by a DTRA Basic Research grant HDTRA1-13-1-0015 provided by KK. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing official policies or endorsements, either expressed or implied, of the U.S. Government or any sponsor.


Appendix Overview

In this appendix, we provide the following:

Appendix I: Additional motivation for our MEDIATOR model.
Appendix II: Background on ABSTRACT-50S.
Appendix III: Details of the dataset curation process for the ABSTRACT-50S, PASCAL-50S, and PASCAL-Context-50S datasets.
Appendix IV: Results where we study the effect of varying the weighting of each module in our approach.
Appendix V: Performances and gains over the independent baseline on PASCAL-Context-50S for each preposition.
Appendix VI: Qualitative examples from our approach.

Appendix I Additional Motivation for MEDIATOR

An example providing additional motivation for our approach is shown in Figure 6, where the ambiguous sentence that describes the image is "A dog is standing next to a woman on a couch". The ambiguity is "(dog next to woman) on couch" vs "dog next to (woman on couch)", which is reflected in the parse trees' uncertainty. Parse tree #1 (Figure 6g) shows "standing" (the verb phrase of the noun "dog") connected with "couch" via the "on" preposition, whereas parse trees #2 (Figure 6h) and #3 (Figure 6i) show "woman" connected with "couch" via the "on" preposition. This ambiguity can be resolved if we look at an accurate semantic segmentation such as Hypothesis #3 (Figure 6f) of the associated image (Figure 6b). Likewise, we might be able to do better at semantic segmentation if we choose a segmentation that is consistent with the sentence, such as Segmentation Hypothesis #3 (Figure 6f), which contains a person on a couch with a dog next to them, unlike the other two hypotheses (Figure 6d and Figure 6e).

Appendix II Background About ABSTRACT-50S

The Abstract Scenes dataset (Zitnick and Parikh, 2013) contains synthetic images generated by human subjects via a drag-and-drop clipart interface. The subjects are given access to a (random) subset of 56 clipart objects that can be found in park scenes, as well as two characters, Mike and Jenny, with a variety of poses and expressions. Example scenes can be found in Figure 7. The motivation is to allow researchers to focus on higher-level semantic understanding without having to deal with noisy information extraction from real images, since the entire contents of the scene are known exactly, while also providing a dense semantic space to study (due to the heavily constrained world). We used the dataset in precisely this way, to first test the PPAR module in isolation and demonstrate that this problem can be helped by a sentence's corresponding image.

Figure 7: We show some example scenes from (Zitnick and Parikh, 2013). Each column shows two semantically similar scenes, while the different columns show the diversity of scene types.

Appendix III Dataset Curation and Annotation

The subsets of the PASCAL-50S and ABSTRACT-50S datasets used in our experiments were carefully curated by two vision + NLP graduate students. The subset of the PASCAL-Context-50S dataset was curated by Amazon Mechanical Turk (AMT) workers. The following describes the dataset curation process for each dataset.

PASCAL-50S: For PASCAL-50S we first obtained sentences that contain one or more of 7 prepositions (i.e., "with", "next to", "on top of", "in front of", "behind", "by", and "on") that intuitively would typically depend on the relative distance between objects. Then we look for sentences that have preposition phrase attachment ambiguities, i.e., sentences where the parser output has different sets of prepositions for different parsings.


[Figure 6 panels: (a) Caption, (b) Input Image, (c) Segmentation GT, (d)-(f) Segmentation Hypotheses #1-#3, (g)-(i) Parse Hypotheses #1-#3.]

Figure 6: In this figure, we illustrate why the MEDIATOR model makes sense for the task of captioned scene understanding. For the caption-image pair (Figure 6a-Figure 6b), we see that parse tree #1 (Figure 6g) shows "standing" (the verb phrase of the noun "dog") connected with "couch" via the "on" preposition, whereas parse trees #2 (Figure 6h) and #3 (Figure 6i) show "woman" connected with "couch" via the "on" preposition. This ambiguity can be resolved if we look at an accurate semantic segmentation such as Hypothesis #3 (Figure 6f) of the associated image (Figure 6b). Likewise, we might be able to do better at semantic segmentation if we choose a segmentation that is consistent with the sentence, such as Segmentation Hypothesis #3 (Figure 6f), which contains a person on a couch with a dog next to them, unlike the other two hypotheses (Figure 6d and Figure 6e).

Due to our focus on PP attachment, we do not pay attention to other parts of the sentence parse, so the parses can change while the PP attachments remain the same, as in Figure 6h and Figure 6i. The sentences thus obtained are further filtered to obtain sentences in which the objects that are connected by the preposition belong to one of the 20 PASCAL object categories. Since our vision module is semantic segmentation and not instance-level segmentation, we restrict the dataset to sentences involving prepositions connecting two different PASCAL categories. Thus, our final dataset contains 100 sentences describing 30 unique images and covers 16 of the 20 PASCAL categories, as described in the paper. We then manually annotated the ground-truth PP attachments. Such manual labeling by student annotators with expertise in NLP takes a lot of time, but results in annotations that are linguistically high-quality, with any inter-human disagreement resolved by strict adherence to rules of grammar.

ABSTRACT-50S: We first obtained sentences that contain one or more of 6 prepositions (i.e., "with", "next to", "on top of", "in front of", "behind", "under"). Due to the semantic differences between the datasets, not all prepositions found in one were present in the other. Further filtering was done to ensure that each sentence contains at least one preposition phrase attachment ambiguity between the clipart noun categories (i.e., each clipart piece has a name, like "snake", that we search the sentence parsing for). This filtering reduced the original dataset of 25,000 sentences and 500 scenes to our final experiment dataset of 399 sentences and 201 scenes. We then manually annotated the ground-truth PP attachments.

PASCAL-Context-50S: For PASCAL-Context-50S, we first selected all sentences that have preposition phrase attachment ambiguities. We then plotted the distribution of prepositions in these sentences (see Figure 8). We found that there was a drop in the percentage of sentences for prepositions that appear in the sorted list after "down". Therefore, we only kept sentences that have one or more 2-D visual prepositions in the list of prepositions up to "down". Thus we ended up with the following 7 prepositions: "on", "with", "next to", "in front of", "by", "near", and "down". We then further sampled sentences to ensure a uniform distribution across prepositions. Unlike PASCAL-50S, we did not filter sentences based on whether the objects connected by the prepositions belong to one of the 60 PASCAL Context categories or not. Instead, we used the word2vec (Mikolov et al., 2013) similarity between the objects in the sentence and the PASCAL Context categories as one of the features. Thus, our final dataset contains 1,822 sentences describing 966 unique images.

The ground-truth PP attachments for these 1,822 sentences were annotated by AMT workers. For each unique prepositional relation in a sentence, we showed the workers the prepositional relation in the form "primary object preposition secondary object", along with its associated image and sentence, and asked them to specify whether the prepositional relation is correct or not correct. We also asked them to choose the third option – "Primary object/secondary object is not a noun in the caption" – in case that happened. The user interface used to collect these annotations is shown in Figure 9. We collected five answers for each prepositional relation. For evaluation, we used the majority response.

Figure 8: We show the percentage of ambiguous sentences in the PASCAL-Context-50S dataset before filtering for prepositions. We found that there was a drop in the percentage of sentences for prepositions that appear in the sorted list after "down". So, for the PASCAL-Context-50S dataset we only keep sentences that have one or more visual prepositions in the list of prepositions up to "down".

We found that 87.11% of human responses agree with the majority response, indicating that even though the AMT workers were not explicitly trained by us in the rules of grammar, there is relatively high inter-human agreement.

Appendix IV Effect of Different Weighting of Modules

So far we have used the "natural" setting of α = 0.5, which gives equal weight to both modules. Note that α is not a parameter of our approach; it is a design choice that the user/experiment-designer makes. To see the effect of weighting the modules differently, we tested our approach for various values of α. Figure 10 shows how the accuracies of each module vary depending on α for the MEDIATOR model on PASCAL-50S and PASCAL-Context-50S. Recall that α is the coefficient for the semantic segmentation module and 1-α is the coefficient for the PPAR resolution module in the loss function. We see that, as expected, putting no or little weight on the PPAR module drastically hurts performance for that module. Our approach is fairly robust to the setting of α: any non-extreme weight performs fairly similarly, with the peak lying somewhere between the extremes. The segmentation module has similar behavior, though it is not as sensitive to the choice of α.


Figure 9: The AMT interface used to collect ground-truth annotations for prepositional relations. Five answers were collected for each prepositional relation. The majority response is used for evaluation. The AMT workers are asked to select if the preposition is correct, not correct, or that the primary or secondary object is not a noun in the caption. Examples for all three answer choices are shown in the instructions presented to the workers.


        "on"    "with"   "next to"   "in front of"   "by"    "near"   "down"
Acc.    64.26   63.30    60.98       56.86           62.81   67.83    67.23
Gain    12.47   15.41    11.18       13.27           14.13   12.82    17.16

Table 4: Performances and gains over the independent baseline on PASCAL-Context-50S for each preposition.

Figure 10: Accuracies of MEDIATOR (both modules) vs. α, where α is the coefficient for the semantic segmentation module and 1-α is the coefficient for the PPAR resolution module in the loss function. Our approach is fairly robust to the setting of α, as long as it is not set to either extreme, since that limits the synergy between the modules. (a) shows the results for PASCAL-50S, and (b) shows the results for PASCAL-Context-50S.

We believe this is because of the small "dynamic range" of this module – the gap between the 1-best and oracle segmentation is smaller, and thus the MEDIATOR can always default to the 1-best as a safe choice.

Appendix V Performances for Each Preposition

We provide performances and gains over the independent baseline on PASCAL-Context-50S for each preposition in Table 4. We see that vision helps all prepositions.

Appendix VI Qualitative Examples

Figure 11 - Figure 16 show qualitative examples from our experiments. Figure 11 - Figure 13 show examples for the multiple-module experiments (semantic segmentation and PPAR), and Figure 14 - Figure 16 show examples for the single-module experiment. In each figure, the top row shows the image and the associated sentence. For the multiple-module figures, the second and third rows show the diverse segmentations of the image, and the bottom two rows show different parsings of the sentence (the last two rows for the single-module examples, as well). In these examples our approach uses 10 diverse solutions for the semantic segmentation module and 10 different solutions for the PPAR module. The highlighted pairs of solutions show the solutions picked by the MEDIATOR model. Examining the results can give you a sense of how the parsings can help the semantic segmentation module pick the best solution and vice versa.


A young couple sit with their puppy on the couch.

Figure 11: Example 1 – multiple modules (SS and PPAR).

A man is in a car next to a man with a bicycle.

Figure 12: Example 2 – multiple modules (SS and PPAR).


Boy lying on couch with a sleeping puppy curled up next to him.

Figure 13: Example 3 – multiple modules (SS and PPAR).

Figure 14: Example 1 – single module (PPAR).


Figure 15: Example 2 – single module (PPAR).

Figure 16: Example 3 – single module (PPAR).


References

[Ahmed et al.2015] Faruk Ahmed, Dany Tarlow, and Dhruv Batra. 2015. Optimizing Expected Intersection-over-Union with Candidate-Constrained CRFs. In ICCV.

[Antol et al.2015] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In ICCV.

[Bach2016] Kent Bach. 2016. Routledge encyclopedia of philosophy entry. http://online.sfsu.edu/kbach/ambguity.html.

[Barnard and Johnson2005] Kobus Barnard and Matthew Johnson. 2005. Word Sense Disambiguation with Pictures. Artificial Intelligence, 167.

[Batra et al.2012] Dhruv Batra, Payman Yadollahpour, Abner Guzman-Rivera, and Gregory Shakhnarovich. 2012. Diverse M-Best Solutions in Markov Random Fields. In ECCV.

[Batra2012] Dhruv Batra. 2012. An Efficient Message-Passing Algorithm for the M-Best MAP Problem. In UAI.

[Berzak et al.2015] Yevgeni Berzak, Andrei Barbu, Daniel Harari, Boris Katz, and Shimon Ullman. 2015. Do You See What I Mean? Visual Resolution of Linguistic Ambiguities. In EMNLP.

[Chen et al.2015] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. 2015. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. In ICLR.

[Davis2016] Ernest Davis. 2016. Notes on ambiguity. http://cs.nyu.edu/faculty/davise/ai/ambiguity.html.

[De Marneffe et al.2006] Marie-Catherine De Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating Typed Dependency Parses from Phrase Structure Parses. In LREC.

[Everingham et al.2010] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. 2010. The Pascal Visual Object Classes (VOC) Challenge. IJCV, 88(2).

[Fang et al.2015] Hao Fang, Saurabh Gupta, Forrest N. Iandola, Rupesh Srivastava, Li Deng, Piotr Dollar, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, and Geoffrey Zweig. 2015. From Captions to Visual Concepts and Back. In CVPR.

[Fidler et al.2013] Sanja Fidler, Abhishek Sharma, and Raquel Urtasun. 2013. A Sentence is Worth a Thousand Pixels. In CVPR.

[Gella et al.2016] Spandana Gella, Mirella Lapata, and Frank Keller. 2016. Unsupervised Visual Sense Disambiguation for Verbs using Multimodal Embeddings. In NAACL HLT.

[Geman et al.2014] Donald Geman, Stuart Geman, Neil Hallonquist, and Laurent Younes. 2014. A Visual Turing Test for Computer Vision Systems. In PNAS.

[Gimpel et al.2013] K. Gimpel, D. Batra, C. Dyer, and G. Shakhnarovich. 2013. A Systematic Exploration of Diversity in Machine Translation. In EMNLP.

[Guzman-Rivera et al.2013] Abner Guzman-Rivera, Pushmeet Kohli, and Dhruv Batra. 2013. DivMCuts: Faster Training of Structural SVMs with Diverse M-Best Cutting-Planes. In AISTATS.

[Heitz et al.2008] Geremy Heitz, Stephen Gould, Ashutosh Saxena, and Daphne Koller. 2008. Cascaded Classification Models: Combining Models for Holistic Scene Understanding. In NIPS.

[Huang and Chiang2005] Liang Huang and David Chiang. 2005. Better k-best Parsing. In IWPT, pages 53–64.

[Kappes et al.2013] Jorg H. Kappes, Bjoern Andres, Fred A. Hamprecht, Christoph Schnorr, Sebastian Nowozin, Dhruv Batra, Sungwoong Kim, Bernhard X. Kausler, Jan Lellmann, Nikos Komodakis, and Carsten Rother. 2013. A Comparative Study of Modern Inference Techniques for Discrete Energy Minimization Problems. In CVPR.

[Kong et al.2014] Chen Kong, Dahua Lin, Mohit Bansal, Raquel Urtasun, and Sanja Fidler. 2014. What are you talking about? Text-to-Image Coreference. In CVPR.

[Lan et al.2012] Tian Lan, Weilong Yang, Yang Wang, and Greg Mori. 2012. Image Retrieval with Structured Object Queries Using Latent Ranking SVM. In ECCV.

[Malinowski and Fritz2014] Mateusz Malinowski and Mario Fritz. 2014. A pooling approach to modelling spatial relations for image retrieval and annotation. arXiv preprint arXiv:1411.5190.

[Malinowski et al.2015] Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. 2015. Ask Your Neurons: A Neural-based Approach to Answering Questions about Images. In ICCV.

[Meltzer et al.2005] Talya Meltzer, Chen Yanover, and Yair Weiss. 2005. Globally Optimal Solutions for Energy Minimization in Stereo Vision Using Reweighted Belief Propagation. In ICCV.

[Mikolov et al.2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In ICLR.

[Mottaghi et al.2014] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. 2014. The Role of Context for Object Detection and Semantic Segmentation in the Wild. In CVPR.

[Poesio and Artstein2005] Massimo Poesio and Ron Artstein. 2005. Annotating (Anaphoric) Ambiguity. In Corpus Linguistics Conference.

[Prasad et al.2014] Adarsh Prasad, Stefanie Jegelka, and Dhruv Batra. 2014. Submodular meets Structured: Finding Diverse Subsets in Exponentially-Large Structured Item Sets. In NIPS.

[Premachandran et al.2014] Vittal Premachandran, Daniel Tarlow, and Dhruv Batra. 2014. Empirical Minimum Bayes Risk Prediction: How to extract an extra few % performance from vision models with just three more parameters. In CVPR.

[Rashtchian et al.2010] Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. 2010. Collecting Image Annotations Using Amazon's Mechanical Turk. In NAACL HLT Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk.

[Ratnaparkhi et al.1994] Adwait Ratnaparkhi, Jeff Reynar, and Salim Roukos. 1994. A Maximum Entropy Model for Prepositional Phrase Attachment. In Proceedings of the Workshop on Human Language Technology. ACL.

[Sun et al.2015] Qing Sun, Ankit Laddha, and Dhruv Batra. 2015. Active Learning for Structured Probabilistic Models With Histogram Approximation. In CVPR.

[Szeliski et al.2008] Richard Szeliski, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir Kolmogorov, Aseem Agarwala, Marshall Tappen, and Carsten Rother. 2008. A Comparative Study of Energy Minimization Methods for Markov Random Fields with Smoothness-Based Priors. PAMI, 30(6).

[Vedantam et al.2014] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2014. CIDEr: Consensus-based Image Description Evaluation. In CVPR.

[Vinyals et al.2015] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and Tell: A Neural Image Caption Generator. In CVPR.

[Yadollahpour et al.2013] Payman Yadollahpour, Dhruv Batra, and Greg Shakhnarovich. 2013. Discriminative Re-ranking of Diverse Segmentations. In CVPR.

[Yatskar et al.2014] Mark Yatskar, Michel Galley, Lucy Vanderwende, and Luke Zettlemoyer. 2014. See No Evil, Say No Evil: Description Generation from Densely Labeled Images. In Lexical and Computational Semantics.

[Yu et al.2015] Licheng Yu, Eunbyung Park, Alexander C. Berg, and Tamara L. Berg. 2015. Visual Madlibs: Fill in the Blank Description Generation and Question Answering. In ICCV.

[Zitnick and Parikh2013] C. Lawrence Zitnick and Devi Parikh. 2013. Bringing Semantics Into Focus Using Visual Abstraction. In CVPR.