
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7642–7651, July 5–10, 2020. ©2020 Association for Computational Linguistics


Cross-Modality Relevance for Reasoning on Language and Vision

Chen Zheng, Quan Guo, Parisa Kordjamshidi
Department of Computer Science and Engineering, Michigan State University

East Lansing, MI 48824
{zhengc12,guoquan,kordjams}@msu.edu

Abstract

This work deals with the challenge of learning and reasoning over language and vision data for related downstream tasks such as visual question answering (VQA) and natural language for visual reasoning (NLVR). We design a novel cross-modality relevance module that is used in an end-to-end framework to learn the relevance representation between components of various input modalities under the supervision of a target task, which is more generalizable to unobserved data compared to merely reshaping the original representation space. In addition to modeling the relevance between textual entities and visual entities, we model the higher-order relevance between entity relations in the text and object relations in the image. Our proposed approach shows competitive performance on two different language and vision tasks using public benchmarks and improves the state-of-the-art published results. The alignments of the input spaces and the relevance representations learned on the NLVR task also boost the training efficiency of the VQA task.

1 Introduction

Real-world problems often involve data from multiple modalities and resources. Solving the problem at hand usually requires the ability to reason about the components across all the involved modalities. Examples of such tasks are visual question answering (VQA) (Antol et al., 2015; Goyal et al., 2017) and natural language visual reasoning (NLVR) (Suhr et al., 2017, 2018). One key to intelligence here is to identify the relations between the modalities, combine them, and reason over them for decision making. Deep learning is a prominent technique for learning representations of the data for decision making in various target tasks. It has achieved strong performance based on large-scale corpora (Devlin et al., 2019). However, it is a challenge to learn joint representations for cross-modality data because deep learning is data-hungry.

There are many recent efforts to build such multi-modality datasets (Lin et al., 2014; Krishna et al., 2017; Johnson et al., 2017; Antol et al., 2015; Suhr et al., 2017; Goyal et al., 2017; Suhr et al., 2018). Researchers develop models by joining features, aligning representation spaces, and using Transformers (Li et al., 2019b; Tan and Bansal, 2019). However, generalizability is still an issue when operating on unobserved data. It is hard for deep learning models to capture high-order patterns of reasoning, which is essential for their generalizability.

There are several challenging research directions for learning representations of cross-modality data and enabling reasoning for target tasks. The first is the alignment of the representation spaces of multiple modalities; the second is designing architectures with the ability to capture high-order relations for generalizability of reasoning; the third is using pre-trained modules to make the most of minimal data.

An orthogonal direction to the above-mentioned aspects of learning is finding the relevance between the components and the structure of various modalities when working with multi-modal data. Most of the previous language and visual reasoning models try to capture this relevance by learning representations based on an attention mechanism. Finding relevance, known as matching, is a fundamental task in information retrieval (IR) (Mitra et al., 2017). Benefiting from matching, Transformer models gain an excellent ability to index, retrieve, and combine features of underlying instances by a matching score (Vaswani et al., 2017), which leads to state-of-the-art performance in various tasks (Devlin et al., 2019). However, the matching in the attention mechanism is used to learn a set of weights that highlight the importance of various components.


In our proposed model, we learn representations directly based on the relevance score, inspired by ideas from IR models. In contrast to the attention mechanism and Transformer models, we claim that the relevance patterns themselves are just as important. With proper alignment of the representation spaces of different input modalities, matching can be applied to those spaces. The idea of learning relevance patterns is similar to Siamese networks (Koch et al., 2015), which learn transferable patterns of similarity between two image representations for one-shot image recognition. A similarity metric between two modalities has also been shown to be helpful for aligning multiple modality spaces (Frome et al., 2013).

The contributions of this work are as follows: 1) We propose a cross-modality relevance (CMR) framework that considers entity relevance and high-order relational relevance between the two modalities with an alignment of representation spaces. The model can be trained end-to-end with customizable target tasks. 2) We evaluate the method and analyze the results on both the VQA and NLVR tasks using the VQA v2.0 and NLVR2 datasets, respectively. We improve over the published state-of-the-art results on both tasks. Our analysis shows the significance of the relevance patterns for reasoning, and the CMR model trained on NLVR2 boosts the training efficiency of the VQA task.

2 Related Work

Language and Vision Tasks. Learning and decision making based on natural language and visual information has attracted the attention of many researchers because it exposes many interesting research challenges to the AI community. Among many other efforts (Lin et al., 2014; Krishna et al., 2017; Johnson et al., 2017), Antol et al. proposed the VQA challenge, which contains open-ended questions about images that require an understanding of, and reasoning about, language and visual components. Suhr et al. proposed the NLVR task, which asks models to determine whether a sentence is true based on the image.

Attention Based Representation. Transformers are stacked self-attention models for general-purpose sequence representation (Vaswani et al., 2017). They have achieved extraordinary success in natural language processing, not only for better results but also for efficiency due to their parallel computation.

Self-attention is a mechanism that reshapes the representations of components based on relevance scores. It has been shown to be effective in generating contextualized representations for text entities. More importantly, there are several efforts to pre-train huge Transformers on large-scale corpora (Devlin et al., 2019; Yang et al., 2019; Radford et al., 2019) and multiple popular tasks, which makes it possible to exploit them and perform other tasks with small corpora. Researchers have also extended Transformers to cover both textual and visual modalities (Li et al., 2019b; Sun et al., 2019; Tan and Bansal, 2019; Su et al., 2020; Tsai et al., 2019). Sophisticated pre-training strategies were introduced to boost performance (Tan and Bansal, 2019). However, as mentioned above, modeling relations between components is still a challenge for approaches that reshape the entity representation space, while the relevance score can be more expressive for these relations. In our CMR framework, we model high-order relations in the relevance representation space rather than the entity representation space.

Matching Models. Matching is a fundamental task in information retrieval (IR). There are IR models that focus on global representation matching (Huang et al., 2013; Shen et al., 2014), local component (a.k.a. term) matching (Guo et al., 2016; Pang et al., 2016), and hybrid methods (Mitra et al., 2017). Our relevance framework is partially inspired by local component matching, which we apply here to model the relevance of the components of the model's inputs. However, our work differs in several significant ways. First, we work in a cross-modality setting. Second, we extend the relevance to a high order, i.e., we model the relevance of entity relations. Third, our framework can work with different target tasks, and we show that the parameters trained on one task can boost the training of another.

3 Cross-Modality Relevance

Cross-Modality Relevance (CMR) aims to establish a framework for general-purpose relevance in various tasks. As an end-to-end model, it encodes the relevance between the components of the input modalities under task-specific supervision. We further add a high-order relevance between relations that occur in each modality.

Figure 1 shows the proposed architecture. We first encode data from different modalities with single-modality Transformers and align the encoding spaces by a cross-modality Transformer.


Figure 1: The Cross-Modality Relevance model is composed of a single-modality transformer, a cross-modality transformer, entity relevance, and high-order relational relevance, followed by a task-specific classifier.

We consistently refer to the words in the text and the objects in the images (i.e., bounding boxes in images) as "entities" and to their representations as "Entity Representations". We use the relevance between the components of the two modalities to model the relation between them. This relevance includes the relevance between their entities, shown as "Entity Relevance", and the high-order relevance between their relations, shown as "Relational Relevance". We learn representations of the affinity matrix of relevance scores with convolutional layers and fully-connected layers. Finally, we predict the output by a non-linear mapping over all the relevance representations. This architecture can help to solve tasks that require reasoning over two modalities based on their relevance. We argue that the parameters trained on one task can boost the training of other tasks that deal with multi-modality reasoning.

In this section, we first formulate the problem. Then we describe our cross-modality relevance (CMR) model for solving it. The architecture, loss function, and training procedure of CMR are explained in detail. We use the VQA and NLVR tasks as showcases.

3.1 Problem Formulation

Formally, the problem is to model a mapping from a cross-modality data sample $D = \{D^\mu\}$ to an output $y$ in a target task, where $\mu$ denotes the type of modality and $D^\mu = \{d^\mu_1, \dots, d^\mu_{N^\mu}\}$ is the set of entities in modality $\mu$. In visual question answering (VQA), the task is to predict an answer given two modalities, a textual question ($D^t$) and a visual image ($D^v$). In NLVR, given a textual statement ($D^t$) and an image ($D^v$), the task is to determine the correctness of the textual statement.

3.2 Representation Spaces Alignment

Single Modality Representations. For the textual modality $D^t$, we utilize BERT (Devlin et al., 2019), as shown in the bottom-left part of Figure 1, which is a multi-layer Transformer (Vaswani et al., 2017) with three inputs: WordPiece embeddings (Wu et al., 2016), segment embeddings, and position embeddings. We refer to the words as the entities of the textual modality and use the BERT representations as the textual single-modality representations $\{s^t_1, \dots, s^t_{N^t}\}$, assuming $N^t$ words as textual entities.

For the visual modality $D^v$, as shown in the top-left part of Figure 1, Faster-RCNN (Ren et al., 2015) is used to generate regions of interest (ROIs), extract dense encoding representations of the ROIs, and predict the probability of each ROI. We refer to the ROIs on images as the visual entities $\{d^v_1, \dots, d^v_{N^v}\}$ and each time keep a fixed number $N^v$ of visual entities with the highest probabilities predicted by Faster-RCNN. The dense representation of each ROI is a local latent representation given by a 2048-dimensional vector (Ren et al., 2015). To enrich the visual entity representation with visual context, we further project these vectors with feed-forward layers and encode them with a single-modality Transformer, as shown in the second column of Figure 1. The visual Transformer takes the dense representation, segment embedding, and pixel position embedding (Tan and Bansal, 2019) as input and generates the single-modality representations $\{s^v_1, \dots, s^v_{N^v}\}$.
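A minimal sketch of the two single-modality encoders is given below. It makes simplifying assumptions (ROI features are precomputed, and the pixel position embedding is reduced to a projection of box coordinates), and the class and variable names are illustrative rather than the authors' released implementation.

```python
# Minimal sketch of the single-modality encoders (illustrative; not the released code).
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

d = 768  # shared dimension of all single-modality representations

class VisualEncoder(nn.Module):
    """Projects 2048-d Faster-RCNN ROI features to d and contextualizes them."""
    def __init__(self, num_layers=5, num_heads=12):
        super().__init__()
        self.feat_proj = nn.Linear(2048, d)   # dense ROI feature -> d
        self.pos_proj = nn.Linear(4, d)       # normalized box coordinates -> d (simplified position embedding)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, roi_feats, boxes):
        # roi_feats: (B, Nv, 2048), boxes: (B, Nv, 4)
        x = self.feat_proj(roi_feats) + self.pos_proj(boxes)
        return self.encoder(x)                # (B, Nv, d) visual entity representations

# Textual entities: token representations from a frozen BERT "base" model.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()
for p in bert.parameters():
    p.requires_grad = False

tokens = tokenizer("The bird on the branch is looking to left",
                   padding="max_length", max_length=20, truncation=True,
                   return_tensors="pt")
with torch.no_grad():
    s_text = bert(**tokens).last_hidden_state          # (1, 20, d) textual entities

s_vis = VisualEncoder()(torch.randn(1, 36, 2048), torch.rand(1, 36, 4))  # (1, 36, d)
```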


In case there are multiple images, for example the NLVR data (NLVR2) has two images in each example, each image is encoded by the same procedure and we keep $N^v$ visual entities per image. We refer to this as different sources of the same modality throughout the paper. We restrict all the single-modality representations to be vectors of the same dimension $d$. However, these original representation spaces still need to be aligned.

Cross-Modality Alignment. To align the single-modality representations in a unified representation space, we introduce a cross-modality Transformer, as shown in the third column of Figure 1. All the entities are treated uniformly in this Transformer. Given the set of entity representations from all modalities, we define the matrix of all its elements, $S = [s^t_1, \dots, s^t_{N^t}, s^v_1, \dots, s^v_{N^v}] \in \mathbb{R}^{d \times (N^t + N^v)}$. Each cross-modality self-attention is computed as follows (Vaswani et al., 2017):

$$\mathrm{Attention}(K, Q, V) = \mathrm{softmax}\!\left(\frac{K^\top Q}{\sqrt{d}}\right) V, \qquad (1)$$

where in our case the key $K$, query $Q$, and value $V$ are all the same tensor $S$, and $\mathrm{softmax}(\cdot)$ normalizes along the columns. A cross-modality Transformer layer consists of a cross-modality self-attention representation, followed by a residual connection with normalization from the input representation, a feed-forward layer, and another residual connection with normalization. We stack several cross-modality Transformer layers to obtain a uniform representation over all modalities. We refer to the resulting unified representations as the entity representations and denote the set of entity representations of all entities as $\{s'^t_1, \dots, s'^t_{N^t}, s'^v_1, \dots, s'^v_{N^v}\}$. Although the representations are still organized by their original modality per entity, they carry information from the interactions with the other modality and are aligned in a uniform representation space. The entity representations, shown in the fourth column of Figure 1, alleviate the gap between representations from different modalities, as we will show in the ablation studies, and allow them to be matched in the following steps.

Footnote 1: Please note that we keep the usual notation of the attention mechanism for this equation; the notation might be overloaded in other parts of the paper.
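For clarity, here is a minimal, single-head rendering of the self-attention step in Equation 1 over the concatenated textual and visual entities. Entities are stored as rows for convenience, whereas the paper stacks them as columns of $S$; residual connections, layer normalization, feed-forward blocks, and multiple heads are omitted, so this is an illustration rather than the authors' implementation.

```python
# Single-head sketch of the cross-modality self-attention in Equation 1 (illustrative).
import math
import torch
import torch.nn.functional as F

def cross_modality_self_attention(S):
    # S: (Nt + Nv, d) concatenated textual and visual entity representations
    K = Q = V = S
    scores = Q @ K.transpose(0, 1) / math.sqrt(S.size(-1))   # pairwise scores over all entities
    weights = F.softmax(scores, dim=-1)                      # normalize over the attended entities
    return weights @ V                                       # (Nt + Nv, d) updated representations

s_text = torch.randn(20, 768)            # textual entity representations
s_vis = torch.randn(36, 768)             # visual entity representations
S = torch.cat([s_text, s_vis], dim=0)    # attention then spans both modalities
entity_repr = cross_modality_self_attention(S)
```

A full cross-modality Transformer layer would wrap this step with residual connections, layer normalization, and a feed-forward block (for instance via torch.nn.TransformerEncoderLayer), and several such layers are stacked.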

3.3 Entity Relevance

Relevance plays a critical role in reasoning ability, which is required in many tasks such as information retrieval, question answering, and intra- and inter-modality reasoning. Relevance patterns are independent of the input representation space and can generalize better to unobserved data. To consider the entity relevance between two modalities $D^\mu$ and $D^\nu$, the entity relevance representation is calculated as shown in Figure 1. Given entity representation matrices $S'^\mu = [s'^\mu_1, \dots, s'^\mu_{N^\mu}] \in \mathbb{R}^{d \times N^\mu}$ and $S'^\nu = [s'^\nu_1, \dots, s'^\nu_{N^\nu}] \in \mathbb{R}^{d \times N^\nu}$, the relevance representation is calculated by

$$A^{\mu,\nu} = (S'^\mu)^\top S'^\nu, \qquad \text{(2a)}$$
$$M(D^\mu, D^\nu) = \mathrm{CNN}^{D^\mu,D^\nu}(A^{\mu,\nu}), \qquad \text{(2b)}$$

where $A^{\mu,\nu}$ is the affinity matrix of the two modalities, shown on the right side of Figure 1, and $A^{\mu,\nu}_{ij}$ is the relevance score of the $i$th entity in $D^\mu$ and the $j$th entity in $D^\nu$. $\mathrm{CNN}^{D^\mu,D^\nu}(\cdot)$ is a CNN, corresponding to the sixth column of Figure 1, which contains several convolutional layers followed by fully connected layers. Each convolutional layer is followed by a max-pooling layer, and the fully connected layers finally map the flattened feature maps to a $d$-dimensional vector. We refer to $\Phi^{D^\mu,D^\nu} = M(D^\mu, D^\nu)$ as the entity relevance representation between $\mu$ and $\nu$.

We compute the relevance between different modalities. For the modalities considered in this work, when there are multiple images in the visual modality, we also calculate the relevance representation between them. In particular, for the VQA dataset, the above setting results in one entity relevance representation: a textual-visual entity relevance $\Phi^{D^t,D^v}$. For the NLVR2 dataset, there are three entity relevance representations: two textual-visual entity relevances $\Phi^{D^t,D^{v_1}}$ and $\Phi^{D^t,D^{v_2}}$, and a visual-visual entity relevance $\Phi^{D^{v_1},D^{v_2}}$ between the two images. Entity relevance representations are flattened and joined with the other features in the next layer of the network.
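A minimal sketch of Equations 2a-2b is given below: the affinity matrix between two aligned entity sets is treated as a one-channel image and summarized by a small CNN into a $d$-dimensional relevance vector. The channel counts, kernel sizes, and number of layers are illustrative assumptions; only the overall recipe follows the description above.

```python
# Sketch of the entity relevance representation (Equations 2a-2b); layer sizes are illustrative.
import torch
import torch.nn as nn

class EntityRelevance(nn.Module):
    def __init__(self, n_mu=20, n_nu=36, d=768):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        flat = 16 * (n_mu // 4) * (n_nu // 4)           # feature-map size after two poolings
        self.fc = nn.Sequential(nn.Linear(flat, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, S_mu, S_nu):
        # S_mu: (B, N_mu, d), S_nu: (B, N_nu, d) aligned entity representations
        A = S_mu @ S_nu.transpose(1, 2)                 # (B, N_mu, N_nu) affinity matrix, Eq. 2a
        feat = self.conv(A.unsqueeze(1))                # treat the affinity matrix as a 1-channel image
        return self.fc(feat.flatten(1))                 # (B, d) entity relevance representation, Eq. 2b

phi_tv = EntityRelevance()(torch.randn(2, 20, 768), torch.randn(2, 36, 768))  # text-image relevance
```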

3.4 Relational Relevance

We also consider relevance beyond entities, that is, the relevance of the entities' relations. This extension allows CMR to capture higher-order relevance patterns. We consider pair-wise, non-directional relations between entities in each modality and calculate the relevance of the relations across modalities.


Figure 2: Relational Relevance is the relevance of the top-K relations in terms of the intra-modality relevance score and the inter-modality importance.

The procedure is similar to that of the entity relevance shown in Figure 1. We denote the relational representation as a non-linear mapping $\mathbb{R}^{2d} \to \mathbb{R}^d$, modeled by fully-connected layers over the concatenation of the representations of the two entities in the relation, $r^\mu_{(i,j)} = \mathrm{MLP}^{\mu,1}([s'^\mu_i, s'^\mu_j]) \in \mathbb{R}^d$. A relational relevance affinity matrix can then be calculated by matching the relational representations $\{r^\mu_{(i,j)}, \forall i \neq j\}$ from different modalities. However, there are $\binom{N^\mu}{2}$ possible pairs in each modality $D^\mu$, most of which are irrelevant. The relational relevance representations would be sparse because of the irrelevant pairs on both sides, and computing the relevance scores of all possible pairs would introduce a large number of unnecessary parameters, which makes training more difficult.

We propose to rank the relation candidates (i.e., pairs) by an intra-modality relevance score and an inter-modality importance. Then we compare the top-K ranked relation candidates between two modalities, as shown in Figure 2. For the intra-modality relevance score, shown in the bottom-left part of the figure, we estimate a normalized score based on the relational representation with a softmax layer:

$$U^\mu_{(i,j)} = \frac{\exp\left(\mathrm{MLP}^{\mu,2}\left(r^\mu_{(i,j)}\right)\right)}{\sum_{k \neq l} \exp\left(\mathrm{MLP}^{\mu,2}\left(r^\mu_{(k,l)}\right)\right)}. \qquad (3)$$

To evaluate the inter-modality importance of a relation candidate, which is a pair of entities in the same modality, we first compute the relevance of each entity in the text with respect to the visual objects. As shown in Figure 2, we project a vector that includes the most relevant visual object for each word and denote this importance vector as $v^t$. This helps to focus on words that are grounded in the visual modality. We use the same procedure to compute the most relevant words for each visual object.

Then we calculate the relation candidate importance matrix $V^\mu$ by an outer product, $\otimes$, of the importance vectors as follows:

$$v^\mu_i = \max_j A^{\mu,\nu}_{ij}, \qquad \text{(4a)}$$
$$V^\mu = v^\mu \otimes v^\mu, \qquad \text{(4b)}$$

where $v^\mu_i$ is the $i$th scalar element of $v^\mu$ that corresponds to the $i$th entity, and $A^{\mu,\nu}$ is the affinity matrix calculated by Equation 2a.

Notice that the inter-modality importance $V^\mu$ is symmetric. The upper triangular part of $V^\mu$, excluding the diagonal, indicates the importance of the corresponding elements with the same index in the intra-modality relevance scores $U^\mu$. The ranking score of a candidate is the combination (here the product) of the two scores, $W^\mu_{(i,j)} = U^\mu_{(i,j)} \times V^\mu_{ij}$.

We select the set of top-K ranked candidate relations $\mathcal{K}^\mu = \{\kappa_1, \kappa_2, \dots, \kappa_K\}$ and reorganize the representations of the top-K relations as $R^\mu = [r^\mu_{\kappa_1}, \dots, r^\mu_{\kappa_K}] \in \mathbb{R}^{d \times K}$. The relational relevance representation between $\mathcal{K}^\mu$ and $\mathcal{K}^\nu$ is calculated similarly to the entity relevance representation shown in Figure 1:

$$M(\mathcal{K}^\mu, \mathcal{K}^\nu) = \mathrm{CNN}^{\mathcal{K}^\mu,\mathcal{K}^\nu}\left((R^\mu)^\top R^\nu\right). \qquad (5)$$

$M(\mathcal{K}^\mu, \mathcal{K}^\nu)$ has its own parameters and produces a $d$-dimensional feature $\Phi^{\mathcal{K}^\mu,\mathcal{K}^\nu}$.

In particular, for the VQA task, the above setting results in one relational relevance representation: a textual-visual relational relevance $M(\mathcal{K}^t, \mathcal{K}^v)$. For the NLVR task, there are three relational relevance representations: two textual-visual relational relevances $M(\mathcal{K}^t, \mathcal{K}^{v_1})$ and $M(\mathcal{K}^t, \mathcal{K}^{v_2})$, and a visual-visual relational relevance $M(\mathcal{K}^{v_1}, \mathcal{K}^{v_2})$ between the two images. Relational relevance representations are flattened and joined with the other features in the next layers of the network.

After acquiring all the entity and relational relevance representations, namely $\Phi^{D^\mu,D^\nu}$ and $\Phi^{\mathcal{K}^\mu,\mathcal{K}^\nu}$, we concatenate them and use the result as the final feature $\Phi = [\Phi^{D^\mu,D^\nu}, \dots, \Phi^{\mathcal{K}^\mu,\mathcal{K}^\nu}, \dots]$. A task-specific classifier $\mathrm{MLP}^{\Phi}(\Phi)$ predicts the output of the target task, as shown in the right-most column of Figure 1.


3.5 Training

End-to-end Training. CMR can be considered an end-to-end relevance representation extractor. We simply predict the output $y$ of a specific task from the final feature $\Phi$ with a differentiable regression or classification function. The gradient of the loss function is back-propagated to all the components of CMR to penalize the prediction and adjust the parameters. We freeze the parameters of the basic feature extractors, namely BERT for the textual modality and Faster-RCNN for the visual modality. The parameters of the following parts are updated by gradient descent: the single-modality Transformers (except BERT), the cross-modality Transformers, $\mathrm{CNN}^{D^\mu,D^\nu}(\cdot)$, $\mathrm{CNN}^{\mathcal{K}^\mu,\mathcal{K}^\nu}(\cdot)$, $\mathrm{MLP}^{\mu,1}(\cdot)$, and $\mathrm{MLP}^{\mu,2}(\cdot)$ for all modalities and modality pairs, and the task-specific classifier $\mathrm{MLP}^{\Phi}(\Phi)$.

The VQA task can be formulated as a multi-class classification that chooses a word to answer the question; we apply a softmax classifier on $\Phi$ and penalize it with the cross-entropy loss. For the NLVR2 dataset, the task is a binary classification that determines whether the statement is correct regarding the images; we apply a logistic regression on $\Phi$ and penalize it with the cross-entropy loss.
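A sketch of the corresponding objectives and parameter freezing is shown below; the logits and label tensors are placeholders that only make the loss choices from the paragraph above concrete.

```python
# Sketch of the task objectives and parameter freezing (placeholders, not the authors' code).
import torch
import torch.nn as nn

def freeze(module):
    # Basic feature extractors (BERT; Faster-RCNN features are precomputed) are not updated.
    for p in module.parameters():
        p.requires_grad = False

vqa_criterion = nn.CrossEntropyLoss()     # VQA: multi-class choice of an answer word
nlvr_criterion = nn.BCEWithLogitsLoss()   # NLVR2: binary correctness of the statement

vqa_logits = torch.randn(32, 1000, requires_grad=True)   # placeholder MLP^Phi outputs for VQA
answer_ids = torch.randint(0, 1000, (32,))
vqa_loss = vqa_criterion(vqa_logits, answer_ids)

nlvr_logit = torch.randn(32, 1, requires_grad=True)      # placeholder MLP^Phi output for NLVR2
labels = torch.randint(0, 2, (32,))
nlvr_loss = nlvr_criterion(nlvr_logit.squeeze(-1), labels.float())
```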

Pre-training Strategy. To leverage pre-trained parameters for our cross-modality Transformer and relevance representations, we use the following training settings. For all tasks, we freeze the parameters of BERT and Faster-RCNN. We use pre-trained parameters for the (visual) single-modality Transformer, as proposed by Tan and Bansal (2019), and allow them to be fine-tuned in the following procedure. We then randomly initialize and train all the parameters of the model on NLVR with the NLVR2 dataset. After that, we keep and fine-tune all the parameters on the VQA task with the VQA v2.0 dataset (see the data description in Section 4.1). In this way, the parameters of the cross-modality Transformer and the relevance representations, pre-trained on the NLVR2 dataset, are reused and fine-tuned on the VQA dataset. Only the final task-specific classifier that takes the input features $\Phi$ is initialized randomly. The pre-trained cross-modality Transformer and relevance representations help the VQA model converge faster and achieve competitive performance compared to the state-of-the-art results.
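In code, the NLVR2-to-VQA transfer amounts to loading the NLVR2-trained weights and re-initializing only the task head. The tiny `CMR` class, checkpoint path, and attribute names below are hypothetical placeholders that mirror the procedure, not the authors' actual implementation.

```python
# Sketch of the NLVR2 -> VQA transfer; class, path, and attribute names are hypothetical.
import torch
import torch.nn as nn

class CMR(nn.Module):
    def __init__(self, feat_dim=6 * 768, num_classes=1):
        super().__init__()
        self.shared = nn.Linear(768, feat_dim)                  # stands in for all shared CMR components
        self.task_classifier = nn.Linear(feat_dim, num_classes)

# Pretend this checkpoint holds the parameters trained on NLVR2.
nlvr_model = CMR(num_classes=1)                                 # binary NLVR2 head
torch.save(nlvr_model.state_dict(), "cmr_nlvr2_checkpoint.pt")

# VQA model: reuse every shared parameter, re-initialize only the task-specific classifier.
vqa_model = CMR(num_classes=1000)                               # placeholder VQA answer vocabulary size
state = torch.load("cmr_nlvr2_checkpoint.pt", map_location="cpu")
state = {k: v for k, v in state.items() if not k.startswith("task_classifier")}
vqa_model.load_state_dict(state, strict=False)                  # only the shared parameters are restored
```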

4 Experiments and Results

4.1 Data Description

NLVR2 (Suhr et al., 2018) is a dataset that targets joint reasoning about natural language descriptions and related images. Given a textual statement and a pair of images, the task is to indicate whether the statement correctly describes the two images. NLVR2 contains 107,292 examples of sentences paired with visual images and is designed to emphasize semantic diversity, compositionality, and visual reasoning challenges.

VQA v2.0 (Goyal et al., 2017) is an extended version of the VQA dataset. It contains 204,721 images from MS COCO (Lin et al., 2014), paired with 1,105,904 free-form, open-ended natural language questions and answers. These questions are divided into three categories: Yes/No, Number, and Other.

4.2 Implementation Details

We implemented CMR using PyTorch (see Footnote 2). We use 768-dimensional single-modality representations. For the textual modality, the pre-trained BERT "base" model (Devlin et al., 2019) is used to generate the single-modality representation. For the visual modality, we use the Faster-RCNN pre-trained by Anderson et al., followed by a five-layer Transformer. The parameters of BERT and Faster-RCNN are fixed. For each example, we keep 20 words as textual entities and 36 ROIs per image as visual entities. For the relational relevance, the top-10 ranked pairs are used. For each relevance CNN, $\mathrm{CNN}^{D^\mu,D^\nu}(\cdot)$ and $\mathrm{CNN}^{\mathcal{K}^\mu,\mathcal{K}^\nu}(\cdot)$, we use two convolutional layers, each followed by max-pooling, and fully connected layers. For the relational representations and their intra-modality relevance scores, $\mathrm{MLP}^{\mu,1}(\cdot)$ and $\mathrm{MLP}^{\mu,2}(\cdot)$, we use one hidden layer each. The task-specific classifier $\mathrm{MLP}^{\Phi}(\Phi)$ contains three hidden layers. The model is optimized using the Adam optimizer with $\alpha = 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-6}$. The model is trained with a weight decay of 0.01, a max gradient norm of 1.0, and a batch size of 32.
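The optimization setup can be wired up roughly as follows; the `model` and `loader` objects are trivial placeholders, so this sketch only illustrates the listed hyperparameters rather than the authors' training script.

```python
# Illustrative wiring of the optimization hyperparameters listed above (placeholder model/data).
import torch
import torch.nn as nn

model = nn.Linear(768, 2)                                    # placeholder for the trainable parts of CMR
criterion = nn.CrossEntropyLoss()
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4, betas=(0.9, 0.999),
                             eps=1e-6, weight_decay=0.01)

loader = [(torch.randn(32, 768), torch.randint(0, 2, (32,))) for _ in range(3)]  # batch size 32
for features, labels in loader:
    loss = criterion(model(features), labels)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(trainable, max_norm=1.0)                      # max gradient norm 1.0
    optimizer.step()
```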

4.3 Baseline Description

VisualBERT (Li et al., 2019b) is an end-to-end model for language and vision tasks that consists of Transformer layers aligning the textual and visual representation spaces with self-attention.

Footnote 2: Our code and data are available at https://github.com/HLR/Cross_Modality_Relevance.


Models        Dev%   Test%
N2NMN         51.0   51.1
MAC-Network   50.8   51.4
FiLM          51.0   52.1
CNN+RNN       53.4   52.4
VisualBERT    67.4   67.0
LXMERT        74.9   74.5
CMR           75.4   75.3

Table 1: Accuracy on NLVR2.

VisualBERT and CMR have a similar cross-modality alignment approach. However, VisualBERT only uses the Transformer representations, while CMR uses the relevance representations.

LXMERT (Tan and Bansal, 2019) aims to learn cross-modality encoder representations from Transformers. It pre-trains the model on a set of tasks and fine-tunes it on another set of specific tasks. LXMERT is the currently published state-of-the-art on both NLVR2 and VQA v2.0.

4.4 Results

NLVR2: The results on the NLVR task are listed in Table 1. Transformer-based models (VisualBERT, LXMERT, and CMR) outperform the other models (N2NMN (Hu et al., 2017), MAC (Hudson and Manning, 2018), and FiLM (Perez et al., 2018)) by a large margin. This is due to the strong pre-trained single-modality representations and the Transformers' ability to reshape the representations and align the spaces. Furthermore, CMR shows the best performance among all Transformer-based baseline methods and achieves the state-of-the-art. VisualBERT and CMR have a similar cross-modality alignment approach, and CMR outperforms VisualBERT by 12.4%. The gain mainly comes from the entity relevance and relational relevance that model the relations.

VQA v2.0: In Table 2, we show the comparison with published models, excluding ensembles. The most competitive models are based on Transformers (ViLBERT (Lu et al., 2019), VisualBERT (Li et al., 2019b), VL-BERT (Su et al., 2020), LXMERT (Tan and Bansal, 2019), and CMR). BUTD (Anderson et al., 2018; Teney et al., 2018), ReGAT (Li et al., 2019a), and BAN (Kim et al., 2018) also employ attention mechanisms to build relation-aware models. The proposed CMR achieves the best test accuracy on Y/N questions and Other questions. However, CMR does not achieve the best performance on Number questions.

Model        Dev%     Test Standard%
             Overall  Y/N    Num    Other  Overall
BUTD         65.32    81.82  44.21  56.05  65.67
ReGAT        70.27    86.08  54.42  60.33  70.58
ViLBERT      70.55    -      -      -      70.92
VisualBERT   70.80    -      -      -      71.00
BAN          71.4     87.22  54.37  62.45  71.84
VL-BERT      71.79    87.94  54.75  62.54  72.22
LXMERT       72.5     87.97  54.94  63.13  72.54
CMR          72.58    88.14  54.71  63.16  72.60

Table 2: Accuracy on VQA v2.0.

This is because Number questions require the ability to count within one modality, while CMR focuses on modeling relations between modalities. Performance on counting might be improved by explicitly modeling quantity representations. CMR also achieves the best overall accuracy. In particular, we see a 2.3% improvement over VisualBERT (Li et al., 2019b), consistent with the above-mentioned NLVR2 results. This shows the significance of the entity and relational relevance.

Another observation is that, if we train CMR for the VQA task from scratch with random initialization, while still using the fixed BERT and Faster-RCNN, the model converges after 20 epochs. When we initialize the parameters with the model trained on NLVR2, it takes 6 epochs to converge. This significant improvement in convergence speed indicates that the optimal model for VQA is close to that of NLVR.

5 Analysis

5.1 Model Size

To investigate the influence of model size, we empirically evaluate CMR on NLVR2 with various Transformer sizes, which account for most of the parameters of the model. All other details are kept the same as described in Section 4.2. The textual Transformer remains at 12 layers because it is the pre-trained BERT. Our model contains 285M parameters, of which around 230M belong to the pre-trained BERT and Transformers. Table 3 shows the results. As we increase the number of layers in the visual Transformer and the cross-modality Transformer, accuracy tends to improve. However, the performance becomes stable with more than five layers. We therefore use five layers for the visual Transformer and the cross-modality Transformer in the other experiments.


Textual  Visual  Cross  Dev%   Test%
12       3       3      74.1   74.4
12       4       4      74.9   74.7
12       5       5      75.4   75.3
12       6       6      75.5   75.1

Table 3: Accuracy on NLVR2 of CMR with various Transformer sizes. The numbers in the left part of the table indicate the number of self-attention layers.

Models                               Dev%   Test%
CMR                                  75.4   75.3
without Single-Modality Transformer  68.2   68.5
without Cross-Modality Transformer   59.7   59.1
without Entity Relevance             70.6   71.2
without Relational Relevance         73.0   73.4

Table 4: Test accuracy of different variations of CMR on NLVR2.

5.2 Ablation Studies

To better understand the influence of each part of CMR, we perform ablation studies. Table 4 shows the performance of four variations on NLVR2.

Effect of Single-Modality Transformer. We remove both the textual and visual single-modality Transformers and instead map the raw input to the d-dimensional space with a linear transformation. Notice that the raw input of the textual modality is the WordPiece (Wu et al., 2016) embeddings, segment embeddings, and position embeddings of each word, while that of the visual modality is the 2048-dimensional dense representation of each ROI extracted by Faster-RCNN. Removing the single-modality Transformers decreases accuracy by 9.0%. Single-modality Transformers play a critical role in producing a strong contextualized representation for each modality.

Effect of Cross-Modality Transformer. We remove the cross-modality Transformer and use the single-modality representations as entity representations. As shown in Table 4, the model degenerates dramatically, and accuracy decreases by 16.2%. This large gap demonstrates the unparalleled contribution of the cross-modality Transformer to aligning the representation spaces of the input modalities.

Effect of Entity Relevance. We remove the entity relevance representation $\Phi^{D^\mu,D^\nu}$ from the final feature $\Phi$. As shown in Table 4, the test accuracy is reduced by 5.4%. This is a significant performance difference among Transformer-based models (Li et al., 2019b; Lu et al., 2019; Tan and Bansal, 2019).

Figure 3: The entity affinity matrix between the textual (rows) and visual (columns) modalities for the example sentence "The bird on the branch is looking to left". A darker color indicates a higher relevance score. The ROIs with the maximum relevance score for each word are shown paired with the words.

Figure 4: The relation ranking scores of two example sentences. A darker color indicates a higher ranking score.

To highlight the significance of entity relevance, we visualize an example affinity matrix in Figure 3. The two major entities, "bird" and "branch", are matched perfectly. More interestingly, the three ROIs matching the phrase "looking to left" capture an indicator (the beak), a direction (left), and the semantics of the whole phrase.

Effect of Relational Relevance. We remove the relational relevance representation $\Phi^{\mathcal{K}^\mu,\mathcal{K}^\nu}$ from the final feature $\Phi$. A 2.5% decrease in test accuracy is observed in Table 4. We argue that, by modeling relational relevance, CMR captures high-order relations that are not covered by entity relevance. We present two examples of textual relation ranking scores in Figure 4. The learned ranking score highlights the important pairs, for example "gold - top" and "looking - left", which describe the important relations in the textual modality.

6 Conclusion

In this paper, we propose a novel cross-modality relevance (CMR) approach for language and vision reasoning. In particular, we argue for the significance of the relevance between the components of the two modalities for reasoning, which includes entity relevance and relational relevance.


We propose an end-to-end cross-modality relevance framework that is tailored for language and vision reasoning, and we evaluate the proposed CMR on the NLVR and VQA tasks. Our approach exceeds the state-of-the-art on the NLVR2 and VQA v2.0 datasets. Moreover, the model trained on NLVR2 boosts the training on the VQA v2.0 dataset. The experiments and the empirical analysis demonstrate CMR's capability of modeling relational relevance for reasoning and, consequently, its better generalizability to unobserved data. This result indicates the significance of relevance patterns. Our proposed architectural component for capturing relevance patterns can be used independently from the full CMR architecture and is potentially applicable to other multi-modal tasks.

Acknowledgments

We thank the anonymous reviewers for their helpful comments. This project is supported by National Science Foundation (NSF) CAREER award #1845771.

References

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Andrea Frome, Gregory S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A deep visual-semantic embedding model. In NIPS.

Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6325–6334.

Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. 2016. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 55–64. ACM.

Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. 2017. Learning to reason: End-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2333–2338. ACM.

Drew A. Hudson and Christopher D. Manning. 2018. Compositional attention networks for machine reasoning. In International Conference on Learning Representations (ICLR).

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. 2018. Bilinear attention networks. In Advances in Neural Information Processing Systems, pages 1564–1574.

Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. 2015. Siamese neural networks for one-shot image recognition. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France. JMLR: W&CP volume 37.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.

Linjie Li, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019a. Relation-aware graph attention network for visual question answering. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 10312–10321.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019b. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.


Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13–23.

Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Learning to match using local and distributed representations of text for web search. In Proceedings of the 26th International Conference on World Wide Web, pages 1291–1299. International World Wide Web Conferences Steering Committee.

Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2016. Text matching as image recognition. In Thirtieth AAAI Conference on Artificial Intelligence.

Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. 2018. FiLM: Visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99.

Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. 2014. Learning semantic representations using convolutional neural networks for web search. In Proceedings of the 23rd International Conference on World Wide Web, pages 373–374. ACM.

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: Pre-training of generic visual-linguistic representations. In International Conference on Learning Representations.

Alane Suhr, Mike Lewis, James Yeh, and Yoav Artzi. 2017. A corpus of natural language for visual reasoning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 217–223, Vancouver, Canada. Association for Computational Linguistics.

Alane Suhr, Stephanie Zhou, Iris D. Zhang, Huajun Bai, and Yoav Artzi. 2018. A corpus for reasoning about natural language grounded in photographs. In ACL.

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. VideoBERT: A joint model for video and language representation learning. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 7463–7472.

Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.

Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hengel. 2018. Tips and tricks for visual question answering: Learnings from the 2017 challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4223–4232.

Yao-Hung Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Multimodal transformer for unaligned multimodal language sequences. In ACL.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS.