Finding Images by Dialoguing with Image

Lejian Ren ([email protected]), Institute of Information Engineering, Chinese Academy of Sciences
Si Liu* ([email protected]), Beihang University; Guangdong Provincial Key Laboratory of Computer Vision and Virtual Reality Technology, SIAT, CAS
Han Huang ([email protected]), Beihang University
Jizhong Han ([email protected]), Institute of Information Engineering, Chinese Academy of Sciences
Shuicheng Yan ([email protected]), National University of Singapore
Bo Li ([email protected]), Beihang University

*Corresponding author.

ABSTRACT
Image retrieval in complicated scenes is a challenging task that requires comprehensive understanding of an image. In this paper, we propose a scene graph based image retrieval framework that combines scene graph generation with image retrieval and fine-tunes the search results via a dialogue mechanism. Specifically, we propose an image retrieval oriented scene graph generation model that takes an image and a text describing the image as inputs. The additional text input is used to control the generated scene graph: it provides information for a newly introduced attribute head to better predict attributes, and at the same time helps construct an adjacency matrix. A Graph Convolutional Network is further used to gather information among nodes for precise relation estimation. Moreover, the scene graph can be modified by changing the text. Our proposed approach achieves state-of-the-art performance in both scene graph based image retrieval and scene graph generation on the Visual Genome dataset.

CCS CONCEPTS
• Information systems → Query representation; Personalization.

KEYWORDS
Image Retrieval, Neural Networks, Scene Graph Generation, Deep Learning

ACM Reference Format:
Lejian Ren, Si Liu, Han Huang, Jizhong Han, Shuicheng Yan, and Bo Li. 2019. Finding Images by Dialoguing with Image. In Proceedings of the 27th ACM International Conference on Multimedia (MM '19), October 21–25, 2019, Nice, France. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3343031.3350907

Figure 1: Finding images by dialoguing with image. Given an image (a) which is treated as a scene graph, we use text to change the node connections (remove "man") to obtain search result (b), and then change the node attributes (convert "young" to "little") to obtain search result (c). The blue and green nodes in column 3 denote objects and attributes respectively.
1 INTRODUCTION
With the exponential growth of multimedia data on the Internet, image retrieval has become increasingly important. Consider a scenario in which we have an image and want to search for similar or slightly modified ones. For example, we have a query image depicting "blond hair young girl with boyfriend walks black dog", as shown in Fig. 1(a), and we need to retrieve similar images. What's more, in many cases the query image does not exactly match our requirements: we may need to change the query by removing the "man" (b) or by replacing "young girl" with "little girl" (c). Obviously, this kind of image retrieval is challenging for existing image retrieval systems.



Approaches capable of this kind of image retrieval need to: (1) obtain the full information of images with complicated structures, and (2) be able to modify that information. Conventional image retrieval methods can be roughly divided into two categories, text based and content based. Text Based Image Retrieval (TBIR) approaches can handle simple queries like "girl walks dog", but for more complicated queries the search results become scattered, since TBIR relies on human generated text that is usually too brief to include all the information. Content Based Image Retrieval (CBIR) approaches, on the other hand, are typically used to search for identical images, which cannot meet complex requirements, let alone modification. A better way to approach this task is to treat images as scene graphs. A scene graph, whose nodes and edges respectively represent objects and the relations between them, is considered a comprehensive description of an image. Object relation detection was first introduced in [4, 10], which at that time mostly focused on spatial correlations like "A above / under B". In recent years, common relation detection has drawn increasing attention. For a given image, the objects are first detected using a popular detection framework [41]. Next, the relations between objects are formulated as a pair-wise classification problem. The scene graph is finally obtained by connecting the objects and relations, a task known as Scene Graph Generation (SGG) [50]. However, existing scene graph generation approaches have two limitations. On one hand, they cannot modify the image: these SGG models intend to generate a full graph of an image but cannot change its nodes or edges. On the other hand, as these approaches are not designed for image retrieval, object attributes are not taken into consideration. Yet attributes are a crucial element in image retrieval: "black dog" and "white dog" can refer to visually distinct dogs.

In this paper, we propose a novel scene graph based image retrieval framework where users can dialogue [29] with an image. Our framework consists of two stages, scene graph generation and scene graph based image retrieval. In the first stage, a specially designed scene graph generation model generates a scene graph with text association. The model takes an image and a text describing this image as inputs and can be decomposed into two components, attribute aggregated object detection and scene graph generation. For a given image, the object detection module first locates the objects and classifies them. To integrate object attributes, we introduce an attribute head in addition to the original box head. The attribute head consists of an MLP and several classifiers covering object color, shape, material, and so on. In particular, the attribute head also takes the text input into consideration: an attention model extracts adjective information from the text, which is then combined with the attribute representation so that we can control the attributes via different text inputs. With the attribute aggregated boxes detected by the detector, the scene graph generation module estimates the relations between objects. The detected objects and the text input are fused in the intermediate layer to construct an adjacency matrix which determines whether there is an edge between two nodes. The adjacency matrix is then used in a Graph Convolutional Network (GCN [21]) to gather neighbors' information for each node. Relations between objects are finally estimated by combining the features of two objects followed by a classifier, and the scene graph is obtained by connecting these nodes and edges. In the second stage, the generated scene graph is used for scene graph based image retrieval. Moreover, we can refine the query graph by changing the text input to further adjust the search results. By iteratively dialoguing with the image in this way, we can finally retrieve the images that exactly match our requirements.

Our contributions can be summarized as follows. (1) We combine scene graph generation with image retrieval and propose a complete framework for scene graph based image retrieval; by introducing an additional text input, we can dialogue with the image and perform retrieval more precisely. (2) We propose an image retrieval oriented scene graph generation model that aggregates attributes and supports graph modification. To this end, we collect an attribute subset from the original Visual Genome dataset, integrate attributes into the detector, and use the additional text input to construct the adjacency matrix for message passing. (3) Our approach outperforms other state-of-the-art approaches in both scene graph based image retrieval and scene graph generation on the Visual Genome dataset [22].

2 RELATED WORK
Image Retrieval: The image retrieval task has attracted much research attention in multimedia. Thanks to the discriminative power of convolutional neural networks, the accuracy of image retrieval has been greatly improved [11, 12, 20, 39, 40]. Kalantidis et al. [20] create powerful image representations via cross-dimensional weighting and aggregation of CNN layer outputs. Gordo et al. [11, 12] produce a global, compact fixed-length representation for an image by aggregating region-wise descriptors. Radenović et al. [39, 40] fine-tune a CNN for image retrieval from a large collection of unordered images in a fully automated manner. Such CNN-based methods focus more on instance-level retrieval: they tend to find images with similar visual appearance but ignore the overall scene structure of images. In addition, [7, 34, 47] take natural language descriptions of images as queries but still fail on complex queries. To address this concern, a new type of image representation, the scene graph, which contains abundant and detailed semantic image information, has been used for image retrieval [19]. With a well-designed conditional random field model, Johnson et al. [19] obtain matching degrees between a given scene graph and images, which are regarded as ranking scores for image retrieval; however, entering a complicated scene graph is not user friendly. Schuster et al. [42] ease the input complexity via a text parsing algorithm so that graphs can be generated from user-entered image descriptions. However, neither approach can generate a scene graph from an image directly.

Another common problem in the image retrieval task is the intention gap between users and retrieval systems: the difficulty a user faces in precisely expressing the expected visual content as a query. Some research [13, 14, 46] has been done to help users better express their retrieval intention and receive more satisfactory images. In [13], apart from a query image, a text modifier is adopted to add or remove semantic concepts such as object categories, image scenarios, and so on. Guo et al. [14] propose a dialog-based interactive image retrieval system which uses reinforcement learning to improve rankings from every user feedback. Recently, Vo et al. [46] explore different mechanisms of image-text composition to help users express their real search intention. Inspired by these prior studies, we allow users to add a region- or image-level description in order to emphasize or modify target concepts in the query image.

Scene Graph Generation: Stimulated by the innovative work of Lu et al. [31] and the large-scale scene graph dataset Visual Genome [22], a number of approaches [2, 5, 16, 25, 26, 36, 38, 48, 50–54] have been proposed for scene graph generation and related tasks. Early work focused more on independent detection processes and feature design, while neglecting to utilize visual context for relation estimation. Researchers now mainly adopt two methods to make use of contextual information: (1) designing various message passing mechanisms to propagate local context [2, 26, 50, 51], and (2) aggregating global context with recurrent neural networks or graph feature embedding [16, 38, 48, 52]. Recently, graph-based neural networks [21, 28, 45] have become a hot research topic and have shown their effectiveness in many other fields. As the scene graph is itself a special kind of graph (nodes represent objects and edges represent relationships between objects), we believe that using a graph convolutional network to propagate information from neighboring nodes can better refine the feature representations of object nodes. Compared to [51], the additional text description of the query image helps our model sparsify the initial fully connected adjacency matrix and place attention on the entities users are interested in, which is customized for the retrieval task.

In [19], object attributes are considered an important component of the scene graph and also play a role in retrieval. However, current scene graph generation models do not take attribute prediction into account. One of the major factors hindering attribute prediction in scene graphs is that the Visual Genome dataset [22] suffers from serious long-tail effects in object attributes. Drawing on prior work on visual attributes [8, 9, 23, 24, 37], we collect a subset of attributes from the original Visual Genome dataset to reduce the complexity of the attribute prediction task.

3 IMAGE RETRIEVAL ORIENTED SGG
An overview of our proposed approach is shown in Fig. 2. Our framework is divided into two stages. In the first stage, a specially designed model, built upon recent relation detection approaches [50–52], generates scene graphs. The model takes an image and its corresponding image description as inputs. The image is fed into the attribute aggregated detector to obtain region proposals, which serve as the nodes of the scene graph. The node information and the image description are fused in hidden states to construct an adjacency matrix, which is used by the following GCN to gather information from neighbors for each node. The relations between nodes are finally estimated to obtain the scene graph. In the second stage, the generated scene graph is used for image retrieval, with the image and text (optional) as the query and the graph as the semantic representation of the image. We can refine the search results by modifying the image description (e.g., changing attributes, removing a specific object), which we call dialoguing with the image.

3.1 SGG with Text Association
To constrain the graph generated from the image, we introduce an additional text input and design a dual-input scene graph generation approach, with each input performing its own function: the image is used to generate the visual features, while the image description guides the graph construction. Given an image $I$, we first feed it to the VGG-16 based detector to obtain Regions of Interest (RoIs) and the corresponding visual features $F_{RoI} \in \mathbb{R}^{512 \times 7 \times 7}$. $F_{RoI}$ is then projected to $F_{node} \in \mathbb{R}^{512}$ through several fully connected layers. Since an image containing objects can be regarded as a scene graph, a Graph Convolutional Network (GCN) is used in this work to gather information among nodes. In the original GCN, the topological structure, namely the adjacency matrix $A \in \mathbb{R}^{N \times N}$ where $N$ is the number of nodes, is given, so that message propagation can be carried out along $A$. However, this poses a problem: we are trying to generate a scene graph, yet the graph is needed in the message propagation process. One naive way to address this is to use a fully connected $A$, with every element set to 1, but this usually introduces redundant information, even noise, since information from some nodes is totally useless to others.

We solve this problem by utilizing the second, text input mentioned above. The texts are collected from the region captions in the Visual Genome dataset. We count the frequency of all words in the dataset and keep only the top 10,000; low-frequency words in the text are replaced with "unknown" during training. The image description is first converted to word embeddings $\omega = [\omega_1, \dots, \omega_K]$ using Word2Vec [35], where $K$ is the number of words in the image description. A two-layer LSTM [17] is further used to encode the word embeddings into hidden states $H = [h_1, \dots, h_K]$. Next, $A$ is constructed via:

$$A_{i,j} = \max_{k=1,\dots,K}\left\{\left(W_\alpha \phi(S_{node_i}) + W_\alpha \phi(S_{node_j})\right) \cdot h_k\right\}, \qquad (1)$$

where $S_{node_i} \in \mathbb{R}^C$ ($C$ denotes the number of object categories) is the classification probability of node $i$, $\phi$ is a one-layer MLP mapping $S_{node_i}$ to the same hidden-state space as $h_k$, and $W_\alpha$ is a learnable parameter. $+$ and $\cdot$ denote element-wise addition and dot product respectively. Each row of the adjacency matrix is then normalized by the SoftMax function:

$$A_i = SoftMax(A_i). \qquad (2)$$

The intuitive purpose of Eq. 1 and Eq. 2 is to create a strong connection (near 1) between nodes that agree with the description and a weak connection (near 0) otherwise. For example, the "person" and "bike" nodes will get higher scores than "sofa" and "cat" when the given text is "a person is riding bike".
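To make the construction in Eqs. (1)–(2) concrete, below is a minimal PyTorch sketch of the text-guided adjacency matrix. The module name, the use of `nn.Linear` layers for $\phi$ and $W_\alpha$, and the default dimensions (150 object classes, 512-d hidden states) are our assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedAdjacency(nn.Module):
    """Sketch of Eqs. (1)-(2): build a soft adjacency matrix from node
    class scores and the LSTM hidden states of the text description."""

    def __init__(self, num_classes=150, hidden_dim=512):
        super().__init__()
        self.phi = nn.Linear(num_classes, hidden_dim)                 # phi: one-layer MLP
        self.w_alpha = nn.Linear(hidden_dim, hidden_dim, bias=False)  # W_alpha

    def forward(self, node_scores, text_hidden):
        # node_scores: (N, C) classification probabilities S_node
        # text_hidden: (K, hidden_dim) LSTM hidden states h_1..h_K
        s = self.w_alpha(self.phi(node_scores))      # (N, D)
        pair = s.unsqueeze(1) + s.unsqueeze(0)       # (N, N, D) element-wise sums
        scores = pair @ text_hidden.t()              # (N, N, K) dot products with each h_k
        A = scores.max(dim=-1).values                # max over words, Eq. (1)
        return F.softmax(A, dim=-1)                  # row-wise normalization, Eq. (2)
```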

With the graph $A$ established, information aggregation among nodes is achieved by:

$$F^*_{node_i} = \sigma\Big(F_{node_i} + \sum_{j \in \mathcal{N}(i)} A_{i,j}\, W_\beta F_{node_j}\Big), \qquad (3)$$

where $\mathcal{N}(\cdot)$ denotes the neighbor nodes of a given node, $W_\beta$ denotes learnable parameters, and $\sigma$ is a non-linear function. The edges $r_{i,j} \in \mathbb{R}^R$ (relations) are finally established by combining the subject and object nodes:

$$r_{i,j} = \delta([\hat{F}_{node_i}, \hat{F}_{node_j}, F_{union}]), \qquad (4)$$

where $R$ denotes the number of relation categories and $\delta$ denotes a fully connected layer. The subject and object node features, together with their union feature, are concatenated to estimate the relation.
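A compact sketch of the propagation and relation steps follows, under the same caveats: only a single propagation step is shown (the paper uses a second-order GCN), and the class name, layer choices, and the shape assumed for `F_union` are placeholders we introduce.

```python
import torch
import torch.nn as nn

class RelationEstimator(nn.Module):
    """Sketch of Eqs. (3)-(4): one GCN propagation step followed by
    pairwise relation classification over concatenated features."""

    def __init__(self, dim=512, num_relations=50):
        super().__init__()
        self.w_beta = nn.Linear(dim, dim, bias=False)   # W_beta in Eq. (3)
        self.delta = nn.Linear(3 * dim, num_relations)  # delta in Eq. (4)

    def propagate(self, F_node, A):
        # Eq. (3): each node adds its neighbors' features weighted by A
        return torch.relu(F_node + A @ self.w_beta(F_node))

    def forward(self, F_node, A, F_union):
        # F_node: (N, dim) node features; A: (N, N) soft adjacency;
        # F_union: (N, N, dim) union-box features for every node pair
        F_hat = self.propagate(F_node, A)
        N = F_hat.size(0)
        subj = F_hat.unsqueeze(1).expand(N, N, -1)      # subject feature per pair
        obj = F_hat.unsqueeze(0).expand(N, N, -1)       # object feature per pair
        pair = torch.cat([subj, obj, F_union], dim=-1)  # (N, N, 3*dim)
        return self.delta(pair)                         # (N, N, R) relation logits
```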


Figure 2: Framework overview. Given an image and a text describing the whole image (or a region) as inputs, the model first detects the objects and predicts their attributes. Next, the text input is utilized to help construct the adjacency matrix for the GCN to pass information among nodes. The generated scene graph is used to do image retrieval.


Notably, different from existing works that generate the whole scene graph, we only generate the sub-graph corresponding to the given text. This is achieved by keeping only the relations inside the region proposals related to the text during training. By doing so, the scene graph generation is highly related to the given text input. During inference, an adjacency matrix full of ones is provided if no text is given. This characteristic is further exploited to refine the query in the following section.

3.2 Attribute Aggregation
Attributes are an important component in image retrieval, yet they are ignored by most scene graph generation approaches. We incorporate attribute prediction into our framework to make it a complete system for scene graph based image retrieval.

Data Collection: There are 40,513 unique attributes in the original Visual Genome dataset. To engage attribute classification in detection, we collect an attribute subset from the original set. The guideline is to keep attributes that are inherent to objects and common. Following this guideline, attributes occurring fewer than 50 times are first removed, which reduces the number of attributes to 1,120. Next, the remaining attributes are classified into 12 specific groups via a popular clustering algorithm, and outlier attributes are simply removed. The roughly clustered groups are further refined by manually merging synonyms and adjusting group members to make the groups mutually exclusive. Finally, we obtain an attribute subset containing 11 groups with several attributes each and 53 attributes in total. Fig. 3 visualizes the kept attribute groups using t-SNE [33]; attributes within each group are semantically close, which indicates that our grouping is reasonable. The attribute distribution is shown in Fig. 4(a). A group like "age" only applies to humans, so its proportion is smaller, while the most common group, color, contains 12 different colors, as shown in Fig. 4(b).

Figure 3: t-SNE visualization of the grouped attributes.
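A hypothetical sketch of this collection procedure is shown below. The helper names (`attribute_counts`, `embed`) and the use of k-means from scikit-learn are our assumptions; the paper only states that a "popular clustering algorithm" was used, and the final cleanup was manual.

```python
import numpy as np
from sklearn.cluster import KMeans

def collect_attribute_subset(attribute_counts, embed, min_count=50, num_groups=12):
    """Drop rare attributes, then cluster the rest into coarse groups
    that are afterwards cleaned up by hand."""
    # Step 1: remove low-frequency attributes (40,513 -> ~1,120 in the paper)
    kept = [a for a, c in attribute_counts.items() if c >= min_count]
    # Step 2: rough grouping by clustering word embeddings
    vectors = np.stack([embed(a) for a in kept])
    labels = KMeans(n_clusters=num_groups, n_init=10).fit_predict(vectors)
    groups = {g: [a for a, l in zip(kept, labels) if l == g] for g in range(num_groups)}
    # Step 3 (manual in the paper): drop outlier groups, merge synonyms,
    # and adjust members until the groups are mutually exclusive.
    return groups
```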


Figure 4: Attribute data distribution. (a) The 11 attribute groups. (b) The color group.

Attribute Head: The attribute head has an architecture similar to conventional detection heads. It consists of 2 fully connected layers followed by 11 classifiers corresponding to the 11 attribute groups. For each RoI, the feature map $F_{RoI}$ is fed into the fully connected layers to obtain $F_{vec} \in \mathbb{R}^{1024}$. The text features in the hidden states $H$ mentioned in Section 3.1 are aggregated into the attributes via:

$$M = \tanh(\varphi(H)) \qquad (5)$$
$$\alpha = softmax(W_\theta H) \qquad (6)$$
$$a = M\alpha^T \qquad (7)$$
$$F^*_{vec} = F_{vec} + a, \qquad (8)$$

where $\varphi$ is a fully connected layer used to convert the original $H$ to the intermediate state $M \in \mathbb{R}^{1024}$, and $W_\theta$ denotes learnable parameters. In Eq. 6 we introduce a text attention $\alpha$, which is multiplied with the intermediate states to obtain the adjective information $a$. We then simply sum $F_{vec}$ with $a$ to obtain the final attribute representation $F^*_{vec}$, which is used for attribute classification. By doing so, the object attributes and our text input are linked, which contributes to our image retrieval.
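Eqs. (5)–(8) admit a compact implementation. The sketch below assumes one plausible reading of the shapes ($M$ holding one 1024-d vector per word, $\alpha$ one scalar weight per word); the class name and layer choices are ours.

```python
import torch
import torch.nn as nn

class AttributeTextAttention(nn.Module):
    """Sketch of Eqs. (5)-(8): attend over the text hidden states to pull
    adjective information into the RoI attribute representation."""

    def __init__(self, hidden_dim=512, vec_dim=1024):
        super().__init__()
        self.phi = nn.Linear(hidden_dim, vec_dim)            # phi in Eq. (5)
        self.w_theta = nn.Linear(hidden_dim, 1, bias=False)  # W_theta in Eq. (6)

    def forward(self, F_vec, H):
        # F_vec: (vec_dim,) RoI attribute feature; H: (K, hidden_dim) text states
        M = torch.tanh(self.phi(H))                    # (K, vec_dim), Eq. (5)
        alpha = torch.softmax(self.w_theta(H), dim=0)  # (K, 1) word weights, Eq. (6)
        a = (M * alpha).sum(dim=0)                     # adjective vector a, Eq. (7)
        return F_vec + a                               # F*_vec, Eq. (8)
```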

3.3 Image Retrieval and Dialogue
The scene graph representation contains rich, structured information, which makes it a good carrier for image retrieval. To use scene graphs for image retrieval, the first problem we need to tackle is defining relevant and irrelevant search results. Considering both rationality and efficiency, we use the triplet coincidence degree to measure graph similarity: the more identical ⟨subject, relation, object⟩ triplets two graphs share, the more likely the corresponding images are relevant. In particular, we balance the weights of different triplets in a scene graph using the values of the adjacency matrix $A$, so that triplets matching the text input receive higher weights.

Algo. 1 shows the pipeline of graph based image retrieval. This pipeline is used both to build the corpus and to do image retrieval. Compared with TBIR and CBIR, the differences lie in feature extraction and distance calculation: feature extraction is replaced with graph generation, while distance is measured by graph similarity instead of L2 norm or cosine distance. With slight modification, current image retrieval systems can be converted to graph based ones.

Algorithm 1: Image retrieval using scene graph
Input: Query image I_q, text T_q, corpus G_c = [G_1, G_2, ..., G_N]
Output: Query result R[N]
Graph G_q ← SGG(I_q, T_q)
for i = 1, ..., N do
    score ← 0
    for each triplet in G_q do
        if triplet in G_c[i] then
            score ← score + 1
        end
    end
    R[i] ← score
end
sort R in descending order
return R
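A direct Python rendering of Algorithm 1 is given below, assuming scene graphs are represented as sets of ⟨subject, relation, object⟩ triplets and `sgg_model` stands in for the generator of Section 3.1; the per-triplet weighting by adjacency values mentioned above is omitted for brevity.

```python
def retrieve(query_image, query_text, corpus_graphs, sgg_model):
    """Score corpus images by the number of <subject, relation, object>
    triplets shared with the query's generated scene graph."""
    query_graph = sgg_model(query_image, query_text)   # set of triplets
    scores = []
    for i, graph in enumerate(corpus_graphs):
        # count query triplets that also appear in corpus graph i
        overlap = sum(1 for triplet in query_graph if triplet in graph)
        scores.append((i, overlap))
    # rank corpus images by descending triplet overlap
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```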

Dialoguing with Image: Since the retrieval is scene graph based, modifications to the graph, e.g., changing the nodes (object/attribute types) or edges (relations), directly affect the search results. Recall that the text corresponding to the image is fed into the network as a second input, simultaneously providing information for attribute prediction and adjacency matrix construction. At training time, the text input exactly matches the image region, so the model is trained to be highly responsive to the text. As a result, during retrieval we can modify the image description to fine-tune the search results; by iteratively dialoguing with the image, we obtain increasingly accurate results. If no image description is given, we use an all-ones adjacency matrix.

Discussion: The goal of scene graph based image retrieval is to find images with similar scene structure. From this perspective, it is closer to TBIR, but unlike TBIR it can handle images with more complicated scenes (or graphs), which would require a lengthy description in TBIR. However, this intuitive advantage also becomes a limitation when the graph is too simple, even edgeless, for example when searching for things identical to the query or for a specific person or object. Hence scene graph based image retrieval is more likely to serve as a complementary component in current image retrieval systems. Besides, since the query image cannot always exactly match the query condition, the dialoguing process is an important way to perfect scene graph based image retrieval.

4 EXPERIMENTS
Dataset and Metrics: Following existing works, we use the pruned version [50] of VG (VG-150 for short) so that the results are comparable. VG-150 contains 150 unique object categories and 50 relation categories, selected by frequency. We use 62,723 images for training and 26,446 for testing. The testing set is used to evaluate the performance of both image retrieval and scene graph generation. For image retrieval, for each image (or query) in the testing set, the graph similarities with the rest of the images are calculated, yielding a 26,446 × 26,446 similarity matrix. The sorted similarity matrix (descending order by row) is regarded as the ground truth of the image retrieval results, with higher similarity scores indicating higher relevance. We use mean NDCG@K [18] to measure retrieval performance:

$$DCG = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i + 1)} \qquad (9)$$

$$IDCG = \sum_{i=1}^{|GT|} \frac{2^{rel_i} - 1}{\log_2(i + 1)} \qquad (10)$$

$$nDCG = \frac{DCG}{IDCG}, \qquad (11)$$

where $K$ is the number of top retrieval results considered and $GT$ denotes the ideal retrieval sequence. $rel_i$ denotes the relevance of the $i$-th result to the query; here we use the graph similarity as this relevance term in NDCG.
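For reference, a small NumPy sketch of Eqs. (9)–(11); `rels` and `ideal_rels` are hypothetical arrays of graph-similarity relevances for the returned ranking and the ground-truth ranking respectively.

```python
import numpy as np

def ndcg_at_k(rels, ideal_rels, k):
    """NDCG@K with graph similarity as the relevance term (Eqs. 9-11)."""
    def dcg(scores):
        scores = np.asarray(scores, dtype=float)
        ranks = np.arange(1, len(scores) + 1)
        return np.sum((2.0 ** scores - 1.0) / np.log2(ranks + 1))
    return dcg(rels[:k]) / dcg(ideal_rels)   # DCG over top-K divided by IDCG
```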

For scene graph generation, we follow the three commonly used metrics:
• Predicate Classification: recognizing the relation between two objects given the ground truth boxes and classes.
• Scene Graph Classification: recognizing the relation between two objects given only the ground truth boxes.
• Scene Graph Detection: recognizing the relation between two objects.

Accuracy is used to measure the performance of our attribute head.

Implementation Details: We use Faster R-CNN [41] with VGG-16 [43] as our backbone. During training, the image batch size and proposal batch size are set to 4 and 256, respectively. The positive sample ratio in the RPN is set to 0.5 and the NMS threshold to 0.7. The text input is first embedded into 100-d vectors, then encoded into 512-d hidden states via a 2-layer LSTM. RoIAlign [15] is used to extract RoI features, which are projected to 512-d vectors to pass information in the GCN. We use a second-order GCN so that every node can obtain information from both first-order and second-order neighbors. The attribute head calculates losses only on positive samples, and focal loss [30] is used to tackle the imbalanced data distribution. The whole framework is trained with a two-stage strategy: in the first stage we train only the detector; in the second stage the whole network is trained end to end with the backbone parameters frozen. We use SGD [1] to optimize the model, with the learning rate initialized to 1e-3 and weight decay set to 1e-4. The two stages take about 18 hours and 14 hours respectively on a single Nvidia GTX 1080Ti GPU. During image retrieval, object attributes are regarded as nodes connected to their objects by the relation "is".
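A rough sketch of the second training stage described above (the attribute name `model.backbone` and the helper function are our assumptions; stage one, detector-only training, is omitted):

```python
import torch

def stage_two_optimizer(model):
    """Freeze the backbone and optimize the rest with SGD,
    matching the schedule described above."""
    for p in model.backbone.parameters():   # `backbone` is an assumed attribute name
        p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=1e-3, weight_decay=1e-4)
```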

Baselines: We compare our image retrieval performance with three kinds of existing approaches.
• TF-IDF [32, 44] is used for text based image retrieval, with the index built on the testing dataset.
• Both a handcrafted feature (HOG [6]) and a deep Multi-Label Classifier (MLC) feature are used for content based image retrieval. In particular, since images in the VG dataset contain more than one object, the MLC is trained with multiple labels using ResNeXt 152 [49].
• We also draw comparisons with [27, 50, 52]. IMP [50] proposes a model that passes messages between nodes and edges iteratively via GRUs [3]. MSDN [27] solves three vision tasks (detection, scene graph detection, and image captioning) jointly. Motifs [52] uses object and edge contexts to refine the relation representation. We also compare our scene graph generation performance with these works.

Table 1: Image retrieval comparison using NDCG@K.

Methods          NDCG@5   NDCG@10   NDCG@20
TF-IDF [32, 44]  0.3707   0.3645    0.3596
HOG [6]          0.2628   0.2187    0.1774
MLC              0.3322   0.2732    0.2036
IMP [50]         0.4325   0.3561    0.3159
MSDN [27]        0.4659   0.3702    0.3168
MOTIFS [52]      0.4823   0.3922    0.3345
OURS             0.4977   0.4089    0.3498

Figure 5: NDCG@K curve.

4.1 Results on Image Retrieval
We compare our approach with existing works, including conventional image retrieval approaches and other scene graph generation approaches. Tab. 1 shows the experimental results on the VG dataset using NDCG@K, where K is set to 5, 10, and 20. Clearly, content based image retrieval approaches are incapable of this kind of retrieval task, which requires capturing most of the semantic information of a given image. The text based image retrieval approach performs better than the former, but it can only handle simple scenes that can be described in a few words; this will be further discussed in the visualization section. As for scene graph based image retrieval, our approach outperforms the other works. We also plot the NDCG results over a continuous range of K, shown as the NDCG@K curve in Fig. 5.


Figure 6: Top to bottom are TF-IDF, MLC, MOTIFS, and our results (with the description "Dog on the bed") respectively. The left column is the query and the right columns are query results. Images in green / yellow / red boxes denote high / medium / low matches.

Figure 7: Search results change with the text input: (a) "A man wearing hat is riding a horse near water"; (b) "A man wearing hat is riding a horse"; (c) "A man wearing hat is riding a white horse". Images in red boxes are failure cases.

From Fig. 5 we can see that our approach has the advantage at small K but is caught up by TF-IDF once K exceeds 20. The reason is that the corpus built from the testing set does not contain many highly structurally similar images: our approach retrieves the most similar results first, so it performs better at small K, but as K grows and the highly similar images run out, our NDCG score decreases, while TF-IDF remains stable since its queries contain many keywords.

Visualization: Fig. 6 visualizes the image retrieval results. We can see that the text based approach retrieves images matching segments of the query text (a); as a result, these images are often quite different from one another. Its performance might degrade further in practical applications, since user generated image tags are mostly not as structured and complete as the deliberately labeled descriptions in the VG dataset used for TF-IDF. The content based approach suffers from a severe semantic gap, so its results are very strange (b). We also find that scene graph generation approaches are well suited to this task (c), since they can capture object relations. Our retrieval results are the most satisfying, as the comprehensive image understanding composed of ⟨dog, on, bed⟩, ⟨book, on, bed⟩, and ⟨dog, near, book⟩ is learned by our model.

Figure 8: Scene graphs generated by our model. Green boxes and arrows are true positive predictions while red ones are false positives. Boxes and arrows with dotted lines are false negatives.

Table 2: SGG results.

Methods       SGDet R@50  SGDet R@100  SGCls R@50  SGCls R@100  PredCls R@50  PredCls R@100
IMP [50]      16.3        19.0         28.3        29.3         57.2          60.1
MSDN [27]     7.7         10.5         19.3        21.8         63.1          66.4
MOTIFS [52]   27.0        30.3         35.8        36.5         65.2          67.1
OURS          27.6        30.4         36.3        37.2         65.9          68.2

Table 3: Accuracy of attribute classification.

Class      Acc      Class        Acc      Class     Acc
Color      0.4746   Cloth Style  0.6286   Humidity  0.7444
Material   0.5273   Age          0.7433   Capacity  0.7293
Pattern    0.6205   Length       0.7153   State     0.6583
Shape      0.7118   Size         0.6751   Mean      0.6571

Fig. 7 shows how the search results for the same image change with different text inputs. From (a) we can see that the search results completely match our query. By removing the connection to "water", we obtain ⟨man, riding, horse⟩ images with different backgrounds (b). The horse becomes white, as expected, when we change the horse's color via the text (c).

4.2 Component Analysis
Scene Graph Generation Results: Tab. 2 compares scene graph generation performance. All approaches are implemented with the VGG-16 backbone for fair comparison. Our model outperforms the state-of-the-art methods both quantitatively and qualitatively. Qualitative results are shown in Fig. 8; although there are some false negative predictions in (a), they are mostly reasonable.

Attribute Results: Even though the attributes we use are a selected subset of the original ones, attribute classification remains challenging due to the extremely imbalanced data distribution. Besides, some attributes like "silver" and "metal" look very similar in many scenarios, which further complicates the task. The attribute performance is shown in Tab. 3.

5 CONCLUSION
In this paper we propose a scene graph based image retrieval framework with text association to handle images of complicated scenes. The image, together with the text, is fed into the model to generate a structured scene graph, which is then used for image retrieval. By dialoguing with the image through the given text, our model can flexibly refine the retrieval results. Our approach is shown to be effective in both scene graph based image retrieval and scene graph generation, and it fills a gap in complicated scene image retrieval. Future work includes scene graph hashing and scene graphs for detection.

ACKNOWLEDGMENTS
This work was partially supported by the National Key R&D Program of China (Grant No. 2016YFC0801003, Grant U1536203, Grant 61572493, Grant 61876177), the Fundamental Research Funds for the Central Universities, the National Key Research and Development Program of China (Grants 2018YFC0806900 and 2017YFB1010000), and the Beijing Municipal Science and Technology Project (Grants Z191100007119008 and Z181100002718004).

REFERENCES
[1] Léon Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010. Springer, 177–186.
[2] Tianshui Chen, Weihao Yu, Riquan Chen, and Liang Lin. 2019. Knowledge-Embedded Routing Network for Scene Graph Generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6163–6171.
[3] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8), 2014.
[4] Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, and Silvio Savarese. 2013. Understanding indoor scenes using 3d geometric phrases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 33–40.
[5] Bo Dai, Yuqi Zhang, and Dahua Lin. 2017. Detecting Visual Relationships with Deep Relational Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3298–3308.
[6] Navneet Dalal and Bill Triggs. 2005. Histograms of oriented gradients for human detection. In International Conference on Computer Vision & Pattern Recognition (CVPR'05), Vol. 1. IEEE Computer Society, 886–893.
[7] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2625–2634.
[8] Ali Farhadi, Ian Endres, Derek Hoiem, and David A. Forsyth. 2009. Describing objects by their attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. 1778–1785.
[9] Vittorio Ferrari and Andrew Zisserman. 2008. Learning visual attributes. In Advances in Neural Information Processing Systems. 433–440.
[10] Carolina Galleguillos, Andrew Rabinovich, and Serge Belongie. 2008. Object categorization using co-occurrence, location and appearance. In 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1–8.
[11] Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. 2016. Deep image retrieval: Learning global representations for image search. In European Conference on Computer Vision. Springer, 241–257.
[12] Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. 2017. End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision 124, 2 (2017), 237–254.
[13] Albert Gordo and Diane Larlus. 2017. Beyond Instance-Level Image Retrieval: Leveraging Captions to Learn a Global Visual Representation for Semantic Retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5272–5281.
[14] Xiaoxiao Guo, Hui Wu, Yu Cheng, Steven Rennie, Gerald Tesauro, and Rogerio Feris. 2018. Dialog-based interactive image retrieval. In Advances in Neural Information Processing Systems. 678–688.
[15] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 2961–2969.
[16] Roei Herzig, Moshiko Raboh, Gal Chechik, Jonathan Berant, and Amir Globerson. 2018. Mapping images to scene graphs with permutation-invariant structured prediction. In Advances in Neural Information Processing Systems. 7211–7221.
[17] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[18] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (2002), 422–446.
[19] Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, and Li Fei-Fei. 2015. Image retrieval using scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3668–3678.
[20] Yannis Kalantidis, Clayton Mellina, and Simon Osindero. 2016. Cross-dimensional weighting for aggregated deep convolutional features. In European Conference on Computer Vision. Springer, 685–701.
[21] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In 5th International Conference on Learning Representations (ICLR 2017).
[22] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2016. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision 123 (2016), 32–73.
[23] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. 2009. Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. 951–958.
[24] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. 2014. Attribute-Based Classification for Zero-Shot Visual Object Categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (2014), 453–465.
[25] Yikang Li, Wanli Ouyang, Bolei Zhou, Jianping Shi, Chao Zhang, and Xiaogang Wang. 2018. Factorizable Net: an efficient subgraph-based framework for scene graph generation. In Proceedings of the European Conference on Computer Vision (ECCV). 335–351.
[26] Yikang Li, Wanli Ouyang, Bolei Zhou, Kun Wang, and Xiaogang Wang. 2017. Scene Graph Generation from Objects, Phrases and Region Captions. In 2017 IEEE International Conference on Computer Vision (ICCV). 1270–1279.
[27] Yikang Li, Wanli Ouyang, Bolei Zhou, Kun Wang, and Xiaogang Wang. 2017. Scene graph generation from objects, phrases and region captions. In Proceedings of the IEEE International Conference on Computer Vision. 1261–1270.
[28] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. 2016. Gated Graph Sequence Neural Networks. In 4th International Conference on Learning Representations (ICLR 2016).
[29] Lizi Liao, Yunshan Ma, Xiangnan He, Richang Hong, and Tat-Seng Chua. 2018. Knowledge-aware multimodal dialogue systems. In 2018 ACM Multimedia Conference. ACM, 801–809.
[30] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision. 2980–2988.
[31] Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. 2016. Visual relationship detection with language priors. In European Conference on Computer Vision. Springer, 852–869.
[32] Hans Peter Luhn. 1957. A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development 1, 4 (1957), 309–317.
[33] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579–2605.
[34] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. 2014. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv preprint arXiv:1412.6632 (2014).
[35] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[36] Alejandro Newell and Jia Deng. 2017. Pixels to Graphs by Associative Embedding. In NIPS.
[37] Devi Parikh and Kristen Grauman. 2011. Relative attributes. In 2011 International Conference on Computer Vision. 503–510.
[38] Mengshi Qi, Weijian Li, Zhengyuan Yang, Yunhong Wang, and Jiebo Luo. 2018. Attentive Relational Networks for Mapping Images to Scene Graphs. arXiv preprint arXiv:1811.10696 (2018).
[39] Filip Radenović, Giorgos Tolias, and Ondřej Chum. 2016. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In European Conference on Computer Vision. Springer, 3–20.
[40] Filip Radenović, Giorgos Tolias, and Ondřej Chum. 2018. Fine-tuning CNN image retrieval with no human annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018).
[41] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems. 91–99.
[42] Sebastian Schuster, Ranjay Krishna, Angel Chang, Li Fei-Fei, and Christopher D. Manning. 2015. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the Fourth Workshop on Vision and Language. 70–80.
[43] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[44] Karen Spärck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28, 1 (1972), 11–21.
[45] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In 6th International Conference on Learning Representations (ICLR 2018).
[46] Nam S. Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays. 2018. Composing Text and Image for Image Retrieval - An Empirical Odyssey. arXiv preprint arXiv:1812.07119 (2018).
[47] Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5005–5013.
[48] Sanghyun Woo, Dahun Kim, Donghyeon Cho, and In So Kweon. 2018. LinkNet: Relational embedding for scene graph. In Advances in Neural Information Processing Systems. 560–570.
[49] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1492–1500.
[50] Danfei Xu, Yuke Zhu, Christopher Bongsoo Choy, and Li Fei-Fei. 2017. Scene Graph Generation by Iterative Message Passing. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3097–3106.
[51] Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. 2018. Graph R-CNN for Scene Graph Generation. In ECCV.
[52] Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. 2018. Neural Motifs: Scene Graph Parsing with Global Context. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5831–5840.
[53] Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang, and Tat-Seng Chua. 2017. Visual Translation Embedding Network for Visual Relation Detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3107–3115.
[54] Bohan Zhuang, Lingqiao Liu, Chunhua Shen, and Ian D. Reid. 2017. Towards Context-Aware Interaction Recognition for Visual Relationship Detection. In 2017 IEEE International Conference on Computer Vision (ICCV). 589–598.