


arXiv:2004.07967v1 [cs.CV] 16 Apr 2020

Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence

Huy Manh Nguyen, Tomo Miyazaki, Yoshihiro Sugaya, and Shinichiro Omachi

Abstract—Visual-semantic embedding aims to learn a joint embedding space where related video and sentence instances are located close to each other. Most existing methods put instances in a single embedding space. However, they struggle to embed instances due to the difficulty of matching visual dynamics in videos to textual features in sentences. A single space is not enough to accommodate various videos and sentences. In this paper, we propose a novel framework that maps instances into multiple individual embedding spaces so that we can capture multiple relationships between instances, leading to compelling video retrieval. We produce a final similarity between instances by fusing the similarities measured in each embedding space using a weighted sum strategy. We determine the weights according to the sentence. Therefore, we can flexibly emphasize an embedding space. We conducted sentence-to-video retrieval experiments on a benchmark dataset. The proposed method achieved superior performance, and the results are competitive with state-of-the-art methods. These experimental results demonstrate the effectiveness of the proposed multiple embedding approach compared to existing methods.

I. INTRODUCTION

Video has become an essential source for humans to learn and acquire knowledge. Due to the increased demand for sharing and accumulating information, a massive amount of video is produced around the world every day. However, compared to images, videos usually contain much more semantic information, and thus it is hard for humans to organize them. Therefore, it is critical to develop an algorithm that can efficiently perform multimedia analysis and automatically understand the semantic information of videos.

A common approach to video analysis and understanding is to form a joint embedding space of video and sentence using multimodal learning. Just as humans experience the world with multiple senses, the goal of multimodal learning is to develop a model that can simultaneously process multiple modalities, such as visual, text, and audio, in an integrated manner by constructing a joint embedding space. Such models map various modalities into a shared Euclidean space where distances and directions capture useful semantic relationships. This not only enables us to learn the semantic representations of videos by leveraging information from other modalities but also shows great potential in tasks such as cross-modal retrieval and visual recognition.

H. M. Nguyen was with the Department of Communication Engineering, Tohoku University, Sendai, Miyagi 980-8579, Japan.

T. Miyazaki, Y. Sugaya, and S. Omachi are with Tohoku University (e-mail: see https://tomomiyazaki.github.io).

Recent works in learning an embedding space bridge the gap between sentence and visual information by utilizing advancements in image and language understanding [1], [2]. Most approaches build the embedding space by connecting visual and textual embedding paths. Generally, a visual path uses a Convolutional Neural Network (CNN) to transform visual appearances into a vector. Likewise, a Recurrent Neural Network (RNN) embeds sentences in a textual path. However, capturing the relationship between video and sentence remains challenging. Recent works suffer from extracting visual dynamics in a video [3]–[6].

In this paper, we propose a novel framework equipped with multiple embedding networks so that we can capture various relationships between video and sentence, leading to more compelling video retrieval. Precisely, one network captures the relationship between the overall appearance of the video and a textual feature, while the others consider consecutive appearances or action features. Thus, the networks learn their own embedding spaces. We fuse the similarities measured in the multiple spaces using a weighted-sum strategy to produce the final similarity between video and sentence.

The main contribution of this paper is a novel approach to measuring the similarity between video and sentence by fusing similarities in multiple embedding spaces. Consequently, we can measure the similarity with multiple understandings of the relationships between video and sentence. Besides, we emphasize that the proposed method can easily expand the number of embedding spaces. We demonstrated the effectiveness of the proposed method through experimental results: we conducted video retrieval experiments using query sentences on a standard benchmark dataset and demonstrated an improvement of our approach compared to existing methods.

II. RELATED WORK

Vision and Language Understanding

There have been many efforts in connecting vision and language, which focus on building a joint visual-semantic space [7]. Various applications in the area of computer vision need such a joint space to realize tagging [1], retrieval [8], captioning [8], and visual question answering [9].

Recent works in image-to-text retrieval embed image and sentence into a joint embedding space trained with a ranking loss: a penalty is applied when an incorrect sentence is ranked higher than the correct sentence [10]–[14]. Another popular loss function is the triplet ranking loss [1], [15]–[17]. The VSE++ model improves the triplet ranking loss by focusing on the hardest negative samples (the most similar yet incorrect samples in a mini-batch) [18].



Video and Sentence Embedding

Like image-to-text retrieval approaches, most video-to-text retrieval methods learn a joint embedding space [19]–[21]. The method in [3] incorporates web images searched with a sentence into the embedding process to take fine-grained visual concepts into account. However, the method treats video frames as a set of unrelated images and averages them to obtain a final video embedding vector. Thus, it may be inefficient in learning an embedding space since the temporal information of the video is lost.

Mithun et al. [22] tackle this issue by learning two different embedding spaces to consider temporal and appearance information. They also extract audio features from the video for learning a space. This approach achieved accurate performance on the sentence-to-video retrieval task. However, it puts equal importance on both embedding spaces, which does not work well in practice: there are cases in which one embedding space is more important than the other for capturing the semantic similarity between video and sentence. Therefore, we propose a novel mechanism that emphasizes a space so that we can identify the critical visual cues.

III. THE PROPOSED METHOD

A. Overview

We describe the problem of learning a joint embedding space for video and sentence embedding. Given video and sentence instances, which are sequences of frames and words, the aim is to train embedding functions that map them into a joint embedding space. Formally, we use embedding functions f : X → Z and g : Y → Z, where X and Y are the video and sentence domains, respectively, and Z is a joint embedding space. The similarity s(f(x), g(y)) between X and Y is calculated in Z by a certain measurement. Therefore, the ultimate goal is to learn f and g satisfying s(f(xi), g(yi)) > s(f(xi), g(yj)), ∀ i ≠ j. This encourages the similarity to increase for a matched pair xi and yi, and to decrease for a mismatched pair xi and yj.

As illustrated in Fig. 1, the proposed framework has two networks for embedding videos: the global and the sequential visual networks. These networks have counterparts that embed the sentence. Thus, we develop multiple visual and textual networks that are responsible for embedding a video and a sentence, respectively. We form one embedding space by merging two networks (one visual and one textual) so that we can align a visual feature to textual information. Specifically, the global visual network aligns the average visual information of the video to the sentence. Likewise, the sequential visual network aligns a temporal context to the sentence. Consequently, the networks receive a video and a sentence as inputs and map them into the joint embedding spaces.

We propose to fuse the similarities measured in the embedding spaces to produce the final similarity. One embedding space has the global visual network extracting a visual object feature from the video; thus, this embedding space describes a relationship between global visual appearances and textual features. The other embedding space captures the relationship between sequential appearances and textual features. The similarity scores of the two embedding spaces are then combined using a weighted sum to put more emphasis on one embedding space than on the other.

Our motivation is twofold. First, since videos and sentences require different attention, we aim to develop a more robust similarity measurement by combining multiple embedding spaces in a weighted manner instead of using a hard-coded average. Second, in order to highlight the spatial and temporal information of videos according to a sentence, we need a mechanism that utilizes textual information to emphasize spatial and temporal features.

B. Textual Embedding Network

We decompose the sentence into a variable-length sequence of tokens. Then, we transform each token into a 300-dimensional vector representation using the pre-trained GloVe model [23]. The number of tokens depends on the sentence. Therefore, in order to obtain a fixed-length meaningful representation, we encode the GloVe vectors using a Gated Recurrent Unit (GRU) [24] with H hidden states, resulting in a vector φ(y) ∈ RH. We set H = 512. This embedded vector φ(y) goes to four processing modules: the global and sequential visual networks, the spatial attention mechanism, and the similarity aggregation. We further transform φ(y) in each processing module.
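For concreteness, the following minimal PyTorch sketch illustrates this textual path under our own assumptions: the pre-trained GloVe table is replaced by a randomly initialized 300-dimensional embedding, and the module name TextualEmbeddingNet is ours, not the authors'.

```python
import torch
import torch.nn as nn

class TextualEmbeddingNet(nn.Module):
    """Sketch of the textual path: token vectors -> GRU -> phi(y) in R^H."""
    def __init__(self, vocab_size=10000, glove_dim=300, hidden=512):
        super().__init__()
        # Stand-in for the pre-trained 300-d GloVe table, which would be
        # loaded (and possibly frozen) in an actual implementation.
        self.word_emb = nn.Embedding(vocab_size, glove_dim)
        self.gru = nn.GRU(glove_dim, hidden, batch_first=True)

    def forward(self, token_ids):           # token_ids: (batch, seq_len)
        vectors = self.word_emb(token_ids)  # (batch, seq_len, 300)
        _, h_n = self.gru(vectors)          # h_n: (1, batch, hidden)
        return h_n.squeeze(0)               # phi(y): (batch, 512)

phi = TextualEmbeddingNet()(torch.randint(0, 10000, (2, 7)))
print(phi.shape)  # torch.Size([2, 512])
```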

C. Global Visual Network

The global visual network aims to obtain a vector representation that is a general visual feature over the video frames. We divide the input video frames into N chunks, and one frame is randomly sampled from each chunk. We set N = 20. We extract visual features from the sampled frames using the ResNet-152 pre-trained on the ImageNet dataset [25]. Specifically, we resize the sampled frames to 224 × 224, and the ResNet encodes them, resulting in N 2048-dimensional vectors. Note that we take the vectors from the output of the final pooling layer, immediately before the last fully connected layer of the ResNet. Subsequently, we apply average pooling to the N vectors to merge them. Consequently, we obtain a feature vector ψg(x) ∈ R2048 containing a global visual feature of the video.
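A minimal sketch of this per-frame feature extraction and average pooling, assuming the N frames have already been sampled and resized to 224 × 224; stripping the classification head of torchvision's pre-trained ResNet-152 is one common way to obtain the 2048-dimensional features and is our choice here, not a detail taken from the paper.

```python
import torch
import torchvision.models as models

def global_visual_feature(frames):
    """frames: (N, 3, 224, 224) tensor of sampled frames.
    Returns psi_g(x): a single 2048-d vector averaged over the frames."""
    resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
    backbone = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the fc layer
    backbone.eval()
    with torch.no_grad():
        feats = backbone(frames).flatten(1)   # (N, 2048) per-frame features
    return feats.mean(dim=0)                  # average-pool over the N frames

psi_g = global_visual_feature(torch.randn(20, 3, 224, 224))
print(psi_g.shape)  # torch.Size([2048])
```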

We learn a joint embedding space of the global visual and textual networks. As defined in Eqs. (1) and (2), we use embedding functions to embed ψg(x) and φ(y) into a D-dimensional space.

fg(x) = Wgψg(x) + bg   (1)

gg(y) = W′gφ(y) + b′g   (2)

Here, Wg ∈ R2048×D, W′g ∈ RH×D, and bg, b′g ∈ RD are learnable parameters. We set D = 512 in this paper.

We use cosine similarity to measure the similarity sg(y, x) between the video and sentence in the joint embedding space.

sg(y, x) = fg(x) · gg(y) / ( ‖fg(x)‖ ‖gg(y)‖ )   (3)
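A hedged sketch of Eqs. (1)–(3): two linear layers project ψg(x) and φ(y) into the shared D-dimensional space, and cosine similarity scores the pair. The class and attribute names are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, H = 512, 512

class GlobalJointSpace(nn.Module):
    """Eqs. (1)-(3): f_g(x) = W_g psi_g(x) + b_g,  g_g(y) = W'_g phi(y) + b'_g."""
    def __init__(self):
        super().__init__()
        self.visual_proj = nn.Linear(2048, D)   # W_g, b_g
        self.text_proj = nn.Linear(H, D)        # W'_g, b'_g

    def forward(self, psi_g, phi):
        f_g = self.visual_proj(psi_g)                 # (batch, D)
        g_g = self.text_proj(phi)                     # (batch, D)
        return F.cosine_similarity(f_g, g_g, dim=-1)  # Eq. (3)

space = GlobalJointSpace()
print(space(torch.randn(2, 2048), torch.randn(2, 512)))  # two similarity scores
```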



Fig. 1. Overview of the proposed framework. fg, fs, gg, and gs are embedding functions.

D. Sequential Visual Network

Similar to the global visual network, we divide the input video frames into N chunks and take the first frame of each chunk as the input of the sequential visual network. Then, we use the ResNet-152 to extract a 7 × 7 × 2048-dimensional feature ψs(x) from each frame at the last convolution layer of the ResNet. ψs(x) contains visual features of spatial regions. Since the spatial regions that deserve attention may change with the sentence, we need to explore the relationships between spatial and textual features. Therefore, we incorporate a spatial attention mechanism into the sequential visual network. We apply the attention mechanism to ψs(x) to emphasize spatial regions. Finally, we capture the sequential information of the video using a single-layer Long Short-Term Memory (LSTM) [26] with an H-dimensional hidden state. We denote the vector embedded by the sequential visual network as fs(x) ∈ RD.

The sequential visual network uses the LSTM to capture a meaningful sequential representation of the video. However, the LSTM transforms all spatial details of a video into a flat representation, losing its spatial reasoning with the sentence. Therefore, we employ the spatial attention mechanism in order to obtain a spatial relationship between video and sentence. Inspired by the work [9], we develop the spatial attention mechanism to learn which regions in a frame to attend to for each word in the sentence.

We illustrate the spatial attention mechanism in Fig. 2. The mechanism associates the visual feature ψs(x) with the textual vector φ(y) to produce a spatial attention map a ∈ R7×7. Then, we combine ψs(x) with a to produce the spatially attended feature ψa(x). Formally, ψa(x) and a are defined in Eqs. (5) and (6), respectively.

Specifically, Eq. (4) defines the embedding function of the sequential visual network, where γ represents the LSTM and ⊗ is the element-wise product. p and q are intermediate outputs, which are 512-dimensional vectors.



Fig. 2. The spatial attention mechanism used in the sequential visual network.

We have learnable parameters Wa ∈ R512×(7×7), Wp ∈ R(7×7×2048)×512, Wq ∈ RH×512, ba ∈ R7×7, and bp, bq ∈ R512.

fs(x) = γ(ψa(x))   (4)

ψa(x) = ψs(x) ⊗ a   (5)

a = softmax(tanh(Wa(p + q) + ba))   (6)

p = tanh(Wpψs(x) + bp)   (7)

q = tanh(Wqφ(y) + bq)   (8)

The joint embedding space of the sequential visual and textual networks is formed by fs(x) and gs(y). We measure the similarity ss(x, y) in this joint embedding space. The formulations are described below, where Ws ∈ RH×D and bs ∈ RD are learnable parameters.

gs(y) = Wsφ(y) + bs   (9)

ss(y, x) = fs(x) · gs(y) / ( ‖fs(x)‖ ‖gs(y)‖ )   (10)
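The following sketch puts Eqs. (4)–(10) together for a single video-sentence pair. The reduction of each attended 7 × 7 × 2048 map to one 2048-dimensional vector per frame (an attention-weighted sum over the grid) is our interpretation of the text, not a detail stated in the paper, and all module names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

H = D = 512

class SequentialVisualNet(nn.Module):
    """Sketch of Eqs. (4)-(10): spatial attention over 7x7 ResNet maps, then an LSTM."""
    def __init__(self):
        super().__init__()
        self.W_p = nn.Linear(7 * 7 * 2048, 512)          # Eq. (7)
        self.W_q = nn.Linear(H, 512)                      # Eq. (8)
        self.W_a = nn.Linear(512, 7 * 7)                  # Eq. (6)
        self.lstm = nn.LSTM(2048, H, batch_first=True)    # gamma in Eq. (4)
        self.W_s = nn.Linear(H, D)                        # Eq. (9)

    def forward(self, psi_s, phi):
        # psi_s: (N, 7, 7, 2048) per-frame conv features, phi: (H,) sentence vector
        N = psi_s.size(0)
        p = torch.tanh(self.W_p(psi_s.reshape(N, -1)))        # (N, 512), Eq. (7)
        q = torch.tanh(self.W_q(phi)).expand(N, -1)           # (N, 512), Eq. (8)
        a = F.softmax(torch.tanh(self.W_a(p + q)), dim=-1)    # (N, 49),  Eq. (6)
        psi_a = psi_s * a.view(N, 7, 7, 1)                    # Eq. (5), broadcast over channels
        frame_feats = psi_a.sum(dim=(1, 2))                   # (N, 2048), our pooling choice
        _, (h_n, _) = self.lstm(frame_feats.unsqueeze(0))     # sequence over the N frames
        f_s = h_n.squeeze(0).squeeze(0)                       # (D,), Eq. (4)
        g_s = self.W_s(phi)                                   # Eq. (9)
        return F.cosine_similarity(f_s, g_s, dim=0)           # Eq. (10)

net = SequentialVisualNet()
print(net(torch.randn(20, 7, 7, 2048), torch.randn(512)))
```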

E. Similarity Aggregation

There are many approaches to aggregating similarities. Averaging is a straightforward approach and works well in some cases. However, the average may cause unexpected behaviors if an inaccurate similarity is considerably high or low. Therefore, we adopt the Mixture-of-Experts fusion strategy [27] to aggregate similarities with weights that change according to the input sentence. Consequently, we can emphasize one embedding space using the weights for merging the multiple similarities.

We propose to aggregate the similarities measured in multiple embedding spaces so that we can produce the final similarity s(x, y) with various understandings of videos and sentences. We illustrate the proposed similarity aggregation in Fig. 3. Specifically, we merge the similarities using the weight Wm ∈ R2 generated by considering the textual feature. Wt ∈ RD×2 is a learnable parameter, and concat() is a function that concatenates the given scalar values.

s(x, y) = Wm · concat(sg(x, y), ss(x, y))   (11)

Wm = softmax(Wtφ(y))   (12)
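A minimal sketch of Eqs. (11)–(12), assuming the per-space similarities have already been computed as scalars; the bias-free linear layer standing in for Wt and the module name are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 512

class SimilarityAggregation(nn.Module):
    """Sketch of Eqs. (11)-(12): sentence-conditioned weights fuse the similarities."""
    def __init__(self, num_spaces=2):
        super().__init__()
        self.W_t = nn.Linear(D, num_spaces, bias=False)  # W_t in Eq. (12)

    def forward(self, phi, similarities):
        # phi: (D,) sentence feature; similarities: list of scalar tensors [s_g, s_s, ...]
        w_m = F.softmax(self.W_t(phi), dim=-1)           # Eq. (12)
        s_vec = torch.stack(similarities)                # concat() in Eq. (11)
        return (w_m * s_vec).sum()                       # weighted sum, Eq. (11)

agg = SimilarityAggregation(num_spaces=2)
print(agg(torch.randn(512), [torch.tensor(0.7), torch.tensor(0.4)]))
```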


Fig. 3. Similarity aggregation

F. Optimization

As described in Section III-A, we optimize the proposed architecture by enforcing that the similarity between a video xi and its counterpart sentence yi is greater than the similarities involving mismatched instances, i.e., s(xi, yi) ≥ s(xi, yj) and s(xi, yi) ≥ s(xj, yi). We achieve this by using the triplet ranking loss [28]–[30], where α is a margin.

Ls(xi, yi, yj) = max{0, α − s(xi, yi) + s(xi, yj)}   (13)

Lv(xi, yi, xj) = max{0, α − s(xi, yi) + s(xj, yi)}   (14)

Given a dataset D = {(xi, yi)}_{i=1}^N with N pairs, we optimize the following objective by stochastic gradient descent [31], [32].

L = ∑_{i=1}^N ∑_{j=1}^N ( Ls(xi, yi, yj) + Lv(xi, yi, xj) )   (15)
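A hedged sketch of Eqs. (13)–(15) computed over a mini-batch similarity matrix; restricting the double sum to j ≠ i and the margin value are practical choices on our part rather than details given in the paper.

```python
import torch

def triplet_ranking_loss(sim, margin=0.2):
    """sim: (B, B) matrix where sim[i, j] = s(x_i, y_j); the diagonal holds matched pairs.
    Implements Eqs. (13)-(15) over a mini-batch; the margin alpha is illustrative."""
    B = sim.size(0)
    pos = sim.diag().view(B, 1)                       # s(x_i, y_i)
    loss_s = (margin - pos + sim).clamp(min=0)        # Eq. (13): row i, negative sentence j
    loss_v = (margin - pos.t() + sim).clamp(min=0)    # Eq. (14): pair j, negative video i
    mask = 1.0 - torch.eye(B)                         # drop the i = j terms
    return ((loss_s + loss_v) * mask).sum()           # Eq. (15)

print(triplet_ranking_loss(torch.randn(4, 4)))
```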

IV. EXTENDABILITY

The proposed architecture is extendable with new embedding networks. The steps of the extension are straightforward: we build visual and textual networks and then merge them to form a joint embedding space. In this paper, we add embedding networks using the Inflated 3D Convolutional Neural Network (I3D) model [33] so that the networks can capture video activities. We utilize the pre-trained RGB-I3D model to extract embedding vectors from 16 consecutive frames of video. Consequently, a 1024-dimensional embedding vector fe(x) is produced for each video.

The joint space is learned by using fe(x) and the textual embedding vector ge(y), and the similarity se(x, y) is measured in this joint embedding space. We transform φ(y) using We ∈ RD×1024 and be ∈ R1024. The similarity aggregation is also straightforward to extend: we concatenate all similarities and merge them with a weight W′m ∈ RM,


where M represents the number of embedding spaces; M = 3 in this case.

ge(y) = Weφ(y) + be   (16)

se(x, y) = fe(x) · ge(y) / ( ‖fe(x)‖ ‖ge(y)‖ )   (17)
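To make the extension concrete, here is a small sketch of Eqs. (16)–(17) together with the M-way aggregation for M = 3; the 1024-dimensional input vector stands in for the pre-trained I3D clip feature, which is not computed here, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, M = 512, 3

class I3DSpace(nn.Module):
    """Sketch of Eqs. (16)-(17); f_e(x) is assumed to come from a pre-trained I3D model."""
    def __init__(self):
        super().__init__()
        self.W_e = nn.Linear(D, 1024)   # maps phi(y) into the 1024-d I3D space

    def forward(self, f_e, phi):
        g_e = self.W_e(phi)                          # Eq. (16)
        return F.cosine_similarity(f_e, g_e, dim=0)  # Eq. (17)

# Extended aggregation: softmax weights over M = 3 similarities (s_g, s_s, s_e).
W_t_prime = nn.Linear(D, M, bias=False)
phi = torch.randn(D)
sims = torch.stack([torch.tensor(0.6), torch.tensor(0.3),
                    I3DSpace()(torch.randn(1024), phi)])
w = F.softmax(W_t_prime(phi), dim=-1)
print((w * sims).sum())
```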

We stress that this extendability is vital since it enables us to quickly incorporate other feature extraction models into the framework. There are abundant approaches to extracting features from videos [34]–[43], and various aspects and approaches are necessary for the understanding of videos and sentences.

V. EXPERIMENTS

We carried out the sentence-to-video retrieval task on a benchmark dataset to evaluate the proposed method. The task is to retrieve the video associated with the query sentence from the test videos. We calculated the similarities between the query sentence and all test videos and then ranked the videos by similarity in descending order.

We report the experimental results using rank-based performance metrics, i.e., Recall@k and median rank. Recall@k is the percentage of queries for which the correct video appears in the top-k retrieved results; we set k = 1, 5, 10. Median rank is the median rank of the ground-truth video in the retrieved list. For Recall@k, a bigger value indicates better performance. A lower median rank means the retrieved results are closer to the ground-truth items, and thus indicates better retrieval performance.
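A small sketch of how these two metrics can be computed from a sentence-to-video similarity matrix; the tie-breaking and indexing conventions are ours.

```python
import torch

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """sim: (Q, V) sentence-to-video similarities; ground truth is video i for query i.
    Returns Recall@k (in percent) and the median rank of the ground-truth videos."""
    Q = sim.size(0)
    order = sim.argsort(dim=1, descending=True)        # videos sorted per query
    gt = torch.arange(Q).view(Q, 1)
    ranks = (order == gt).float().argmax(dim=1) + 1     # 1-based rank of the match
    recalls = {f"R@{k}": 100.0 * (ranks <= k).float().mean().item() for k in ks}
    return recalls, ranks.float().median().item()

print(retrieval_metrics(torch.randn(100, 100)))
```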

Following the convention in sentence-to-video retrieval, we used the Microsoft Research Video to Text dataset (MSR-VTT) [44], a large-scale video benchmark for video understanding. MSR-VTT contains 10,000 video clips from YouTube with 41.2 hours of video in total. The dataset provides videos in 20 categories, e.g., music, people, and sports. Each video is associated with 20 different description sentences. The dataset consists of 6,513 videos for training, 497 videos for validation, and 2,990 videos for testing.

We evaluated four variants of the proposed method: the single-space, sequential dual-space, I3D dual-space, and triple-space models. First, the single-space model is the proposed framework composed of the global visual and textual embedding networks. These two networks form a single embedding space, and we measure the final similarity in this single space. Second, the sequential dual-space model (dual-S) uses two embedding spaces learned by the global and sequential visual networks and the textual embedding networks. We measure the final similarity by merging the two similarities, sg and ss, as described in Eq. (11). Third, the I3D dual-space model (dual-I) has the global visual and I3D embedding networks. Lastly, the triple-space model adds the I3D and textual embedding networks to the dual-space model. We produce the final similarity by merging sg, ss, and se with the similarity aggregation.

A. Sentence-to-Video Retrieval Results

We summarized the results of the sentence-to-video retrieval task on the MSR-VTT dataset in Table I.

TABLE I
THE RESULTS OF THE SENTENCE-TO-VIDEO RETRIEVAL TASK ON THE MSR-VTT DATASET. THE BOLD AND UNDERLINED RESULTS REPRESENT THE FIRST- AND THE SECOND-BEST, RESPECTIVELY.

Method          | R@1 | R@5  | R@10 | MR
VSE [28]        | 5.0 | 16.4 | 24.6 | 47
VSE++ [18]      | 5.7 | 17.1 | 24.8 | 65
Multi-Cues [22] | 7.0 | 20.9 | 29.7 | 38
Cat-Feats [45]  | 7.7 | 22.0 | 31.8 | 32
Ours (single)   | 5.6 | 18.4 | 28.3 | 41
Ours (dual-S)   | 7.1 | 19.8 | 31.0 | 30
Ours (triple)   | 6.7 | 21.2 | 32.4 | 29

We compared the proposed method to the existing methods [18], [22], [28], [45]. The proposed method obtained 7.1 (dual-S) at R@1, 21.2 (triple) at R@5, 32.4 (triple) at R@10, and 29 (triple) at MR. These results are competitive with those of [22], [45], which are the state of the art.

VSE and VSE++ adopt a triplet ranking loss, and VSE++ incorporates hard-negative samples into the loss to facilitate practical training [46]. We adopted this strategy. The results show that VSE++ performs significantly better than VSE at every R@k. Although the single-space model and VSE++ adopt similar loss functions, we found slight improvements in performance. Moreover, the dual-space model achieves much better performance than VSE++. The results demonstrate the importance of using the sequential visual information of videos for learning an embedding space.

Multi-Cues [22] calculates two similarities in separate embedding spaces and then averages them to produce a final similarity. The proposed method achieves higher performance than Multi-Cues, and the similarity aggregation strategy is the main difference between the two. Note that the average suffers when aligning videos and sentences due to their variations: some videos need global visual features, and some need sequential visual features. The proposed method has a flexible strategy for merging similarities. The experimental results show that the proposed strategy is more useful for measuring similarity than a naive merging approach that gives equal importance to each embedding space, such as the average.

Cat-Feats [45] embeds videos and sentences by concatenating feature vectors extracted by three embedding modules: a CNN, a bi-directional GRU, and max pooling. Cat-Feats is slightly better than the proposed method at R@1 and R@5, e.g., 7.7 and 7.1 at R@1 for Cat-Feats and dual-S, respectively. In contrast, the proposed method (triple-space) outperforms Cat-Feats at R@10 and median rank, e.g., median ranks of 32 and 29 for Cat-Feats and the triple-space model, respectively. These results imply that the proposed method and Cat-Feats can assist each other. There is a possibility of improving performance by incorporating the feature concatenation mechanism used in Cat-Feats into the proposed method.

Finally, the proposed method with the triple space achieves better results than the single- and dual-space models on three metrics: R@5, R@10, and median rank. Therefore, the results show that integrating multiple similarities can lead to better, more reliable retrieval performance.


TABLE II
EVALUATION ON COMBINATIONS OF EMBEDDING SPACES

Method | Global | Sequential | I3D | R@1 | R@5  | R@10 | MR
single | ✓      |            |     | 5.6 | 18.4 | 28.3 | 41
dual-S | ✓      | ✓          |     | 7.1 | 19.8 | 31.0 | 30
dual-I | ✓      |            | ✓   | 6.8 | 20.2 | 30.6 | 30
triple | ✓      | ✓          | ✓   | 6.7 | 21.2 | 32.4 | 29

TABLE III
EFFECTIVENESS OF THE SPATIAL ATTENTION

Method        | R@1 | R@5  | R@10 | MR
w/o attention | 5.8 | 19.8 | 28.6 | 34
w/ attention  | 7.1 | 19.8 | 31.0 | 30

VI. ABLATION STUDY

We carried out ablation studies to evaluate the components of the proposed method. We conducted the following three experiments.

A. Embedding Spaces

We changed the number of embedding spaces and developed the four variants of the proposed architecture. The experimental results are shown in Table II. There are clear improvements from the single space to multiple spaces on all the metrics. Therefore, we can verify the effectiveness of the proposed method that integrates multiple spaces. Comparing the two dual models, dual-S is better than dual-I at R@1 and R@10, whereas dual-I is superior at R@5. Thus, dual-S and dual-I can complement each other. The triple model contains the embedding spaces of both duals, and it outperforms the single and dual models. Therefore, we can confirm the effectiveness of combining multiple embedding spaces, which is the key insight of the proposed method for video and sentence understanding.

B. Spatial Attention Mechanism

We conducted experiments using dual-S with and without the spatial attention mechanism. The dual-S without attention encodes each frame with the ResNet into a 2048-dimensional vector, and the LSTM then processes the sequence of vectors. Table III shows the results. The dual-S with attention achieves results that are better than or equal to those without attention on all metrics: R@1, R@5, R@10, and median rank. Therefore, we can confirm that the proposed spatial attention mechanism improves performance significantly.

Besides, we observed that the performance of dual-S with attention is almost competitive with dual-I, whereas dual-S without attention is worse than dual-I. Considering that the I3D model extracts useful spatial and temporal features for action recognition, dual-S without attention could not extract sequential features effectively, whereas dual-S with attention obtained such features. We stress that this is further supportive evidence of the effectiveness of the proposed attention mechanism.

TABLE IV
PERFORMANCE COMPARISON ON SIMILARITY AGGREGATION USING THE AVERAGE OR THE PROPOSED WEIGHTED SUM

Method            | R@1 | R@5  | R@10 | MR
dual-S (average)  | 6.6 | 19.9 | 29.9 | 31
dual-S (weighted) | 7.1 | 19.9 | 31.0 | 30
dual-I (average)  | 6.7 | 20.2 | 30.4 | 30
dual-I (weighted) | 6.8 | 20.2 | 30.6 | 30

TABLE V
STATISTICS OF THE WEIGHTS IN THE SIMILARITY AGGREGATION FOR THE GLOBAL AND SEQUENTIAL EMBEDDING SPACES

Space      | Average | Min   | Max
Global     | 0.52    | 0.399 | 0.61
Sequential | 0.48    | 0.393 | 0.60

C. Similarity Aggregation

We investigated the impact of aggregating similarities with the average versus the proposed weighted sum, using dual-S and dual-I for this experiment. The experimental results are shown in Table IV. The weighted sum brings improvements at R@1 and R@10 for both dual-S and dual-I, and at MR for dual-S. Therefore, we confirmed the effectiveness of the proposed similarity aggregation module.

Furthermore, we analyzed the weights in the similarity aggregation. As described in Eq. (12), the weights are determined flexibly according to the given sentence; in other words, a weight represents the importance of an embedding space. We attempted to gain a further understanding by analyzing the weights. For simplicity, we used the dual-S model in this analysis, so the analysis concerns the importance of the global and sequential visual features. We summarize the statistics of the weights in Table V. The statistics show that the proposed method assigns larger weights to the global features on average.

We show the accumulative step histogram of the global weight in Fig. 4. The ratio reaches 0.75 at a weight of 0.5; thus, 75% of the instances received larger weights for the global feature, whereas the weights of the sequential feature are larger only for the remaining 25% of instances. Therefore, the global feature is more critical than the sequential feature, and dual-S aggressively used global features.

Fig. 4. Accumulative step histogram of the weights of the global visual features used in the similarity aggregation.

Figure 5 shows examples of videos and sentences with the assigned weight for the global feature. Videos containing explicit scenes tend to have larger weights on the global visual feature since objects in such videos are relatively easy to detect. On the other hand, videos containing unclear scenes tend to assign larger weights to the sequential visual feature.

Fig. 5. Examples of videos and sentences with the assigned weight for the global feature. The numbers and the sentences represent the weight for the global feature and the query sentences, respectively.

VII. CONCLUSION

In this paper, we presented a novel framework for embedding videos and sentences into multiple embedding spaces. The proposed method uses distinct embedding networks to capture various relationships between visual and textual features, such as global appearance, sequential visual, and action features. We produce the final similarity between a video and a sentence by merging the similarities measured in the embedding spaces in a weighted-sum manner.



The proposed method can flexibly determine the weights according to a given sentence. Hence, the final similarity can incorporate the essential relationship between video and sentence.

We carried out sentence-to-video retrieval experiments on the MSR-VTT dataset and demonstrated that the proposed framework significantly improves performance as the number of embedding spaces increases. The proposed method achieved competitive results compared to the state-of-the-art methods [22], [45]. Furthermore, we verified all the critical components of the proposed method through ablation experiments. Even though the components are individually useful, their cooperation generates significant improvements.

ACKNOWLEDGMENT

This study was partially supported by JSPS KAKENHI Grant Numbers 18K19772 and 19K11848, and by the Yotta Informatics Project by MEXT, Japan.

REFERENCES

[1] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato, and T. Mikolov, “DeViSE: A deep visual-semantic embedding model,” in Advances in Neural Information Processing Systems, 2013, pp. 2121–2129.

[2] M. Engilberge, L. Chevallier, P. Perez, and M. Cord, “Finding beans in burgers: Deep semantic-visual embedding with localization,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 3984–3993.

[3] M. Otani, Y. Nakashima, E. Rahtu, J. Heikkila, and N. Yokoya, “Learning joint representations of videos and sentences with web image search,” in European Conference on Computer Vision (ECCV) Workshops. Springer International Publishing, 2016, pp. 651–667.

[4] L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell, “Localizing moments in video with natural language,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5804–5813.

[5] J. Gao, C. Sun, Z. Yang, and R. Nevatia, “TALL: Temporal activity localization via language query,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5277–5285.

[6] H. Xu, K. He, B. A. Plummer, L. Sigal, S. Sclaroff, and K. Saenko, “Multilevel language and vision integration for text-to-clip retrieval,” in AAAI, 2019, pp. 9062–9069.

[7] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 664–676, 2017.

[8] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, “A multi-view embedding space for modeling internet images, tags, and their semantics,” Int. J. Comput. Vision, vol. 106, no. 2, pp. 210–233, 2014.

[9] Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim, “TGIF-QA: Toward spatio-temporal reasoning in visual question answering,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1359–1367.

[10] K. Wang, Q. Yin, W. Wang, S. Wu, and L. Wang, “A comprehensive survey on cross-modal retrieval,” ArXiv, vol. abs/1607.06215, 2016.

[11] Z. Ji, H. Wang, J. Han, and Y. Pang, “Saliency-guided attention network for image-sentence matching,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 5753–5762.

[12] H. Wu, J. Mao, Y. Zhang, Y. Jiang, L. Li, W. Sun, and W. Ma, “Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6602–6611.

[13] J. Gu, J. Cai, S. Joty, L. Niu, and G. Wang, “Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7181–7189.

[14] Y. Huang, Q. Wu, W. Wang, and L. Wang, “Image and sentence matching via semantic concepts and order learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 3, pp. 636–650, 2020.

[15] R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Unifying visual-semantic embeddings with multimodal neural language models,” ArXiv, vol. abs/1411.2539, 2014.

[16] L. Wang, Y. Li, and S. Lazebnik, “Learning deep structure-preserving image-text embeddings,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5005–5013.

[17] K. Ye and A. Kovashka, “ADVISE: Symbolism and external knowledge for decoding advertisements,” in European Conference on Computer Vision (ECCV), 2018.

[18] F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler, “VSE++: Improving visual-semantic embeddings with hard negatives,” in British Machine Vision Conference (BMVC), 2018, p. 12.

[19] D. Zhang, X. Dai, X. Wang, Y. Wang, and L. S. Davis, “MAN: Moment alignment network for natural language moment retrieval via iterative graph adjustment,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 1247–1257.

[20] Y. H. Tsai, S. Divvala, L. Morency, R. Salakhutdinov, and A. Farhadi, “Video relationship reasoning using gated spatio-temporal energy graph,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 10416–10425.

[21] Y. Song and M. Soleymani, “Polysemous visual-semantic embedding for cross-modal retrieval,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 1979–1988.

[22] N. C. Mithun, J. Li, F. Metze, and A. K. Roy-Chowdhury, “Learning joint embedding with multimodal cues for cross-modal video-text retrieval,” in ACM International Conference on Multimedia Retrieval, 2018, pp. 19–27.

[23] J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” in Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.

[24] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” in NIPS 2014 Workshop on Deep Learning, 2014.

[25] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.

[26] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.

[27] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, no. 1, pp. 79–87, 1991.

[28] R. Kiros, R. Salakhutdinov, and R. Zemel, “Multimodal neural language models,” in International Conference on Machine Learning, vol. 32, no. 2, 2014, pp. 595–603.

[29] G. Chechik, V. Sharma, U. Shalit, and S. Bengio, “Large scale online learning of image similarity through ranking,” J. Mach. Learn. Res., vol. 11, pp. 1109–1135, 2010.

[30] A. Frome, Y. Singer, F. Sha, and J. Malik, “Learning globally-consistent local distance functions for shape-based image retrieval and classification,” in IEEE International Conference on Computer Vision, 2007, pp. 1–8.

[31] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, “Grounded compositional semantics for finding and describing images with sentences,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 207–218, 2014.

[32] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler, “Aligning books and movies: Towards story-like visual explanations by watching movies and reading books,” in IEEE International Conference on Computer Vision, 2015, pp. 19–27.

[33] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the kinetics dataset,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4724–4733.

[34] Z. Qiu, T. Yao, and T. Mei, “Learning spatio-temporal representation with pseudo-3D residual networks,” in IEEE International Conference on Computer Vision, 2017, pp. 5534–5542.

[35] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.

[36] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in European Conference on Computer Vision, 2016, pp. 20–36.

[37] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.

[38] C. Feichtenhofer, H. Fan, J. Malik, and K. He, “SlowFast networks for video recognition,” in IEEE/CVF International Conference on Computer Vision, 2019, pp. 6201–6210.

[39] S. Zhang, S. Guo, W. Huang, M. R. Scott, and L. Wang, “V4D: 4D convolutional neural networks for video-level representation learning,” in International Conference on Learning Representations, 2020.

[40] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in International Conference on Neural Information Processing Systems, 2014, pp. 568–576.

[41] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, “Learning realistic human actions from movies,” in IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.

[42] N. Dalal, B. Triggs, and C. Schmid, “Human detection using oriented histograms of flow and appearance,” in European Conference on Computer Vision, 2006, pp. 428–441.

[43] H. Wang and C. Schmid, “Action recognition with improved trajectories,” in IEEE International Conference on Computer Vision, 2013, pp. 3551–3558.

[44] J. Xu, T. Mei, T. Yao, and Y. Rui, “MSR-VTT: A large video description dataset for bridging video and language,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5288–5296.

[45] J. Dong, X. Li, C. Xu, S. Ji, Y. He, G. Yang, and X. Wang, “Dual encoding for zero-example video retrieval,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9338–9347.

[46] A. Shrivastava, A. Gupta, and R. Girshick, “Training region-based object detectors with online hard example mining,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 761–769.