
Deep Robust Multilevel Semantic Cross-Modal Hashing

Ge Song 1,2,3, Jun Zhao 4, Xiaoyang Tan 1,2,3∗

1 College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics
2 MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, China
3 Collaborative Innovation Center of Novel Software Technology and Industrialization, China
4 Nanyang Technological University, Singapore
{sunge, x.tan}@nuaa.edu.cn, [email protected]

Abstract

Hashing-based cross-modal retrieval has recently made significant progress. But straightforwardly embedding data from different modalities involving rich semantics into a joint Hamming space will inevitably produce false codes due to the intrinsic modality discrepancy and noise. We present a novel Robust Multilevel Semantic Hashing (RMSH) for more accurate multi-label cross-modal retrieval. It seeks to preserve fine-grained similarity among data with rich semantics, i.e., multi-label data, while explicitly requiring the distances between dissimilar points to be larger than a specific value for strong robustness. For this, we give an effective bound on this value based on an information coding-theoretic analysis, and the above goals are embodied in a margin-adaptive triplet loss. Furthermore, we introduce pseudo-codes via fusing multiple hash codes to explore seldom-seen semantics, alleviating the sparsity problem of similarity information. Experiments on three benchmarks show the validity of the derived bounds, and our method achieves state-of-the-art performance.

1 Introduction

Cross-modal retrieval, which aims to search for similar instances in one modality given a query from another, has gained increasing attention due to its fundamental role in large-scale multimedia applications [1; 2; 3; 4]. The difficulty of measuring the similarity of data from different modalities makes this task very challenging, which is known as the heterogeneity gap [5]. An essential idea for bridging this gap is mapping different modalities into a joint feature space such that they become computationally comparable, and it is widely exploited by previous work [6; 7; 8]. Among them, hashing-based methods [9; 10], which embed the data of interest into a low-dimensional Hamming space, have gradually become the mainstream approach due to the low memory usage and high query speed of binary codes.

While recent works have made significant progress, there are still several drawbacks in existing cross-modal hashing

∗ Corresponding author

Figure 1 (panels: (a) Conflict Binary Semantic Similarity-Preserved Codes; (b) Robust Multilevel Semantic Similarity-Preserved Codes): Illustration of the robust multilevel semantic similarity-preserved codes for multi-label cross-modal data (images and sentences tagged with apple, orange, and banana hashed into a 3-bit Hamming space). The points filled with red color in the Hamming space denote the false codes caused by the intrinsic noise of the data.

methods. First, embedding data of different modalities that share the same semantics into unified hash codes is hard, since the inherent modality discrepancy and noise inevitably cause false codes. Nevertheless, most approaches learn to hash straightforwardly with little consideration of this problem. They tend to embed data of different semantics into adjacent vertices of the Hamming space, which dramatically increases the collision probability between correct and false hash codes. Although some works [11] mention this problem and introduce the bit balance constraint [12] to maximize the information provided by each bit, this constraint is too simple and brings the burden of seeking a proper hyper-parameter for effective learning. Second, queries in real applications are often complicated and involve rich semantics, e.g., multiple labels, but much existing work cannot answer such queries with results that are consistent with human judgments of semantic similarity. These methods [11; 13] focus on preserving simple similarity structures (i.e., similar or dissimilar) rather than more fine-grained ones, and the similarity information they use is often very sparse. We illustrate these problems in Fig. 1 (a): the image and text data are associated with multiple tags, so their similarity score is not just 0 or 1. If hashing is learned to keep only 0-1 similarity without proper constraints, i.e., pushing away the dissimilar points and pulling together the similar points, conflicting codes will occur, e.g., '101'. Besides, the data tagged 'apple, banana' is more similar than the data tagged 'orange' to the data tagged 'apple, orange, banana', but the hash codes of the former two have the same Hamming distance to the latter's code.

We observe that if the representation ability of binary codes with fixed length is adequate, hashing functions should attempt to preserve the complete fine-grained similarity structure for more accurate retrieval. Simultaneously, we can explicitly require the distances between codes whose similarities are zero to be greater than or equal to a specific value δ, to make the learned hash codes more robust. We call δ the robust parameter and the learned codes robust multilevel semantic similarity-preserved codes. As shown in Fig. 1 (b), we hash multi-label cross-modal data into a 3-bit Hamming space according to their subtle similarity, so that the distance between the codes of irrelevant data is at least 3 (i.e., the 'orange' and 'apple, banana' data) and the distance between the codes of the more relevant data (i.e., the 'apple, banana' and 'orange, apple, banana' data) is smaller than that between the codes of normally relevant data (i.e., the 'orange' and 'orange, apple, banana' data); then no conflict happens. Intuitively, the larger δ is, the more robust the learned codes are, and the more levels of semantic similarity can be encoded. We could embody this constraint in the objective of hash learning. However, choosing a δ that is too large is not practical due to the limited coding power of length-fixed binary codes and the uncertainty of the hashing process. The question thus becomes how to find an appropriate δ. In theory, as the Hamming space and the semantic similarity information are fixed, the assumption that all data are hashed correctly, i.e., no false codes, helps reduce uncertainty and eases the derivation of the range of effective δ.

Here, we briefly describe our answer to the above question under the previous assumption. We would like to encode the semantic and similarity information of the data into K-bit binary codes. The maximum number of K-bit codes with minimum pairwise Hamming distance δ is fixed. According to information coding theory, the logarithm of this number should be larger than the amount of semantic information, and δ bits should be able to encode the neighbor similarity information of each point. Based on these facts, we derive the bounds of a proper δ and detail the process in Sec. 3.2.

Inspired by the above observation, we propose a novel deep Robust Multilevel Semantic Hashing (RMSH), which learns hash functions by preserving the complete multilevel semantic similarity structure of multi-label cross-modal data under a theoretically guaranteed distance constraint between dissimilar data. For this, a margin-adaptive triplet loss is adopted to explicitly control the distance of dissimilar points in Hamming space while embedding similar points at a fine-grained level. To alleviate the sparsity problem of similarity information, we further propose fusing multiple hash codes at the semantic level to generate pseudo-codes that explore seldom-seen semantics. The main contributions are summarized as follows.

• A novel hashing method, named RMSH, is proposed to learn multilevel semantic similarity-preserved codes for accurate multi-label cross-modal retrieval. To exploit the finite Hamming space and improve the robustness of the learned codes, we require the distance between codes of dissimilar points to be larger than a specific value. For more effective learning, we perform a theoretical analysis of this value and derive its effective range.

• To capture the fine-grained semantic similarity structure in coupling with the elaborated distance constraint, we present a margin-adaptive triplet loss. Moreover, a new pseudo-codes network is introduced, tailored to explore rarer and more complicated similarity structures.

• Extensive experimental results demonstrate the effectiveness of the derived bounds, and the proposed RMSH approach yields state-of-the-art retrieval performance on three cross-modal datasets.

2 Related work

In this section, we briefly review related works concerning cross-modal hashing methods.

Cross-modal hashing [14; 15; 16; 17; 18; 19] can be grouped into two types, unsupervised and supervised. The former utilizes the co-occurrence information of multi-modal pairs (e.g., image-text) to maximize their correlation in the common Hamming space. A representative method is Collective Matrix Factorization Hashing (CMFH) [20], which generates unified hash codes for multiple modalities by performing collective matrix factorization from different views. Supervised methods aim to preserve semantic similarity. Semantic Correlation Maximization (SCM) [21] uses hash codes to reconstruct the semantic similarity matrix. Semantics-Preserving Hashing (SePH) [22] minimizes the KL-divergence between the hash code and semantics distributions. Recently, the success of deep learning has prompted the development of cross-modal hashing. Cross-Modal Deep Variational Hashing (CMDVH) [23] infers fused binary codes from multi-modal data as latent variables for modality-specific networks to approximate. Self-supervised Adversarial Hashing (SSAH) [24] incorporates adversarial learning into cross-modal hashing. Equally-Guided Discriminative Hashing (EGDH) [10] jointly considers semantic structure and discriminability to learn hash functions. Similar to CMDVH, Separated Variational Hashing Networks (SVHNs) [13] learns binary codes of semantic labels as latent variables. Despite their effectiveness, most of these methods ignore the limited representational capacity of binary codes when reducing the impact of the modality gap, and they fail to preserve fine-grained similarity for more accurate multi-label cross-modal retrieval.

3 The proposed approach

In this section, we first give the problem definition and the effective bounds of the robust parameter defined in Sec. 1. Then, the configurations of the proposed RMSH are detailed, including two hashing networks, a pseudo-codes network, and the corresponding objective function. An overview of RMSH is illustrated in Fig. 2.

Figure 2: The architecture of the proposed RMSH includes two modality-specific hashing networks and a pseudo-codes network (PCN). RMSH first feeds multi-modal triplets into the hashing networks to obtain hash codes, which are then manipulated by the PCN to generate codes with seldom-seen semantics. Finally, a margin-adaptive triplet loss equipped with the pre-computed robust parameter δ is coupled with the weighted classification loss and the quantization loss to learn robust multilevel semantic similarity-preserved codes.

3.1 The Problem Definition

Without loss of generality, we use bi-modality (e.g., image and text) retrieval for illustration in this paper. Assume that we are given a multi-modal dataset $D = \{X, Y, L\}$, where $X = \{x_i\}_{i=1}^N$, $Y = \{y_i\}_{i=1}^N$, $L = \{l_i\}_{i=1}^N$ denote the image modality, the text modality, and the corresponding semantic labels, respectively. $x_i$ can be features or raw pixels of an image, $y_i$ is a textual description or a set of tags, and the semantic label $l_i \in \{0, 1\}^C$, where $C$ is the number of classes. For any two samples, we define the cross-modal similarity $S$ as $S_{ij} = \frac{|l_i \cap l_j|}{\max\{|l_i|, |l_j|\}}$. Given training data $X$, $Y$ and $S$, our goal is to learn two hash functions: $h^{(x)}(x) = b^{(x)} \in \{-1, 1\}^K$ for the image modality and $h^{(y)}(y) = b^{(y)} \in \{-1, 1\}^K$ for the text modality, where $K$ is the length of the binary code, such that the similarity relationship $S$ is preserved and the distances between dissimilar data are greater than or equal to a positive integer $\delta$ ($< K$) for robustness, i.e., $\forall\, x_i \in X$, $y_j, y_k \in Y$, if $S_{ij} \geq S_{ik}$, then the Hamming distances of their binary codes should satisfy $d_H(b^{(x)}_i, b^{(y)}_j) \leq d_H(b^{(x)}_i, b^{(y)}_k)$, and vice versa; if $S_{ij} = 0$, then $d_H(b^{(x)}_i, b^{(y)}_j) \geq \delta$. $\delta$ is named the robust parameter. Intuitively, a larger $\delta$ makes the codes more robust, but it cannot be too large due to the finite representation power of $K$-bit binary codes. In what follows, we investigate how to set this parameter properly to balance robustness and compactness.
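To make the multilevel similarity concrete, the following minimal numpy sketch computes $S$ for a toy label matrix; the function name and the toy tag set are illustrative and not part of the original formulation beyond the definition of $S_{ij}$ above.

    import numpy as np

    def cross_modal_similarity(labels):
        """Multilevel similarity S_ij = |l_i ∩ l_j| / max(|l_i|, |l_j|) for 0/1 label vectors."""
        labels = np.asarray(labels, dtype=np.float64)   # (N, C) binary label matrix
        inter = labels @ labels.T                        # |l_i ∩ l_j|
        sizes = labels.sum(axis=1)                       # |l_i|
        denom = np.maximum(sizes[:, None], sizes[None, :])
        return inter / np.maximum(denom, 1.0)            # guard against empty label sets

    # toy example with tags {apple, orange, banana}
    L = np.array([[1, 0, 1],    # apple, banana
                  [0, 1, 0],    # orange
                  [1, 1, 1]])   # apple, orange, banana
    print(cross_modal_similarity(L))
    # S[0,2] = 2/3 > S[1,2] = 1/3: 'apple, banana' is more similar to the full set than 'orange'.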

3.2 Bounds of Effective Robust Parameter

To learn robust hash codes effectively, we give the range of the effective robust parameter δ in this section. In essence, the goal of deep supervised hashing is to encode the semantic information $H(\boldsymbol{L})$ and the similarity information $H(\boldsymbol{S}_{*,:})$ of the data¹ with K-bit binary codes, where $H$ denotes the entropy function. Therefore, according to coding theory [25], we have the following two facts. (1) To make sure that different semantics have unique codes and the distances between the codes of irrelevant samples are greater than or equal to δ, the number of K-bit binary codes satisfying pairwise minimum distance δ should be larger than $2^{H(\boldsymbol{L})}$. (2) δ bits should be able to encode the neighborhood semantic similarity information $H(\boldsymbol{S}_{*,:})$ of each sample $*$. Based on these two facts, we derive the bounds of the effective δ as follows. For simplicity's sake, we assume that all data sharing the same semantics are embedded into the same codes, to eliminate the influence of noise.

¹ The semantic label l of the data can be seen as an i.i.d. random variable drawn from a distribution P(L). Because S is constructed from l, each row $\boldsymbol{S}_{*,:}$ of S can also be seen as a random variable.

1) Upper bound. By fact (1), we have:

$$2^{H(\boldsymbol{L})} \leq A(K, \delta) \quad (1)$$

where $A(K, \delta)$ denotes the maximum number of K-bit binary codes with pairwise minimum Hamming distance δ, i.e., $A(K, \delta) = |\{b \mid b \in \{-1, 1\}^K,\ \forall\, b_i \neq b_j,\ d_H(b_i, b_j) \geq \delta\}|$. To obtain the range of the effective δ, we need to estimate $A(K, \delta)$. However, an accurate estimation of $A(K, \delta)$ is challenging and remains unsolved. Alternatively, its bounds are well studied [26; 27; 28]. Towards our goal, we introduce a well-known lower bound of $A(K, \delta)$ to derive the upper bound of δ.

Lemma 1. (Gilbert-Varshamov Bound [29]). Given $\delta, K \in \mathbb{N}$ such that $\delta \leq K$, we have:

$$\frac{2^K}{\sum_{i=0}^{\delta-1} \binom{K}{i}} \leq A(K, \delta) \quad (2)$$

To prove Lemma 1, we first give the definition of a Hamming ball and then give the proof.

Definition 1. (Hamming Ball). Let $\delta, K \in \mathbb{N}$ such that $\delta \leq K$. For $\forall\, x \in \Omega = \{-1, 1\}^K$, the ball $B^K_\delta(x)$ denotes the set of vectors whose distance from x is less than or equal to δ, and is defined as $B^K_\delta(x) = \{y \in \Omega \mid d_H(y, x) \leq \delta\}$. Its volume is defined as $|B^K_\delta(x)| = \sum_{i=0}^{\delta} \binom{K}{i}$, written $|B^K_\delta|$ for short.

Proof. Let $\Omega(\delta)$ denote the largest subset of $\Omega = \{-1, 1\}^K$ such that $\forall\, c, c' \in \Omega(\delta)$, $d_H(c, c') \geq \delta$, so that $A(K, \delta) = |\Omega(\delta)|$, and let $\varepsilon = \delta - 1$. For $\forall\, y \in \Omega$, there exists at least one $c \in \Omega(\delta)$ such that $d_H(y, c) \leq \varepsilon$; otherwise y could be added to $\Omega(\delta)$, contradicting its maximality. Hence, $\Omega$ is contained in the union of all balls $B^K_\varepsilon(c)$, $c \in \Omega(\delta)$, that is $\Omega \subseteq \bigcup_{c \in \Omega(\delta)} B^K_\varepsilon(c)$. Then we have $|\Omega| \leq \sum_{c \in \Omega(\delta)} |B^K_\varepsilon(c)| = |\Omega(\delta)| \cdot |B^K_\varepsilon|$, thus $A(K, \delta) = |\Omega(\delta)| \geq \frac{|\Omega|}{|B^K_\varepsilon|} = \frac{2^K}{\sum_{i=0}^{\delta-1} \binom{K}{i}}$.
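For intuition about how quickly this lower bound grows, a minimal Python sketch evaluating Eq. (2) for a few (K, δ) pairs follows; the helper name is illustrative.

    from math import comb

    def gv_lower_bound(K: int, delta: int) -> float:
        """Gilbert-Varshamov lower bound on A(K, delta): 2^K / sum_{i < delta} C(K, i)."""
        ball = sum(comb(K, i) for i in range(delta))   # |B^K_{delta-1}|
        return 2.0 ** K / ball

    for K, delta in [(16, 4), (64, 16), (128, 32)]:
        print(K, delta, f"{gv_lower_bound(K, delta):.3e}")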

Page 4: arXiv:2002.02698v1 [cs.CV] 7 Feb 2020 · arXiv:2002.02698v1 [cs.CV] 7 Feb 2020. or equal than 3, i.e., the ’red’ and ’green’ semantics, and no conflict happens. Intuitively,

As we can see, if $2^{H(\boldsymbol{L})}$ is smaller than the lower bound of $A(K, \delta)$, Eq. (1) also holds. By Lemma 1, we thus have:

$$2^{H(\boldsymbol{L})} \leq \frac{2^K}{\sum_{i=0}^{\delta-1} \binom{K}{i}} \quad (3)$$

Still, the computation of $\sum_{i=0}^{\delta-1} \binom{K}{i}$ in Eq. (3) is hard. Here, we show an upper bound of this value via the entropy bound on the volume of a Hamming ball.

Lemma 2. (Volume of a Hamming ball). Given $\delta, K \in \mathbb{N}$ and $\delta = pK$, $p \in [0, 1/2]$, we have:

$$|B^K_\delta| \leq 2^{H(p)K} \quad (4)$$

where $H(p) = -p \log p - (1 - p) \log(1 - p)$.

Proof.

$$\frac{|B^K_\delta|}{2^{H(p)K}} = \frac{\sum_{i=0}^{\delta} \binom{K}{i}}{p^{-pK} (1-p)^{-(1-p)K}} = \sum_{i=0}^{pK} \binom{K}{i} (1-p)^K \Big(\frac{p}{1-p}\Big)^{pK} \leq \sum_{i=0}^{pK} \binom{K}{i} (1-p)^K \Big(\frac{p}{1-p}\Big)^{i} \quad \Big(\text{since } \frac{p}{1-p} \leq 1,\ i \leq pK\Big)$$
$$= \sum_{i=0}^{pK} \binom{K}{i} (1-p)^{K-i} p^{i} \leq 1 \quad (\text{binomial theorem})$$
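As a quick numerical sanity check of Lemma 2, one can compare the exact ball volume with the entropy bound; the following minimal Python sketch is our own illustration.

    from math import comb, log2

    def ball_volume(K: int, delta: int) -> int:
        """|B^K_delta| = sum_{i <= delta} C(K, i)."""
        return sum(comb(K, i) for i in range(delta + 1))

    def entropy_bound(K: int, delta: int) -> float:
        """2^{H(p)K} with p = delta / K (Lemma 2)."""
        p = delta / K
        if p in (0.0, 1.0):
            return 1.0
        h = -p * log2(p) - (1 - p) * log2(1 - p)
        return 2.0 ** (h * K)

    K, delta = 128, 32
    print(ball_volume(K, delta) <= entropy_bound(K, delta))   # True whenever p <= 1/2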

By Lemma 2 and Eq. (3), we can obtain a computable lower bound of $A(K, \delta)$. With this bound, we give the theorem about the upper bound of the effective δ in our problem.

Theorem 1. For information $H(\boldsymbol{L})$, if K-bit binary codes are used to encode $l \in \boldsymbol{L}$ such that for $\forall\, l_i \neq l_j$, $d_H(b_i, b_j) \geq \delta$, $\delta \leq K/2$, then δ should satisfy:

$$H\Big(\frac{\delta - 1}{K}\Big) \leq 1 - \frac{H(\boldsymbol{L})}{K} \quad (5)$$

Proof. By Lemma 2, and to make Eq. (3) always hold, we have:

$$2^{H(\boldsymbol{L})} \leq \frac{2^K}{2^{H(\frac{\delta-1}{K})K}} \leq \frac{2^K}{\sum_{i=0}^{\delta-1} \binom{K}{i}} \leq A(K, \delta)$$

whence $H(\frac{\delta-1}{K}) \leq 1 - \frac{H(\boldsymbol{L})}{K}$.

In experiments, we can estimate $H(\boldsymbol{L}) = H(\boldsymbol{L}_1, \boldsymbol{L}_2, \ldots, \boldsymbol{L}_C) = \sum_{i=1}^{C} H(\boldsymbol{L}_i)$ by assuming independence between tags for simplicity, with each tag $\boldsymbol{L}_i$ following a Bernoulli distribution, i.e., $\boldsymbol{L}_i \sim B(\theta_i)$. The estimate of $\theta_i$ can be $\theta_i = \frac{1}{N} \sum_{n=1}^{N} I(l_{n,i} = 1)$, $l \in L$, where $I$ is the indicator function. Then, given the code length K and δ, we can check whether the value of δ exceeds the effective range. In our experiments, we found that the upper bound of the effective δ is always less than K/2, which is the setting preferred by previous works².

² Bit balance, firing each bit as 1 or -1 with 50% probability, is the special case of δ = K/2; it can be derived from the Maximum Entropy Principle but ignores the limited coding ability of the codes.

2) Lower bound. Let $H_*$ denote $H(\boldsymbol{S}_{*,:})$. By fact (2) we have $\delta \geq \max_{*=1:N} H_*$. However, as $S$ is sparse, we can relax $\delta \geq \max_{*=1:N} H_*$ by requiring δ to be larger than most $H_*$, so that δ bits can encode the semantic neighbors of each sample with a certain probability. By Chebyshev's inequality, we have:

$$P\{H_* \leq \delta\} \geq 1 - \frac{D(H_*)}{(\delta - E(H_*))^2} \quad (6)$$

where $E(H_*) = \frac{1}{N} \sum_{i=1}^{N} H_i$ and $D(H_*) = \frac{1}{N} \sum_{i=1}^{N} (H_i - E(H_*))^2$. If $P\{H_* \leq \delta\} = p$, $1 > p > 1/2$, we have:

$$\delta \geq \sqrt{\frac{D(H_*)}{1 - p}} + E(H_*) \quad (7)$$

We estimate $H_i = -\sum_{j=1}^{|l_i|} p(s_i = \frac{j}{|l_i|}) \log p(s_i = \frac{j}{|l_i|}) \leq |l_i|$, with $p(s_i = \frac{j}{|l_i|}) = \frac{\sum_{*} I(S_{i,*} = j/|l_i|)}{\sum_{*} I(|l_i \cap l_*| = S_{i,*} \cdot |l_i|)}$, $s_i \in S_{i,:}$. For simplicity of computation, we substitute $H_i$ with $|l_i|$ in experiments.

Combining Eqs. (5) and (7), we obtain the final bounds of the effective robust parameter δ. In practice, we can pre-determine the range of high-quality δ by substituting the corresponding values into Eqs. (5) and (7) given the training set information, thereby avoiding the burden of searching for a proper value for effective learning. It is worth noting that the performance of different δ within the bounds does not differ significantly, so any value within the bounds works in practice. Furthermore, if we want to find the optimal δ, our theoretical results guarantee that performance within the bounds is better than outside them, which significantly reduces the search range of the optimal hyper-parameter. Therefore, setting a specific value within the bounds is not difficult. The experimental results in Sec. 4.4 demonstrate the effectiveness of the derived bounds.
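Putting Eqs. (5) and (7) together, a minimal numpy sketch of how the effective range of δ could be pre-computed from a training label matrix is given below, assuming the simplifications stated above (per-tag Bernoulli entropies for H(L), and |l_i| in place of H_i); the helper names are illustrative, not code from this work.

    import numpy as np

    def binary_entropy(p):
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    def delta_upper_bound(labels, K):
        """Largest delta <= K/2 satisfying Eq. (5): H((delta-1)/K) <= 1 - H(L)/K."""
        theta = np.asarray(labels, dtype=float).mean(axis=0)   # per-tag frequency
        H_L = binary_entropy(theta).sum()                      # H(L) under tag independence
        best = 1
        for delta in range(1, K // 2 + 1):
            if binary_entropy((delta - 1) / K) <= 1 - H_L / K:
                best = delta
        return best

    def delta_lower_bound(labels, p=0.9):
        """Eq. (7): delta >= sqrt(D(H_*)/(1-p)) + E(H_*), with H_i approximated by |l_i|."""
        H = np.asarray(labels, dtype=float).sum(axis=1)        # |l_i| as a proxy for H_i
        return np.sqrt(H.var() / (1 - p)) + H.mean()

    # toy usage: 1000 random multi-labels over 10 classes, K = 128 bits
    rng = np.random.default_rng(0)
    labels = (rng.random((1000, 10)) < 0.2).astype(int)
    print(delta_lower_bound(labels), delta_upper_bound(labels, K=128))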

3.3 Network Architecture

The architecture of RMSH is illustrated in Fig. 2. It seamlessly integrates three parts: the image hashing network, the text hashing network, and the pseudo-codes network.

Deep Hashing Networks. We build the image and text hashing modules as two deep neural networks by adding two fully-connected layers with a tanh activation on top of the feature layer of commonly-used modality-specific deep models, e.g., the CNN model ResNet for the image modality and BOW or sentence2vector for the text modality. These two layers act as hash functions and transform the features $x_i, y_i$ into binary-like codes $z^{(x)}_i, z^{(y)}_i \in \mathbb{R}^K$. Then, we obtain the hash codes of modality $*$ by $b^{(*)}_i = \mathrm{sgn}(z^{(*)}_i)$, where $\mathrm{sgn}(z)$ is the sign function.

Pseudo-codes Network (PCN). The input data are always imbalanced due to the sparsity of multi-labels, which makes it difficult to capture complicated similarity structures across modalities. To handle this problem, inspired by [30], we propose a pseudo-codes network that manipulates the binary-like codes at the semantic level to generate codes of rare semantics, which are mixed with the original codes to explore complicated similarity structures. We define two types of code operations, i.e., union and intersection, as two fully-connected networks:

$$z_{1 \oplus 2} = f_u(z_1, z_2) \equiv \tanh(W_u[z_1, z_2]), \qquad z_{1 \otimes 2} = f_t(z_1, z_2) \equiv \tanh(W_t[z_1, z_2]) \quad (8)$$

where $[\cdot, \cdot]$ denotes the concatenation operation and $W_u, W_t \in \mathbb{R}^{K \times 2K}$ are weights to be learned. $z_{1 \oplus 2} = f_u(z_1, z_2)$ carries the label $l_u = l_1 \oplus l_2 = l_1 \cup l_2$, and $z_{1 \otimes 2} = f_t(z_1, z_2)$ carries the label $l_t = l_1 \otimes l_2$, defined as $l_1 \cap l_2$.
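A minimal TensorFlow sketch of the two fusion operations in Eq. (8), assuming plain fully-connected layers with tanh and no bias; the class and variable names are illustrative, not taken from a released implementation.

    import tensorflow as tf

    class PseudoCodeNetwork(tf.keras.layers.Layer):
        """Fuses two K-dim binary-like codes into 'union' and 'intersection' pseudo-codes (Eq. (8))."""

        def __init__(self, code_len: int):
            super().__init__()
            # W_u, W_t in R^{K x 2K}; tanh keeps the outputs binary-like in (-1, 1)
            self.f_union = tf.keras.layers.Dense(code_len, activation="tanh", use_bias=False)
            self.f_inter = tf.keras.layers.Dense(code_len, activation="tanh", use_bias=False)

        def call(self, z1, z2):
            pair = tf.concat([z1, z2], axis=-1)        # concatenation [z1, z2]
            return self.f_union(pair), self.f_inter(pair)

    # toy usage: two batches of 64-bit binary-like codes
    pcn = PseudoCodeNetwork(code_len=64)
    z1, z2 = tf.random.normal([8, 64]), tf.random.normal([8, 64])
    z_union, z_inter = pcn(z1, z2)
    print(z_union.shape, z_inter.shape)   # (8, 64) (8, 64)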

3.4 Objective Function

To preserve multilevel semantic similarity and fully exploit the space between dissimilar points in the Hamming space, we formulate the goal in Sec. 3.1 as follows: for $\forall\, b_i, b_j, b_k$, 1) if $S_{i,j} > S_{i,k} > 0$, then $d_H(b_i, b_k) - d_H(b_i, b_j) \geq (S_{i,j} - S_{i,k}) \cdot \delta$; 2) if $S_{i,j} = 0$, then $d_H(b_i, b_j) \geq \delta$. The first goal preserves the ranking information of similar points via an adaptive margin. The second goal keeps a minimum distance between dissimilar points. We couple these two goals in a margin-adaptive triplet scheme, that is:

$$L_t(b_i, b_j, b_k) = y_{i,j} \cdot y_{i,k} \cdot [d_H(b_i, b_j) - d_H(b_i, b_k) + \alpha]_+ + (1 - y_{i,j})[\delta - d_H(b_i, b_j)]_+ + (1 - y_{i,k})[\delta - d_H(b_i, b_k)]_+ \quad (9)$$

where $[\cdot]_+ = \max\{0, \cdot\}$, $y_{i,j} = I(S_{i,j} > 0)$, $y_{i,k} = I(S_{i,k} > 0)$, $I$ is the indicator function, $\alpha = \delta(S_{i,j} - S_{i,k})$ adapts to the discrepancy of similarity levels, and δ controls the distance between dissimilar points. Given a triplet of multi-modal data $\{x_{1,2,3}, y_{1,2,3}\}$ and their labels $l_{1,2,3}$, we can obtain five hash codes for each modality, i.e., $b_{1,2,3}$, together with $b_4 = \mathrm{sgn}(f_u(z_1, z_2))$ and $b_5 = \mathrm{sgn}(f_t(z_1, z_2))$. Note that in principle we can generate more pseudo-codes using the PCN, but for simplicity and w.l.o.g., only a set involving two codes is used here. We partition them into the intra-modality triplets $\{b_{1,2,3}\}, \{b_{1,2,4}\}, \{b_{1,2,5}\}$, and the inter-modality triplets $\{b^{(y)}_1, b^{(x)}_{2,3}\}, \{b^{(y)}_2, b^{(x)}_{1,3}\}, \{b^{(y)}_3, b^{(x)}_{1,2}\}$, each of which reflects some aspect of the similarity structure and corresponds to a separate loss term in the final loss.

To mine the effective information in each triplet quickly, we select the point $b_*$ with the most tags from each triplet as the reference point and re-order the other points as $\{b_*, b_{*,1}, b_{*,2}\}$, where $b_{*,1}$ is the one more similar to $b_*$ than the other point $b_{*,2}$. Then, we substitute them into Eq. (9) and obtain the triplet loss used as $L_t(b_{1,2,3}) = L_t(b_*, b_{*,1}, b_{*,2})$, where $\alpha = \frac{|l_* \cap l_{*,1}| - |l_* \cap l_{*,2}|}{|l_*|} \cdot \delta$.
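For illustration, a minimal numpy sketch of the margin-adaptive triplet loss of Eq. (9) on relaxed codes, with the squared Euclidean distance standing in for the Hamming distance as in Sec. 3.5; the function name and toy values are ours, not released code.

    import numpy as np

    def margin_adaptive_triplet_loss(z_i, z_j, z_k, s_ij, s_ik, delta):
        """Eq. (9) on relaxed codes: squared Euclidean distance is a surrogate for d_H."""
        d_ij = np.sum((z_i - z_j) ** 2)
        d_ik = np.sum((z_i - z_k) ** 2)
        y_ij, y_ik = float(s_ij > 0), float(s_ik > 0)
        alpha = delta * (s_ij - s_ik)                            # margin adapts to the similarity gap
        loss = y_ij * y_ik * max(0.0, d_ij - d_ik + alpha)       # ranking among similar points
        loss += (1 - y_ij) * max(0.0, delta - d_ij)              # push dissimilar pairs past delta
        loss += (1 - y_ik) * max(0.0, delta - d_ik)
        return loss

    # toy usage: 16-bit relaxed codes, anchor more similar to z_j (s=0.6) than to z_k (s=0.2)
    rng = np.random.default_rng(0)
    z = rng.uniform(-1, 1, size=(3, 16))
    print(margin_adaptive_triplet_loss(z[0], z[1], z[2], s_ij=0.6, s_ik=0.2, delta=16))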

The final triplet loss for one modality is defined as:

$$L_{tr}(b^{(x)}_{1,2,3}) = \lambda_1 \Big(\sum_{i=3}^{5} L_t(b^{(x)}_{1,2,i})\Big) + \lambda_2 \Big(L_t(b^{(y)}_1, b^{(x)}_{2,3}) + L_t(b^{(y)}_2, b^{(x)}_{1,3}) + L_t(b^{(y)}_3, b^{(x)}_{1,2})\Big) \quad (10)$$

where $\lambda_1$ and $\lambda_2$ control the weights of the intra-modal and inter-modal similarity losses, respectively.

Besides, we adopt a weighted cross-entropy loss to ensure that each individual hash code is consistent with its own semantics, especially the pseudo-codes obtained from fusion; the loss is defined as follows:

$$L_{cl}(b) = -\sum_{j=1}^{C} \Big(w_p \cdot l_j \log(\hat{l}_j) + (1 - l_j) \log(1 - \hat{l}_j)\Big) \quad (11)$$

where $\hat{l}$ is the label value predicted by RMSH and $w_p$ is the weight of positive samples.
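A minimal sketch of this weighted classification term (Eq. (11)), assuming the predicted tag probabilities come from a classifier head on the code; the names and toy values are illustrative.

    import numpy as np

    def weighted_bce(l_true, l_pred, w_pos, eps=1e-7):
        """Eq. (11): positive tags are up-weighted by w_pos to counter label sparsity."""
        l_pred = np.clip(l_pred, eps, 1 - eps)
        return -np.sum(w_pos * l_true * np.log(l_pred) + (1 - l_true) * np.log(1 - l_pred))

    l_true = np.array([1, 0, 0, 1, 0], dtype=float)    # ground-truth multi-label vector
    l_pred = np.array([0.7, 0.1, 0.2, 0.4, 0.05])      # predicted tag probabilities
    print(weighted_bce(l_true, l_pred, w_pos=20.0))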

Finally, the overall objective of RMSH for N training triplets $\{b_{i1}, b_{i2}, b_{i3}\}_{i=1}^{N}$ is defined as follows:

$$\min_{\theta} L = \sum_{i=1}^{N} \Big( \sum_{j=1}^{3} L_{cl}(b^{(x,y)}_{ij}) + L_{tr}(b^{(x,y)}_{i\{1,2,3\}}) + \lambda_3 \big(L_{cl}(b^{(x,y)}_{i4}) + L_{cl}(b^{(x,y)}_{i5})\big) \Big) \quad \text{s.t. } b^{(x)}, b^{(y)} \in \{-1, 1\}^K \quad (12)$$

where $\lambda_3$ balances the classification loss of the pseudo-codes, and θ denotes the parameters of the image hashing, text hashing, and PCN networks.

3.5 Optimization

Since Eq. (12) is a discrete optimization problem, we follow [31] to solve it. We relax b to z by introducing the quantization error $\|z - b\|_2^2$ and substitute the Hamming distance with the Euclidean distance, i.e., $d(z_1, z_2) = \|z_1 - z_2\|_2^2$. Eq. (12) is rewritten as follows:

$$\min_{\theta, b} L = \sum_{i=1}^{N} \Big( \sum_{j=1}^{3} L_{cl}(z^{(x,y)}_{ij}) + \lambda_3 \sum_{j=4}^{5} L_{cl}(z^{(x,y)}_{ij}) + L_{tr}(z^{(x,y)}_{i\{1,2,3\}}) + \lambda_4 \sum_{j=1}^{3} \|z^{(x,y)}_{ij} - b_{ij}\|_2^2 \Big) \quad \text{s.t. } b \in \{-1, 1\}^K \quad (13)$$

where $z^{(*)}_{i4} = f_u(z^{(*)}_{i1}, z^{(*)}_{i2})$, $z^{(*)}_{i5} = f_t(z^{(*)}_{i1}, z^{(*)}_{i2})$, and $\lambda_4$ controls the weight of the quantization error. In the training phase, we let $b_i = b^{(x)}_i = b^{(y)}_i$ for better performance. We adopt an alternating optimization strategy to learn θ and b as follows.

1) Learn θ with b fixed. When b is fixed, the optimization of θ is performed using stochastic gradient descent based on Adaptive Moment Estimation (Adam).

2) Learn b with θ fixed. When θ is fixed, the problem in Eq. (13) can be reformulated as follows:

$$\max_{b} \ \lambda_4 \sum_{i=1}^{N} b_i^{T} \big(z^{(x)}_i + z^{(y)}_i\big) \quad \text{s.t. } b \in \{-1, 1\}^K \quad (14)$$

We can see that the binary code $b_i$ should keep the same sign as $z^{(x)}_i + z^{(y)}_i$, i.e., $b_i = \mathrm{sgn}(z^{(x)}_i + z^{(y)}_i)$.
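Since the b-update in step 2 has a closed form, it can be shown in a few lines; a minimal numpy sketch (variable names ours):

    import numpy as np

    def update_binary_codes(z_img, z_txt):
        """Step 2 of the alternating optimization: b_i = sgn(z_i^(x) + z_i^(y)) maximizes Eq. (14)."""
        b = np.sign(z_img + z_txt)
        b[b == 0] = 1          # break sgn(0) ties toward +1 so codes stay in {-1, 1}
        return b

    rng = np.random.default_rng(0)
    z_img = rng.uniform(-1, 1, size=(4, 16))   # binary-like image codes
    z_txt = rng.uniform(-1, 1, size=(4, 16))   # binary-like text codes
    print(update_binary_codes(z_img, z_txt))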

4 Experiments and Discussions

4.1 Datasets

Microsoft COCO [32] contains 82,783 training and 40,504 testing images. Each image is associated with five sentences (only the first sentence is used in our experiments) and belongs to the 80 most frequent categories. We use the whole training set for training and sample 4,956 pairs from the testing set as queries.

MIRFLICKR25K [33] contains 25,000 image-text pairs. Each point is associated with some of 24 labels. We remove pairs without textual tags or labels and subsequently obtain 18,006 pairs as the training set and 2,000 pairs as the testing set. The text descriptions are given as 1386-dimensional bag-of-words vectors.

NUS-WIDE [34] contains 260,648 web images belonging to 81 concepts. After pruning the data without any label or tag information, only the top 10 most frequent labels and the corresponding 186,577 text-image pairs are kept. 80,000 pairs and 2,000 pairs are sampled as the training and testing sets, respectively. The text descriptions are given as 1000-dimensional bag-of-words vectors. We sample 5,000 pairs of the training set for training.

4.2 Evaluation protocol and Baselines

Evaluation protocol. We perform cross-modal retrieval with two tasks: (1) Image to Text (I→T): retrieve relevant data in the text training set using a query from the image test set. (2) Text to Image (T→I): retrieve relevant data in the image training set using a query from the text test set. In particular, to evaluate the retrieval performance of multi-label data, we adopt the Normalized Discounted Cumulative Gain (NDCG) [35] as the performance metric, since NDCG evaluates a ranking by penalizing errors according to the relevance score and the position. It is defined as follows:

$$\mathrm{NDCG}@p = \frac{1}{Z} \sum_{i=1}^{p} \frac{2^{r_i} - 1}{\log(1 + i)} \quad (15)$$

where Z is the ideal DCG@p computed from the correct ranking list, and p is the length of the list. $r(i) = |l_q \cap l_i|$ denotes the similarity between the i-th point and the query, following the definition in [36]; $l_q$ and $l_i$ denote the labels of the query and of the point at the i-th position.
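A minimal numpy sketch of NDCG@p as defined in Eq. (15), assuming base-2 logarithms and, as a simplification of the ideal list above, taking the ideal ranking to be the best reordering of the same retrieved items; the helper name is illustrative.

    import numpy as np

    def ndcg_at_p(rel, p):
        """NDCG@p (Eq. (15)) for a list of relevance scores r_i = |l_q ∩ l_i| in retrieval order."""
        rel = np.asarray(rel, dtype=float)[:p]
        discounts = np.log2(1 + np.arange(1, len(rel) + 1))      # log(1 + i), i = 1..p
        dcg = np.sum((2.0 ** rel - 1) / discounts)
        ideal = np.sort(rel)[::-1]                               # best possible ordering of these items
        idcg = np.sum((2.0 ** ideal - 1) / discounts)
        return dcg / idcg if idcg > 0 else 0.0

    # toy ranking: shared-tag counts between the query and the top-5 retrieved items
    print(ndcg_at_p([2, 0, 1, 3, 0], p=5))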

Besides, the commonly-used precision-recall curve is used as a metric to evaluate the performance, where we consider two points relevant if they share at least one tag.

Baselines. We compare our proposed RMSH with seven cross-modal hashing methods: CMFH [20], UCH [18], SePH [22], DCMH [11], TDH [37], SSAH [24], and SVHN [13]. For a fair comparison, all shallow methods take as inputs the deep off-the-shelf features extracted by ResNet-152 [38] pre-trained on ImageNet, and all deep models are implemented carefully with the same CNN sub-structures for image data and the same multiple fully-connected layers for textual data. The hyper-parameters of all baselines are set according to the original papers or experimental validations.

4.3 Implementation details

Our RMSH method is implemented with TensorFlow³. We use ResNet [38] pre-trained on ImageNet as the CNN model in the image hashing network to extract features of the image modality. For the MS COCO dataset, 4800-dimensional Skip-thought vectors [39] give the sentence descriptions. The structure of the two hashing layers is D(∗) → 1024 → K bits, where D is the feature dimension of modality ∗. For fast training, we only optimize the hashing layers and keep the pre-trained CNN sub-structures fixed (not fine-tuned on the target dataset, for a fair comparison with the shallow methods) for all deep methods. In all experiments, we set the batch size to 128, the number of epochs to 50, λ1, λ2, λ3, and λ4 to 0.01, 0.1, 0.1, and 0.1, the learning rate to 0.001, δ by the derived results in Sec. 3.2, and wp to 20 for the NUS-WIDE and MIRFLICKR25K datasets and 30 for MS COCO.

³ https://www.tensorflow.org/

4.4 Experimental Results

Table 1 reports the NDCG@500 results of the compared methods, and the precision-recall curves with code length 64 on the three datasets are shown in Fig. 3. From Table 1, we observe that our RMSH method substantially outperforms the other compared methods on all used datasets. Specifically, compared to the best deep method SVHN, RMSH obtains relative increases of 0.2%∼5.86%, 0.1%∼2.58%, and 0.1%∼7.94% with different bit lengths for the two cross-modal retrieval tasks on the MIRFLICKR25K, NUS-WIDE, and MS COCO datasets, respectively. Because the SVHN method learns to preserve binary cross-modal similarity structures, its hash codes cannot capture the fine-grained ranking information and thus achieve inferior NDCG scores. In contrast, RMSH learns the more complicated similarity with more robustness to the modality discrepancy and noise. From Fig. 3, we see that the precision-recall evaluations are consistent with the NDCG scores for the two cross-modal retrieval tasks. These results re-confirm the superiority of our RMSH method.

Fig. 4 and Fig. 5 show 'Text to Image' and 'Image to Text' search examples of the compared methods on the MS COCO dataset. As can be seen, the RMSH method tends to retrieve more relevant images than the others for queries containing certain concepts, e.g., person and dog. From Fig. 5, we observe that the RMSH method can roughly capture the similarity between images and text descriptions with multiple tags. Compared to the other methods, the top 5 retrieved textual results of RMSH are more meaningful and relevant.

To visualize the quality of the codes learned by RMSH, we use t-SNE to embed the 128-bit binary-like features of the NUS-WIDE test set into a 2-dimensional space and visualize their distribution in Fig. 6. As can be seen, both image and text features are well separated between different categories in the common space. Also, features of the same class from different modalities appear to be compact. These results indicate that RMSH preserves not only the multilevel semantic similarity of intra-modal data but also that of inter-modal data.

4.5 Analysis and Discussion

Impact of the robust parameter δ. To validate the correctness of the derived bounds of the effective robust parameter δ in Sec. 3.2, we separately tune δ in {K/40, K/20, K/8, 3K/20, K/4, K/3, K/2, 2K/3} with the other parameters fixed and report the retrieval performance in Fig. 7, where K=128. We also mark the bounds of the effective δ derived by Eqs. (5) and (7) in Fig. 7. We see that the NDCG scores of the RMSH method on all datasets are higher when δ is set within the bounds than when it is set outside them. This result experimentally confirms the correctness of the bounds for more robust cross-modal hash learning.


Table 1: Comparison (NDCG@500) of different cross-modal hashing methods on three datasets with different code lengths. The best result is shown in boldface.

Task: I→T
Method | MIRFLICKR25K (16/32/64/128 bits) | NUS-WIDE (16/32/64/128 bits) | MS COCO (16/32/64/128 bits)
CMFH | 0.2926 0.3154 0.3193 0.3282 | 0.4875 0.5012 0.5270 0.5394 | 0.2224 0.2571 0.2899 0.3098
UCH  | 0.3017 0.3114 0.3197 0.3362 | 0.5202 0.5175 0.5227 0.5175 | 0.2239 0.2588 0.2757 0.3143
SePH | 0.3310 0.3522 0.3667 0.3811 | 0.5726 0.5809 0.5867 0.6043 | 0.2266 0.2681 0.3113 0.3375
DCMH | 0.3857 0.4013 0.4162 0.4046 | 0.5942 0.6239 0.6353 0.6392 | 0.2257 0.2661 0.3089 0.3147
TDH  | 0.4379 0.4491 0.4491 0.4458 | 0.6196 0.6242 0.6305 0.6324 | 0.2447 0.2692 0.2946 0.3138
SSAH | 0.4203 0.4392 0.4626 0.4753 | 0.6026 0.6331 0.6263 0.6144 | 0.1955 0.2797 0.3157 0.3917
SVHN | 0.4186 0.4354 0.4675 0.4895 | 0.6039 0.6324 0.6486 0.6501 | 0.2526 0.2930 0.3270 0.3811
RMSH | 0.4592 0.4940 0.5180 0.5183 | 0.6148 0.6401 0.6497 0.6574 | 0.3037 0.3724 0.3855 0.3998

Task: T→I
Method | MIRFLICKR25K (16/32/64/128 bits) | NUS-WIDE (16/32/64/128 bits) | MS COCO (16/32/64/128 bits)
CMFH | 0.2857 0.3079 0.3115 0.3138 | 0.4642 0.4775 0.4998 0.5091 | 0.1982 0.2182 0.2398 0.2539
UCH  | 0.3027 0.3000 0.3157 0.3051 | 0.4923 0.5103 0.5008 0.5232 | 0.1895 0.2356 0.2569 0.2687
SePH | 0.3085 0.3245 0.3168 0.3582 | 0.5264 0.5285 0.5245 0.5337 | 0.1968 0.2367 0.2627 0.2839
DCMH | 0.3719 0.3819 0.3877 0.3894 | 0.5375 0.5451 0.5438 0.5463 | 0.1885 0.2122 0.2544 0.2601
TDH  | 0.3789 0.4084 0.3924 0.3937 | 0.5329 0.5340 0.5368 0.5306 | 0.2249 0.2420 0.2521 0.2712
SSAH | 0.3648 0.3815 0.3710 0.3707 | 0.5262 0.5428 0.5583 0.5375 | 0.1870 0.2382 0.2469 0.3114
SVHN | 0.3985 0.4434 0.4483 0.4604 | 0.5486 0.5702 0.5831 0.5885 | 0.2313 0.2664 0.2780 0.3163
RMSH | 0.4379 0.4533 0.4562 0.4628 | 0.5744 0.5921 0.5900 0.5999 | 0.2897 0.3080 0.3050 0.3175

Figure 3: Precision-Recall curves on the MIRFLICKR25K, NUS-WIDE, and MS COCO datasets (panels: I→T and T→I for each dataset; curves for RMSH, SVHN, SSAH, TDH, DCMH, SePH, UCH, and CMFH).


Figure 4: Retrieval examples of 'Text to Image' on MS COCO for the query "A motorcyclist riding down the street with a dog in a basket." (tags: Person, Car, Dog, Motorcycle), comparing CMFH, UCH, SePH, DCMH, TDH, SSAH, SVHN, and RMSH. A red border denotes an irrelevant result; blue denotes sharing one tag with the query; green denotes sharing at least two tags with the query.

Figure 5: Retrieval examples of 'Image to Text' on MS COCO for a query image captioned "A kid is in the air on a skateboard." (tags: Person, Car, Skateboard, Chair), comparing CMFH, UCH, SePH, DCMH, TDH, SSAH, SVHN, and RMSH. A red dot denotes an irrelevant result; blue denotes sharing one tag with the query; green denotes sharing at least two tags with the query.

Figure 6: t-SNE visualization of the binary-like codes of testing images and texts learned by RMSH on NUS-WIDE (six classes).

Besides, Fig. 8 shows the distribution of the Hamming distances of the codes learned by RMSH. We observe that 1) a too small δ makes the dissimilar points closer, which causes difficulties in keeping the multilevel similarity structure of the similar points; 2) a too large δ tends to scatter dissimilar points over the whole Hamming space, which is prone to cause false coding since the number of distinct codes satisfying the distance constraint is limited.

Ablation Study. To analyze the effectiveness of the different loss terms and the pseudo-codes network in the proposed RMSH method, we separately remove the classification loss term Lcl, the triplet loss term Ltr, and the pseudo-codes, with the other components retained, to evaluate their influence on the final performance. These three models are called RMSH-C, RMSH-T, and RMSH-P. Table 2 shows the results. We see that removing each of them damages the retrieval performance to varying degrees. Notably, 1) the performance gap between RMSH and RMSH-T enlarges as the code length increases on the MIRFLICKR25K dataset, which confirms the importance of the margin-adaptive triplet loss for preserving the fine-grained similarities of multi-label data in a larger Hamming space. 2) As the code length increases on the MIRFLICKR25K and NUS-WIDE datasets, the triplet loss term shows more influence on the final performance than the classification loss term, which indicates that the exploitation of ranking information is more critical than the discriminative information for capturing the similarity structure. 3) Because the similarity information on the MS COCO dataset (80 concepts) is sparser, the improvement introduced by the pseudo-codes strategy, i.e., the difference between RMSH and RMSH-P, is more significant.

4.6 Parameter Analysis

In this section, we conduct experiments to analyze the parameter sensitivity of RMSH on the three datasets by varying one parameter while fixing the others. There are mainly four hyper-parameters in RMSH, i.e., λ1, λ2, λ3, and λ4.


Figure 7: The impact of different robust parameters δ in the RMSH method on the MIRFLICKR25K, NUS-WIDE, and MS COCO datasets (NDCG@500 for Image to Text and Text to Image versus δ; marked bounds: LowerBound p=0.9 / UpperBound H=12.9 for MIRFLICKR25K, p=0.99 / H=6.2 for NUS-WIDE, p=0.9 / H=15.7 for MS COCO). The lower and upper bounds of the effective δ are marked. The code length is 128.

Table 2: Ablation study of RMSH in terms of NDCG@500 scores on the MIRFLICKR25K, NUS-WIDE, and MS COCO datasets. The best result is shown in boldface.

Task: I→T
Method | MIRFLICKR25K (16/32/64/128 bits) | NUS-WIDE (16/32/64/128 bits) | MS COCO (16/32/64/128 bits)
RMSH-T | 0.4053 0.4202 0.4524 0.4378 | 0.5437 0.5894 0.5894 0.5885 | 0.2759 0.3147 0.3430 0.3303
RMSH-C | 0.4066 0.4143 0.4331 0.4597 | 0.5946 0.6071 0.6383 0.6593 | 0.1592 0.1811 0.2038 0.2218
RMSH-P | 0.4290 0.4598 0.4920 0.5082 | 0.6042 0.6235 0.6248 0.6492 | 0.2918 0.3253 0.3331 0.3155
RMSH   | 0.4592 0.4940 0.5180 0.5183 | 0.6148 0.6401 0.6497 0.6574 | 0.3037 0.3724 0.3855 0.3998

Task: T→I
Method | MIRFLICKR25K (16/32/64/128 bits) | NUS-WIDE (16/32/64/128 bits) | MS COCO (16/32/64/128 bits)
RMSH-T | 0.3905 0.4169 0.4061 0.3677 | 0.5369 0.5560 0.5467 0.5583 | 0.2575 0.2725 0.2978 0.3079
RMSH-C | 0.3428 0.3696 0.3809 0.4071 | 0.5467 0.5502 0.5803 0.6069 | 0.1625 0.1714 0.1868 0.1983
RMSH-P | 0.3912 0.4177 0.4226 0.4453 | 0.5580 0.5725 0.5723 0.5823 | 0.2587 0.2747 0.2774 0.2742
RMSH   | 0.4379 0.4533 0.4562 0.4628 | 0.5744 0.5921 0.5900 0.5999 | 0.2897 0.3080 0.3050 0.3175

Figure 8: The conditional distributions (averaged over the MIRFLICKR25K image test set) of the Hamming distance between testing image codes and training text codes given the semantic similarity (P(dH|S>0) vs. P(dH|S=0)), after optimizing RMSH with different robust parameters δ (panels: δ=4, NDCG@500=0.4669; δ=7, 0.5053; δ=16, 0.5258; δ=20, 0.5268; δ=32, 0.5193; δ=43, 0.5106; δ=64, 0.4944; δ=86, 0.4704).

Figure 9: Parameter analysis of λ1, λ2, λ3, and λ4 (NDCG@500 for Image to Text and Text to Image) on the MIRFLICKR25K, NUS-WIDE, and MS COCO datasets.


The NDCG@500 values with different settings of the hyper-parameters in the case of 64 bits on the two cross-modal retrieval tasks are summarized in Fig. 9. From the figure, 1) we observe that the NDCG scores first increase slightly and then degrade as λ1 and λ2 become larger, respectively. This indicates that a proper setting of λ1 and λ2 helps preserve intra-modal and inter-modal similarities simultaneously, while a large gap between the values of λ1 and λ2 breaks this balance and hurts the final performance. 2) The performance decreases when λ3 gets large, since weighting the classification loss of the pseudo-codes more heavily introduces more noise into the optimization process. 3) The result for λ4 is similar to that for λ1: a suitable λ4 is useful to reduce the quantization loss, but a large λ4 weakens the goal of preserving similarity.

5 Conclusion

We have presented a novel robust multilevel semantic hashing method for multi-label cross-modal retrieval. The approach preserves the multilevel semantic similarity of data and explicitly ensures that the distance between different codes is larger than a specific value, for robustness to the modality discrepancy. In particular, we give an effective range for this value via an information coding-theoretic analysis and characterize the above goal as a margin-adaptive triplet loss. We further introduce a pseudo-codes network for imbalanced semantics. Our approach yields state-of-the-art empirical results on three benchmarks.

Acknowledgements

This work is partially supported by the National Science Foundation of China (61976115, 61672280, 61732006), the AI+ Project of NUAA (NZ2020012, 56XZA18009), and the China Scholarship Council (201906830057).

References

[1] D. Cao, Z. Yu, H. Zhang, J. Fang, L. Nie, Q. Tian, Video-based cross-modal recipe retrieval, in: Proc. ACM MM, 2019, pp. 1685–1693.
[2] Z. Wang, X. Liu, H. Li, L. Sheng, J. Yan, X. Wang, J. Shao, Camp: Cross-modal adaptive message passing for text-image retrieval, in: Proc. ICCV, 2019, pp. 5764–5773.
[3] A. Dutta, Z. Akata, Semantically tied paired cycle consistency for zero-shot sketch-based image retrieval, in: Proc. CVPR, 2019, pp. 5089–5098.
[4] B. Zhu, C.-W. Ngo, J. Chen, Y. Hao, R2gan: Cross-modal recipe retrieval with generative adversarial network, in: Proc. CVPR, 2019, pp. 11477–11486.
[5] T. Baltrusaitis, C. Ahuja, L. Morency, Multimodal machine learning: A survey and taxonomy, IEEE Trans. Pattern Anal. Mach. Intell. 41 (2) (2019) 423–443.
[6] R. Xu, C. Li, J. Yan, C. Deng, X. Liu, Graph convolutional network hashing for cross-modal retrieval, in: Proc. IJCAI, 2019, pp. 982–988.
[7] L. Zhen, P. Hu, X. Wang, D. Peng, Deep supervised cross-modal retrieval, in: Proc. CVPR, 2019, pp. 10394–10403.
[8] Y. Zhan, J. Yu, Z. Yu, R. Zhang, D. Tao, Q. Tian, Comprehensive distance-preserving autoencoders for cross-modal retrieval, in: Proc. ACM MM, 2018, pp. 1137–1145.
[9] H. Liu, M. Lin, S. Zhang, Y. Wu, F. Huang, R. Ji, Dense auto-encoder hashing for robust cross-modality retrieval, in: Proc. ACM MM, 2018, pp. 1589–1597.
[10] Y. Shi, X. You, F. Zheng, S. Wang, Q. Peng, Equally-guided discriminative hashing for cross-modal retrieval, in: Proc. IJCAI, 2019, pp. 4767–4773.
[11] Q. Jiang, W. Li, Deep cross-modal hashing, in: Proc. CVPR, 2017, pp. 3270–3278.
[12] J. Wang, T. Zhang, J. Song, N. Sebe, H. T. Shen, A survey on learning to hash, IEEE Trans. Pattern Anal. Mach. Intell. 40 (4) (2018) 769–790.
[13] P. Hu, X. Wang, L. Zhen, D. Peng, Separated variational hashing networks for cross-modal retrieval, in: Proc. ACM MM, 2019, pp. 1721–1729.
[14] Y. Cao, B. Liu, M. Long, J. Wang, Cross-modal hamming hashing, in: Proc. ECCV, 2018, pp. 202–218.
[15] X. Zhang, H. Lai, J. Feng, Attention-aware deep adversarial hashing for cross-modal retrieval, in: Proc. ECCV, 2018, pp. 591–606.
[16] Z.-D. Chen, Y. Wang, H.-Q. Li, X. Luo, L. Nie, X.-S. Xu, A two-step cross-modal hashing by exploiting label correlations and preserving similarity in both steps, in: Proc. ACM MM, 2019, pp. 1694–1702.
[17] X. Lu, L. Zhu, Z. Cheng, J. Li, X. Nie, H. Zhang, Flexible online multi-modal hashing for large-scale multimedia retrieval, in: Proc. ACM MM, 2019, pp. 1129–1137.
[18] C. Li, C. Deng, L. Wang, D. Xie, X. Liu, Coupled cyclegan: Unsupervised hashing network for cross-modal retrieval, in: Proc. AAAI, Vol. 33, 2019, pp. 176–183.
[19] S. Su, Z. Zhong, C. Zhang, Deep joint-semantics reconstructing hashing for large-scale unsupervised cross-modal retrieval, in: Proc. ICCV, 2019, pp. 3027–3035.
[20] G. Ding, Y. Guo, J. Zhou, Collective matrix factorization hashing for multimodal data, in: Proc. CVPR, 2014, pp. 2083–2090.
[21] D. Zhang, W. Li, Large-scale supervised multimodal hashing with semantic correlation maximization, in: Proc. AAAI, 2014, pp. 2177–2183.
[22] Z. Lin, G. Ding, M. Hu, J. Wang, Semantics-preserving hashing for cross-view retrieval, in: Proc. CVPR, 2015, pp. 3864–3872.
[23] V. E. Liong, J. Lu, Y. Tan, J. Zhou, Cross-modal deep variational hashing, in: Proc. ICCV, 2017, pp. 4097–4105.
[24] C. Li, C. Deng, N. Li, W. Liu, X. Gao, D. Tao, Self-supervised adversarial hashing networks for cross-modal retrieval, in: Proc. CVPR, 2018, pp. 4242–4251.
[25] V. Guruswami, A. Rudra, M. Sudan, Essential coding theory, draft available at http://www.cse.buffalo.edu/ atri/courses/coding-theory/book.
[26] H. J. Helgert, R. D. Stinaff, Minimum-distance bounds for binary linear codes, IEEE Trans. Information Theory 19 (3) (1973) 344–356.
[27] E. Agrell, A. Vardy, K. Zeger, A table of upper bounds for binary codes, IEEE Trans. Information Theory 47 (7) (2001) 3004–3006.
[28] E. Bellini, E. Guerrini, M. Sala, Some bounds on the size of codes, IEEE Trans. Information Theory 60 (3) (2014) 1475–1480.
[29] E. N. Gilbert, A comparison of signalling alphabets, The Bell System Technical Journal 31 (3) (1952) 504–522.
[30] A. Alfassy, L. Karlinsky, A. Aides, J. Shtok, S. Harary, R. S. Feris, R. Giryes, A. M. Bronstein, Laso: Label-set operations networks for multi-label few-shot learning, in: Proc. CVPR, 2019, pp. 6548–6557.
[31] D. Wu, Z. Lin, B. Li, M. Ye, W. Wang, Deep supervised hashing for multi-label and large-scale image retrieval, in: Proc. ICMR, 2017, pp. 150–158.
[32] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, C. L. Zitnick, Microsoft coco: Common objects in context, in: Proc. ECCV, 2014, pp. 740–755.
[33] M. J. Huiskes, M. S. Lew, The MIR flickr retrieval evaluation, in: Proceedings of the 1st ACM SIGMM International Conference on Multimedia Information Retrieval, MIR 2008, Vancouver, British Columbia, Canada, October 30-31, 2008, pp. 39–43.
[34] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y.-T. Zheng, Nus-wide: A real-world web image database from National University of Singapore, in: Proc. of ACM Conf. on Image and Video Retrieval (CIVR'09), July 8-10, 2009.
[35] K. Jarvelin, J. Kekalainen, IR evaluation methods for retrieving highly relevant documents, in: International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000, pp. 41–48.
[36] F. Zhao, Y. Huang, L. Wang, T. Tan, Deep semantic ranking based hashing for multi-label image retrieval, in: Proc. CVPR, 2015, pp. 1556–1564.
[37] C. Deng, Z. Chen, X. Liu, X. Gao, D. Tao, Triplet-based deep hashing network for cross-modal retrieval, IEEE Trans. Image Processing 27 (8) (2018) 3893–3903.
[38] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proc. CVPR, 2016, pp. 770–778.
[39] R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, R. Urtasun, A. Torralba, S. Fidler, Skip-thought vectors, in: Proc. NeurIPS, 2015, pp. 3294–3302.