
2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)

A Deep Multi-Modal Pairwise Ranking Model for User Generated Food Data

Hesam Salehian∗, Surender Yerva†, Iman Barjasteh‡, Patrick Howell§ and Chul Lee¶

Under Armour Inc., 135 Townsend St, San Francisco, CA, USA
∗[email protected], †[email protected], ‡[email protected], §[email protected], ¶[email protected]

Abstract—Due to the emergence of several nutrition-related mobile applications and websites in recent years, as well as the massive amount of crowd-sourced nutrition data, searching and finding relevant results has become increasingly difficult for users. This problem becomes even more challenging when dealing with crowd-sourced food names that are noisy and not well-structured. Because food names are short, it is difficult to apply existing methods and achieve optimal matching quality. Despite several recent studies on nutrition data, these challenges remain. In this paper, we propose a novel learning-to-rank framework for crowd-sourced food names that has significant real-world applications, including food search and food recommendations. In particular, we propose a deep learning based, multi-modal learning-to-rank model that leverages both the text describing a food name and the numerical values that represent its nutritional information. To this end, we also introduce a novel loss function, which extends the standard triplet hinge loss to a multi-modal scenario. The proposed model is flexible and supports various data types as well as an arbitrary number of modalities. The effectiveness of our proposed model is demonstrated through several experiments on real data consisting of more than six million instances.

I. INTRODUCTION

MYFITNESSPAL (MFP) is a free health and fitness app, available on Android, iOS, and the web, that helps people set and achieve personalized health goals through the tracking of their nutrition and physical activity. In fact, MFP, used by more than 100 million users, is consistently ranked at the top of the health and fitness category in both the Apple App Store and Google Play. By tracking health and nutrition, MFP enables users to gain insights that help them make smarter choices and build healthier habits. Upon setting a personal fitness goal, users can visually inspect their fitness and weight-loss progress and receive insights and guidance that may help them reach their health goals. MFP also operates its own social network, social feeds, blogging platform, and user forums.

MFP's food data – namely, nutritional contents and food descriptions – is one of the main draws of the app, and is sourced almost entirely from user inputs. While MFP's database carries no guarantees of nutritional accuracy due to its reliance on crowd-sourcing, the great popularity of the app partially speaks to the quality of its food DB, which consists of hundreds of millions of food items, aggregated over the course of many years, and tens of billions of individual food log entries. One crucial component for unlocking such a large database of 100 million+ foods is the ability to search it for relevant results; in our case, a user inputs a text string and gets back a list of food objects that contain food brand, food description and nutritional information.

One natural problem that arises during food search is how to retrieve the most relevant food objects given the text string a user has entered. As an example, when a user inputs "orange" as the query, the result set contains a wide range of food entities containing the term "orange", including fruits, juices and desserts, with different nutritional information. A goal then is to increase the relevance of search results by factoring in other available contexts, or signals learned from real user behavior. Note that this sort of problem has been widely studied in the field of learning to rank in various search ranking scenarios. Our particular learning-to-rank problem differs from the previously studied instances and possesses some unique challenges that have not been seen before. This is mainly due to the peculiarity and nature of our food data. As mentioned before, our food data is collected via crowd-sourcing from over a hundred million users over the course of many years. It contains text describing each food name, and a real-valued vector that corresponds to the nutrients. As a consequence, our food data is not fully structured, sanitized or well organized, making our ranking problem far from straightforward. Furthermore, food names are short, hence the presence or absence of a single word, or the word ordering in a given food name, can significantly distort its semantics, making our ranking problem harder than a traditional learning-to-rank setting.

To illustrate some of the challenges in our ranking problem, some examples are provided in Table I. In the first two examples, the presence of the words "pie" and "sorbet" makes the corresponding foods irrelevant to the specified queries "apple" and "orange", respectively, while "Fuji apple" and "large orange" are probably more semantically relevant. The third example is more complicated in the sense that the relevant and irrelevant food names are comprised of the exact same set of words, but a subtle difference in word ordering makes "spaghetti with meat sauce" relevant, and "spaghetti sauce with meat" irrelevant, to this particular query. It should be noted that in all three examples, the nutritional contents can provide an important contextual clue for making the correct prediction of user intent.

Table I
SAMPLE FOOD SEARCH RESULTS FOR POPULAR QUERIES (CALORIE VALUES ARE NORMALIZED PER GRAM)

Query String | Relevant Food Name          | Relevant Food Calories | Irrelevant Food Name        | Irrelevant Food Calories
apple        | Fuji - Apple                | 0.50                   | Pie - Apple                 | 2.37
orange       | Orange - Large              | 0.47                   | Orange - Sorbet             | 1.32
spaghetti    | Spaghetti with meat sauce   | 1.57                   | Spaghetti sauce with meat   | 3.62

Given this observation, to overcome the complexities of food naming conventions in text, we decided to exploit the nutrition information of our food objects and incorporate it into our ranking framework. For instance, for the query "apple", the foods "Fuji - apple" and "pie - apple" may be evaluated as similar in name, but they are very different in nutritional content (0.5 and 2.37 calories per gram, respectively). Therefore, an effective ranking model should take the nutrition information into account, along with the text's semantic features, in an intelligent manner.

To this end, we propose a new approach: a multi-modal deep learning-to-rank framework that takes different modalities, such as nutrition information as well as food names, into account. Our deep multi-modal approach is novel in that it works well for a pair-wise learning-to-rank problem, whereas many similar models proposed previously have mainly addressed point-wise ranking problems due to expensive computation needs. As part of our model, we propose a novel triplet loss function that is suitable for our pair-wise multi-modal ranking setting, but may also prove relevant to other forms of multi-modal learning-to-rank scenarios. The benefits of the flexibility of this triplet loss function can be observed in a few different ways. First, it can easily be extended to handle more than two modalities by simply adding more term variables to the given loss function. Second, we believe that our novel multi-modal model can be used in other problem scenarios, including but not limited to text and image data, where a unified loss function is needed to combine vectors extracted from different modalities. Third, our novel objective function allows us to use embedded vectors of different sizes, with different geometric properties, from different modalities, while existing models simply assume almost the same vector lengths for the different data parts and concatenate them into a single feature vector. This is crucial for cases where some modalities are naturally more complicated than others; e.g., the nutrition vector in our problem has only 4 real-valued components, whereas the complexity of text data demands much larger embedding vector sizes. Therefore, simple concatenation of the nutrition and (much larger) embedded text vectors would make the low-dimensional nutrition modality negligible in the final prediction. Finally, the proposed loss function also enables the researcher to use different distance functions for different modalities, unlike existing models where all embedded vectors are concatenated and treated as lying in the same space prior to the distance computation.

II. RELATED WORK

Classic learning-to-rank techniques often compute low-level features to represent text data, including but not limited to TF-IDF and BM25 [1], [2]. The ranking function is then learned through standard machine learning techniques. The success of these models relies heavily on the choice of features, which often means that they require extensive feature engineering. The advent of deep learning [3] has significantly empowered automated and intelligent feature extraction when sufficiently large training data is available. Convolutional Neural Networks (CNNs), originally proposed to extract image features [4], have attracted enormous attention in recent years for addressing ranking problems. In [5], [6], [7], CNN models, which are good at spatial invariance, are employed to represent query and candidate images, and the ranking is learned via pair-wise or point-wise techniques. Closer to our domain, CNNs, specifically 1D convolution models, have also been used to learn text features. In these settings, individual words or characters are first represented as vectors, which then form a matrix representing the entire text. CNNs have been used for text matching/ranking in conjunction with word-level [8], [9], [10], [11] or character-level embeddings [12].

Even more recently, deep Recurrent Neural Networks (RNNs) have shown better performance than CNNs when applied to text data. This is mainly due to the fact that text data is usually sequential in nature, and an RNN can better model temporal dependencies [13], [14]. In particular, Long Short-Term Memory (LSTM) models are among the most commonly used types of RNNs, successfully applied to text ranking [15], machine translation [16] and language modeling [17]. Most existing learning-to-rank techniques consider only one type of data, e.g., image or text. In many applications, it is desirable to take data from different modalities into account to make more accurate predictions. Several deep learning models have been presented in the literature to learn semantics from different data modalities. In most of these techniques, image and text features are embedded into the same embedding space, via CNN or RNN models respectively, based on their semantic relations. In the recent literature, this multi-modal feature embedding approach has been used to address the image multi-labelling problem [18], [19] and image captioning [20], [14].

However, very few works have addressed the problem of multi-modal ranking. In [15] a pair-wise multi-modal ranking model is learned on image and text data. The ranking framework is a means to learn semantic relations between image and text instances, while the ultimate goal is to generate relevant captions for a given image; hence, like most of the existing techniques, the image and text embedded vectors are mapped to the same space. Consequently, the model in [15] is not directly applicable to our multi-modal data consisting of nutrition and text, because: (1) given a query text and a set of multi-modal candidates, we aim to learn the best ranking, and direct estimation of the query's nutritional contents is not desired; (2) the semantic relation between text and nutrition data is also not one-to-one, e.g., very different foods might still have similar, or in some cases identical, nutritional contents. From this perspective, the model in [21] seems more relevant to our work: a pair-wise multi-modal ranking is learned on image and text data, and the image and text embedding spaces are kept separate, which is closer to our use case. Our model is still fundamentally different from various standpoints: (1) in [21] the text data is embedded using bag-of-words features, while our model uses LSTM networks to embed food names, which provides more intelligence and requires very little pre-processing; (2) in [21] the text and image embedded vectors are concatenated before being fed into the ranking layer, while we keep our two modalities separated all the way through, which is more theoretically sound. To this end, we propose a novel triplet hinge loss function which takes embedded vectors from multiple modalities. The theoretical benefits of our approach are discussed in more detail in the next section.

III. OUR MODEL

In this section, we describe our proposed deep learning model to rank the matching of candidate pairs in a multi-modal fashion. As in other pairwise ranking methods, we are given a query text and two candidate foods, and we aim to rank one candidate higher than the other based on match quality. Each food candidate in our model is represented as a multi-modal entity consisting of (1) a text component and (2) a nutrition component. The aim of our model is to determine the matching relevance while taking both modalities into consideration.

A. Distance Function for Nutrition Vectors

Nutritional content is an essential data modality when it comes to comparing food objects. Nutritional content can be broadly divided into two categories: macro-nutrients and micro-nutrients. We first introduce a novel geometric model to represent nutritional content vectors. For simplicity, and without any loss of generality, we consider the $1 \times 4$ macro-nutrient vector (consisting of total energy, fat, carbs and protein) as the vector representation of nutritional content. To handle food objects with different serving sizes properly, we normalize each macro-nutrient per gram, for each food. The $1 \times 4$ vector of macro-nutrients, represented henceforth by $[e, f, c, p]$, corresponds to total energy (measured in calories), fat, carbs and protein, respectively. This vector satisfies the well-known constraint $e = 9f + 4c + 4p$ [22]. Hence, the contribution of each macro-nutrient towards the total energy can be measured by $f' = 9f/e$, $c' = 4c/e$, $p' = 4p/e$, so that $f' + c' + p' = 1$. Any nutritional content vector $[e, f, c, p]$ can be decomposed into two components: (1) the total energy $e$, and (2) the normalized vector of macros $[f', c', p']$. Note that the total energy is a positive value, i.e., $e \in \mathbb{R}^+$, while the square-root density vector [23], $M = [\sqrt{f'}, \sqrt{c'}, \sqrt{p'}]$, belongs to the two-dimensional sphere $S^2$, since $\sum_{i=1}^{3} M_i^2 = 1$. Thus, any nutritional content vector $[e, f, c, p]$ can be parameterized as $[e] \times [\sqrt{f'}, \sqrt{c'}, \sqrt{p'}]$, belonging to the product space $\mathbb{R}^+ \times S^2$. Given two nutritional content vectors $N_1 = [e_1, f_1, c_1, p_1]$ and $N_2 = [e_2, f_2, c_2, p_2]$, the intrinsic distance on this product space can be computed as $\mathrm{dist}^2_{nut}(N_1, N_2) = \mathrm{dist}^2_{\mathbb{R}^+}(e_1, e_2) + \mathrm{dist}^2_{S^2}(M_1, M_2)$, where $M_i = [\sqrt{9 f_i / e_i}, \sqrt{4 c_i / e_i}, \sqrt{4 p_i / e_i}]$, $i = 1, 2$ [24]. The second term is the intrinsic distance on the sphere, computed as $\mathrm{dist}_{S^2}(M_1, M_2) = \cos^{-1}(\langle M_1, M_2 \rangle)$, where $\langle \cdot, \cdot \rangle$ is the vector inner product [23]. Note that $\mathbb{R}^+$ is equivalent to the space of $1 \times 1$ Symmetric Positive Definite (SPD) matrices; thus, its intrinsic distance is defined as $\mathrm{dist}^2_{\mathbb{R}^+}(e_1, e_2) = (\log(e_1/e_2))^2$ [25]. In summary, given $N_i$, $M_i$ and $e_i$ defined as above, we have

$$\mathrm{dist}^2_{nut}(N_1, N_2) = \left[\cos^{-1}(\langle M_1, M_2 \rangle)\right]^2 + \left[\log\left(\frac{e_1}{e_2}\right)\right]^2 \tag{1}$$
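For concreteness, the following is a minimal Python sketch of Eq. 1 (the function name `dist_nut` and the use of NumPy are our own illustration; the paper does not publish an implementation):

```python
import numpy as np

def dist_nut(n1, n2):
    """Squared intrinsic distance (Eq. 1) between two gram-normalized
    macro-nutrient vectors [e, f, c, p] on the product space R+ x S^2."""
    def sqrt_density(n):
        e, f, c, p = n
        # Energy shares of fat, carbs, protein; f' + c' + p' = 1
        # under the constraint e = 9f + 4c + 4p.
        return np.sqrt(np.array([9.0 * f, 4.0 * c, 4.0 * p]) / e)
    m1, m2 = sqrt_density(n1), sqrt_density(n2)
    # Arc-length distance on the sphere between square-root density vectors.
    sphere = np.arccos(np.clip(np.dot(m1, m2), -1.0, 1.0))
    # Intrinsic distance on R+ between the total-energy components.
    energy = np.log(n1[0] / n2[0])
    return sphere ** 2 + energy ** 2

# Example with the per-gram macro vectors used later in Section III-B:
print(dist_nut([7.35, 0.8, 0.0, 0.0],        # butter
               [0.89, 0.003, 0.23, 0.011]))  # banana
```

The `np.clip` guards against inner products drifting slightly outside [-1, 1], which can happen with noisy crowd-sourced entries whose macro shares do not sum exactly to one.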

B. Multi-Modal Triplet Hinge Loss

The idea of using a triplet hinge loss function for pairwise ranking problems is not novel [26], [27]. However, we find that a direct application of a typical triplet hinge loss function to our problem setting is not feasible, since our modalities may require different embeddings. Thus, our proposed solution is designed in such a way that the loss function is capable of taking multiple types of data into account while preserving their individual geometric properties. A starting point is to first look at the classic single-modal hinge loss function, which can be used for a simple text-only model. More specifically, given a query text $Q$ and a pair of candidates $P$ and $N$, labeled as positive and negative respectively, the hinge loss is defined as:

$$L(Q, P, N) = \max\{0, \gamma + \mathrm{dist}^2_{txt}(f_q(Q), f(P)) - \mathrm{dist}^2_{txt}(f_q(Q), f(N))\} \tag{2}$$

where $\gamma$ is the gap parameter, which governs the separation level between positive and negative instances. $f_q(\cdot)$ and $f(\cdot)$ are text embedding functions for the query and the candidates, respectively, which transform text inputs to their respective $m$-dimensional feature vectors (i.e., $f_q(Q), f(P), f(N) \in \mathbb{R}^m$). In theory, the query embedding function $f_q$ can be different from the candidate function $f$, but in practice they are set to be identical. Furthermore, $\mathrm{dist}_{txt}(\cdot)$ is the distance function defined on the text embedding space; either the Euclidean (L2) distance or the Manhattan (L1) distance is a standard choice that has been used in many practical applications.
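As a reference point, a minimal sketch of Eq. 2 on already-embedded vectors might look as follows (plain NumPy, our own illustration, with squared L2 as the text distance):

```python
import numpy as np

def hinge_loss_text(q_vec, p_vec, n_vec, gamma=1.0):
    """Single-modal triplet hinge loss of Eq. 2.

    q_vec, p_vec, n_vec: the m-dimensional embeddings f_q(Q), f(P), f(N).
    """
    d_pos = np.sum((q_vec - p_vec) ** 2)  # dist^2_txt(f_q(Q), f(P))
    d_neg = np.sum((q_vec - n_vec) ** 2)  # dist^2_txt(f_q(Q), f(N))
    return max(0.0, gamma + d_pos - d_neg)
```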

We start by illustrating the limitation of existing models when two modalities with different geometrical spaces must be embedded. The authors in [26] applied the hinge loss function in Eq. 2 to the problem of learning image similarity, where the non-linear embedding function $f(\cdot)$ is learned in a deep learning framework. Their model is somewhat limited since it is only applied to a single modality. In [27] a unified framework is proposed to learn both image and text embeddings using a hinge loss function similar to that of [26]. In their model, though, the original distance function is replaced with cosine similarity, by swapping the signs of the positive and negative terms (i.e., Eq. (6) in [27]). Although their model takes two different modalities (image and text) into account, it simplifies the embedding of the image and text modalities by mapping them to the same space. Thus, the loss function used in their model is still not applicable to two wholly separate modalities that correspond to distinct embedding spaces, like ours. For their image caption generation problem, where direct comparison between image and text is necessary, it is natural to expect that using only one type of embedding would be sufficient. In our application scenario, using only one type of embedding would potentially limit or hinder important information conveyed in one or the other dimension. Therefore, our proposed model aims to learn the relevance of a given food candidate to an input query text while taking into account additional signals, like nutritional content, to improve the overall accuracy of match ranking. Unlike the image caption generation case, the mapping from food names to nutritional contents is not one-to-one, given that similar nutrition vectors can be mapped to very distinct names due to semantic differences.

We now present a novel triplet hinge loss function which extends Eq. 2 to the multi-modal case. Let $Q$ still represent the input query text, and let $P = \{P_{txt}, P_{nut}\}$ and $N = \{N_{txt}, N_{nut}\}$ be the multi-modal positive and negative food candidates, respectively, where $P_{txt}, N_{txt}$ represent the text information and $P_{nut}, N_{nut}$ the associated nutrition components. The text and nutrition vectors are of lengths $m$ and $n$, respectively. In order to compare foods of different quantities without any loss of generality, each nutrition vector contains the calorie, fat, carb and protein amounts per gram of the given food, hence $n = 4$. For example, for butter we have $P_{nut} = [7.35, 0.8, 0, 0]$ (i.e., 7.35 calories/gram and 0.8 fat/gram), while for banana we have $P_{nut} = [0.89, 0.003, 0.23, 0.011]$ (i.e., 0.89 calories/gram, 0.003 fat/gram, 0.23 carbs/gram and 0.011 protein/gram). Let $f(\cdot)$ ($f_q(\cdot)$ for the query input) be the unknown embedding function that transforms the given input text into an $m$-dimensional space. Let $g(\cdot)$ be the unknown embedding function that transforms the given input query text into an $n$-dimensional space, representing the learned nutritional content space. Therefore, $g(Q)$, $P_{nut}$ and $N_{nut}$ are all nutritional content vectors in this embedded space $\mathbb{R}^n$. Formally, our pair-wise multi-modal ranking problem can now be formulated using the following three pairs: $(f_q(Q), g(Q))$, $(f(P_{txt}), P_{nut})$, and $(f(N_{txt}), N_{nut})$. It was shown in the previous section that the $1 \times 4$ nutrition vector belongs to the product space $\mathbb{R}^+ \times S^2$. Hence, each pair of text and nutrition $(T, N)$ is a vector in the product space $\mathbb{R}^m \times \mathbb{R}^+ \times S^2$. Accordingly, the distance function in this product space is defined as [24]: $\mathrm{dist}^2((T_1, N_1), (T_2, N_2)) = \mathrm{dist}^2_{txt}(T_1, T_2) + \mathrm{dist}^2_{nut}(N_1, N_2)$, where $(T_i, N_i)$ for $i = 1, 2$ are two vectors in the product space of text and nutrition content, while $\mathrm{dist}_{txt}$ and $\mathrm{dist}_{nut}$ are distance functions defined on the text and nutrition spaces, respectively.

Note that the additive form of our distance function allows it to be decomposed into a text-based component and a nutrition-based component. More importantly, we can use a different distance metric for each component (e.g., the L2 distance for the text part and Eq. 1 for the nutrition part). Such flexibility in the distance function is important in practice, since the separability of two objects with respect to one modality may not necessarily be suitable with respect to another. This is in contrast with other similar work [28], where all embedded vectors $a$ and $b$ are simply concatenated to form a flattened vector $[a_1, \ldots, a_j, b_1, \ldots, b_k]$ in the higher-dimensional space of size $j + k$. Furthermore, this multi-modal distance function enables us to use the intrinsic distance corresponding to each data modality. For instance, the nutrition vectors naturally belong to $\mathbb{R}^+ \times S^2$, where the intrinsic distance is defined in Eq. 1, whereas simple concatenation of vectors would force us to use a single, non-intrinsic distance function across all modalities, which does not preserve the geometric properties of the data. Using the distance function on the product space, we can now define the triplet hinge loss function in our multi-modal model as:

$$L(Q, P_{txt}, N_{txt}, P_{nut}, N_{nut}) = \max\{0, \gamma + [\mathrm{dist}^2_{txt}(f_q(Q), f(P_{txt})) + \mathrm{dist}^2_{nut}(g(Q), P_{nut})] - [\mathrm{dist}^2_{txt}(f_q(Q), f(N_{txt})) + \mathrm{dist}^2_{nut}(g(Q), N_{nut})]\} \tag{3}$$

All non-linear embedding functions $f_q(\cdot)$, $f(\cdot)$ and $g(\cdot)$ in this equation are learned together via a deep learning model, which is discussed in the next section.
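Combining the hypothetical `dist_nut` sketch from Section III-A with Eq. 3 gives the following minimal illustration (again our own sketch on already-embedded NumPy vectors, reusing `np` and `dist_nut` from above; in training, these inputs would be symbolic tensors so that gradients reach $f_q$, $f$ and $g$):

```python
def hinge_loss_multimodal(q_txt, q_nut, p_txt, p_nut, n_txt, n_nut, gamma=1.0):
    """Multi-modal triplet hinge loss of Eq. 3.

    q_txt, p_txt, n_txt: text embeddings f_q(Q), f(P_txt), f(N_txt) in R^m.
    q_nut: g(Q), the nutrition-space vector predicted from the query text.
    p_nut, n_nut: observed gram-normalized nutrition vectors (R^4).
    """
    d_pos = np.sum((q_txt - p_txt) ** 2) + dist_nut(q_nut, p_nut)
    d_neg = np.sum((q_txt - n_txt) ** 2) + dist_nut(q_nut, n_nut)
    return max(0.0, gamma + d_pos - d_neg)
```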

C. Deep Learning Model

Long Short-Term Memory (LSTM) is one of the most popular Recurrent Neural Network architectures and has been successfully applied to various text understanding problems [29]. Although convolution-based models have also been widely used on short-text data [8], [12], they are unable to capture the various complexities associated with our short and noisy food data. In this paper, therefore, we adopt an LSTM-based deep neural network to better handle nuances such as the word ordering of food names – one simple illustration of the importance of word order is the comparison of foods like "chocolate milk" versus "milk chocolate". Another distinguishing aspect of our model is that, unlike other works such as [8], it does not require any extensive word embedding (e.g., Word2Vec [30]) to begin with. Instead, we use simple one-hot vectors to represent each word in the dictionary as inputs to our deep learning model; our proposed model learns the word embeddings as part of the training process. In the next section, we show the overall effectiveness of this proposed LSTM-based model compared to a similar CNN-based architecture.
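As an illustration of this architecture, a text tower in the spirit of the paper's Keras implementation might look as follows (all layer sizes are our own illustrative assumptions, not the paper's reported hyper-parameters):

```python
from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model

vocab_size, max_words, m, n = 50000, 10, 128, 4  # illustrative sizes only

# A food name (or query) enters as a padded sequence of word indices; the
# one-hot representation is realized by a trainable Embedding layer, so word
# vectors are learned during training rather than taken from Word2Vec.
words = Input(shape=(max_words,), dtype='int32')
h = Embedding(vocab_size, 64, mask_zero=True)(words)
h = LSTM(m)(h)  # order-sensitive encoding: "chocolate milk" != "milk chocolate"

f = Model(words, h)            # f(.) / f_q(.): text embedding in R^m
g = Model(words, Dense(n)(h))  # g(.): query text -> nutrition space (n = 4)
```

The outputs $f_q(Q)$, $f(P_{txt})$, $f(N_{txt})$ and $g(Q)$ would then be wired into Eq. 3 as the training objective.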

IV. EXPERIMENTS

In our experiments, we compare the proposed model against the following baselines and variants; their pairwise ranking accuracies are summarized in Table II.

Table II
PAIRWISE RANKING ACCURACY. ALL MODELS TRAINED ON A SET OF 6.5M TRIPLETS

Method                             | Accuracy (%)
Nutrition-Based LSTM               | 73.04
Text-Based LSTM                    | 82.16
Multi-Modal CNN                    | 91.96
Multi-Modal LSTM with Concat. Vecs | 93.42
Multi-Modal LSTM                   | 94.48

• Multi-Modal CNN: A model similar to ours, but using 1D-convolution filters of width 3 in place of the LSTM layers.
• Text-Based LSTM: An LSTM-based sub-model of our full model in which only the text component is used. The standard loss function in Eq. 2 is applied to the text vectors.
• Nutrition-Based LSTM: An LSTM-based sub-model of our full model in which only the nutrition content component is used. The standard loss function in Eq. 2 is applied to the nutrition content vectors.
• Multi-Modal LSTM with Concatenated Vectors: A model similar to ours, with the only exception that the embedded text and nutrition vectors are simply concatenated, after which the standard loss function in Eq. 2 is applied.
• Multi-Modal LSTM: Our proposed method.

All models have been implemented using Keras (https://keras.io/), with a Theano backend [31].

A. Training Data

Our training data consists of a set of triplets of the form <query, relevant candidate, irrelevant candidate>, in which the query refers to a food search text string, and the relevant (irrelevant) candidate refers to a food name candidate that is relevant (irrelevant) to the given query. This training data was collected from randomly sampled food search logs produced by hundreds of millions of food search activities in MFP. Table III contains a summary of the training data collection. First, a set of food names that frequently appeared within the top 5 search results for the randomly selected food search queries was retrieved. Then, the Click-Through Ratio (CTR), r(F|Q), for each food F given a query Q was computed. Next, each pair (Q, F) was labeled positive if r(F|Q) > 0.2, or negative if r(F|Q) < 0.05. For each query Q, a set of triplets of the form (Q, P, N) was generated, where P (N) corresponds to positive (negative) foods. Last, the corresponding gram-normalized nutritional contents for all candidates were retrieved. As a result, a training set of 6.5M randomly selected triplets was produced from the food search logs.

Table III
TRAINING DATA COLLECTION SUMMARY

Description                  | Value
Number of sessions           | ~500M
Number of unique queries     | ~70M
Number of unique foods       | ~10M
Number of collected triplets | 65K
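As a sketch of this labeling step (the in-memory log representation is our assumption; only the 0.2/0.05 CTR thresholds come from the paper):

```python
from itertools import product

def build_triplets(ctr, min_pos=0.2, max_neg=0.05):
    """ctr: dict mapping query -> {food_name: click-through ratio r(F|Q)}.

    Returns (Q, P, N) triplets where P is a positive food (r(F|Q) > 0.2)
    and N a negative food (r(F|Q) < 0.05) for the same query.
    """
    triplets = []
    for q, foods in ctr.items():
        pos = [food for food, r in foods.items() if r > min_pos]
        neg = [food for food, r in foods.items() if r < max_neg]
        triplets.extend((q, p, n) for p, n in product(pos, neg))
    return triplets
```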

B. Labeling of Triplets

In this section, we compare the accuracy of the different models on the task of triplet labeling. Our test set contains a set of triplets, each consisting of a query string and two food candidates, whose labels (positive/negative) are hidden for testing. As in a standard pairwise ranking test, each trained model is supposed to assign a positive label to one of the candidates and a negative label to the other. To do so, the distance from the embedded query vector(s) to each of the candidates is computed, and the candidate with the smaller distance is labeled as positive. For the multi-modal networks, the distance function in the product space of text and nutrition is computed between the query and each of its candidates. We used $\mathrm{dist}_{txt}(x, y) = \sqrt{\|x - y\|_2}$ and $\mathrm{dist}_{nut}$ from Eq. 1 in all experiments.
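In code, this labeling rule reduces to one distance comparison per triplet. A minimal sketch under the stated choices, reusing the hypothetical `dist_nut` from Section III-A (we read $\mathrm{dist}_{txt}(x, y) = \sqrt{\|x - y\|_2}$ literally, so the squared text distance is the plain L2 norm):

```python
import numpy as np

def product_dist2(q_txt, q_nut, c_txt, c_nut):
    # dist^2 in the product space: dist^2_txt + dist^2_nut, where
    # dist^2_txt = ||x - y||_2 because dist_txt(x, y) = sqrt(||x - y||_2).
    return np.linalg.norm(q_txt - c_txt) + dist_nut(q_nut, c_nut)

def pairwise_accuracy(triplets):
    """triplets: iterable of dicts holding the embedded query
    (q_txt = f_q(Q), q_nut = g(Q)) and both embedded candidates.
    A triplet counts as correct when the positive candidate is closer."""
    correct = sum(
        product_dist2(t['q_txt'], t['q_nut'], t['p_txt'], t['p_nut'])
        < product_dist2(t['q_txt'], t['q_nut'], t['n_txt'], t['n_nut'])
        for t in triplets)
    return correct / len(triplets)
```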

A small subset of our training dataset, consisting of 65k randomly selected instances, was withheld as the test set. The same test set was used to evaluate the accuracy of all models. Table II shows the results of this experiment. As expected, the model based solely on nutritional content, i.e., Nutrition-Based LSTM, shows the poorest performance among all five methods (73.04%). This is not surprising, since nutrition information is not a unique identifier of foods in general: completely different food items might have quite similar nutrition content. Next, Text-Based LSTM reaches a better accuracy of 82.16%, but still falls short of the more advanced multi-modal models. Once again, this is not surprising, because learning semantic relations from our crowd-sourced DB of short food names using text information alone is not sufficient, as previously pointed out. For instance, "apple fuji" and "apple pie" might be treated as semantically close food objects by standard similarity metrics (e.g., edit distance), but their nutritional contents differ very significantly, and these two food objects are in fact very different. This clearly justifies leveraging nutritional information to build an effective ML model for our application.

Among the multi-modal approaches, Multi-Modal CNN does a relatively good job of combining text and nutrition data. However, it is unable to achieve the same level of accuracy as the LSTM-based models. This might be due to the fact that convolution-based models rely mainly on word adjacency, while our crowd-sourced food names require a more robust way of capturing and representing the semantic relations between words. Finally, our proposed model, in which the geometric properties of the embedded text and nutrition vectors are preserved, is the clear winner, beating all other models, including the Multi-Modal LSTM with concatenated vectors, where the embedded vectors are simply concatenated.

In Table IV, we also show some sample triplets from the test set, namely "Apple" and "Black Pepper", along with the corresponding distances between the given query and each candidate, measured with respect to three different models: (1) Text-Based LSTM, (2) Multi-Modal CNN and (3) Multi-Modal LSTM (proposed).

Table IV
SAMPLE TRIPLETS LABELED BY THREE DIFFERENT MODELS. THE GAP BETWEEN POSITIVE AND NEGATIVE INSTANCES (WHERE LARGER IS BETTER) IS DEFINED AS g = dist(Q, N) - dist(Q, P). NEGATIVE GAP VALUES IMPLY THAT THE LABELS WERE INCORRECTLY PREDICTED. P_nut AND N_nut INDICATE THE NUTRITION VECTOR VALUES.

Query String | Positive Candidate                                      | Negative Candidate                                             | Model            | dist(Q, P) | dist(Q, N) | Gap
Apple        | Generic Fuji Apple, P_nut: [0.52, 0.01, 0.14, 0.01]     | Apple Strudel, N_nut: [2.74, 0.11, 0.42, 0.03]                 | Text-Based LSTM  | 0.657      | 0.057      | -0.600
             |                                                         |                                                                | Multi-Modal CNN  | 0.800      | 1.004      | +0.204
             |                                                         |                                                                | Multi-Modal LSTM | 0.659      | 0.989      | +0.330
Black Pepper | Spice Ground Black Pepper, P_nut: [2.17, 0, 0.43, 0]    | Graze Black Pepper Pistachio, N_nut: [3.21, 0.32, 0.03, 0.10]  | Text-Based LSTM  | 0.607      | 0.988      | +0.381
             |                                                         |                                                                | Multi-Modal CNN  | 0.941      | 0.939      | -0.002
             |                                                         |                                                                | Multi-Modal LSTM | 0.607      | 1.172      | +0.565

In the first example, "Apple", Text-Based LSTM clearly fails to assign the correct labels to the input candidates. This is because the text-based distance between "apple" and "apple strudel" is much smaller than the text-based distance between "apple" and "generic fuji apple". In contrast, the multi-modal models are more successful in predicting the labels, clearly showing the power of leveraging multiple modalities. Multi-Modal LSTM also shows a larger separation value (i.e., gap) between the positive and negative candidates. The gap value here is defined as the difference between dist(Q, N) and dist(Q, P), where dist(·) is the corresponding distance function learned by each model; a larger gap indicates that the distance function is better at distinguishing the targets. In the second example, "Black Pepper", all labels are correctly assigned by Text-Based LSTM, while Multi-Modal CNN fails to do so. Our Multi-Modal LSTM, on the other hand, not only predicts the correct labels but also achieves a larger gap between the negative and positive instances (0.565, versus 0.381 for Text-Based LSTM). Both examples clearly illustrate the overall superiority of our proposed model.

C. Food Search Ranking

To evaluate the performance of a learning-to-rank model, it is common practice to evaluate it in a real search application setting. In this section, we compare the performance of the following three models in a real-world food search ranking setting: (1) Text-Based LSTM, (2) Multi-Modal CNN and (3) Multi-Modal LSTM. For this experiment, we used the top 10 food search results for the 30 most popular queries. Each food name was assigned a label between 0 and 5 based on the Click-Through Ratio (CTR) observed in user search log events, with 0 being completely irrelevant and 5 being completely relevant. For every food corresponding to a given query, the embedded vectors from each model were computed, and the distance between the given query and the food candidate was measured. All items were ranked in ascending order of their distance to the given query, and finally the Normalized Discounted Cumulative Gain (NDCG) score [32] was computed for each ranked set.
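For reference, a minimal sketch of an NDCG@10 computation over such a ranked list (we assume the common exponential-gain, log2-discount form; the exact variant analyzed in [32] may differ):

```python
import numpy as np

def ndcg_at_k(relevance, k=10):
    """relevance: CTR-based labels (0-5) of the results in the order
    the model ranked them; the ideal ordering sorts labels descending."""
    rel = np.asarray(relevance, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = np.sum((2 ** rel - 1) * discounts)
    ideal = np.sort(np.asarray(relevance, dtype=float))[::-1][:k]
    idcg = np.sum((2 ** ideal - 1) * discounts[:ideal.size])
    return dcg / idcg if idcg > 0 else 0.0
```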

Table V summarizes the NDCG scores for this experiment. Even for some challenging queries like "salt" and "white flour", it is evident that our proposed Multi-Modal LSTM approach performs the best among the three competing models across all cases. Furthermore, the rightmost column contains the average NDCG score computed over all 30 queries, which again shows the Multi-Modal LSTM model as the winner. This experiment further demonstrates that our proposed model works very well in real-world food search applications.

Table V
NDCG@10 SCORE (%)

Method           | "apple" | "black pepper" | "salt" | "white flour" | "pizza" | Average over 30 queries
Text-based LSTM  | 83.21   | 83.85          | 43.38  | 52.45         | 93.44   | 88.90
Multi-modal CNN  | 93.12   | 83.85          | 52.83  | 54.12         | 93.44   | 90.57
Multi-modal LSTM | 100.0   | 90.60          | 58.31  | 56.92         | 94.24   | 92.72

V. CONCLUSION

In this paper, a deep learning model is proposed to address the pairwise multi-modal ranking problem. The main novelty of the proposed approach is the extension of the standard triplet hinge loss function to a multi-modal scenario. We applied LSTM models to effectively embed the text inputs into numerical feature vectors, and then supplemented these features with the additional modality of nutrition. All model parameters are trained jointly, without extensive pre-processing steps to compute the word embeddings. Although the proposed model is designed to address the problem of ranking foods with text and nutrition, it can readily be applied to other data types, with an arbitrary number of modalities.

REFERENCES

[1] L. Hang, "A short introduction to learning to rank," IEICE Transactions on Information and Systems, vol. 94, no. 10, pp. 1854–1862, 2011.
[2] H. Li, "Learning to rank for information retrieval and natural language processing," Synthesis Lectures on Human Language Technologies, vol. 7, no. 3, pp. 1–121, 2014.
[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[5] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu, "Learning fine-grained image similarity with deep ranking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1386–1393.
[6] F. Zhao, Y. Huang, L. Wang, and T. Tan, "Deep semantic ranking based hashing for multi-label image retrieval," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1556–1564.
[7] X. Zhao, X. Li, and Z. Zhang, "Multimedia retrieval via deep learning to rank," IEEE Signal Processing Letters, vol. 22, no. 9, pp. 1487–1491, 2015.
[8] A. Severyn and A. Moschitti, "Learning to rank short text pairs with convolutional deep neural networks," in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2015, pp. 373–382.
[9] Z. Lu and H. Li, "A deep architecture for matching short texts," in Advances in Neural Information Processing Systems, 2013, pp. 1367–1375.
[10] L. Rigutini, T. Papini, M. Maggini, and M. Bianchini, "A neural network approach for learning object ranking," in International Conference on Artificial Neural Networks. Springer, 2008, pp. 899–908.
[11] Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe, "Deep convolutional ranking for multilabel image annotation," arXiv preprint arXiv:1312.4894, 2013.
[12] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," in Advances in Neural Information Processing Systems, 2015, pp. 649–657.
[13] Z. Cao, F. Wei, L. Dong, S. Li, and M. Zhou, "Ranking with recursive neural networks and its application to multi-document summarization," in AAAI, 2015, pp. 2153–2159.
[14] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)," arXiv preprint arXiv:1412.6632, 2014.
[15] R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Unifying visual-semantic embeddings with multimodal neural language models," arXiv preprint arXiv:1411.2539, 2014.
[16] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[17] M. Sundermeyer, R. Schlüter, and H. Ney, "LSTM neural networks for language modeling," in Interspeech, 2012, pp. 194–197.
[18] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu, "CNN-RNN: A unified framework for multi-label image classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2285–2294.
[19] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov et al., "DeViSE: A deep visual-semantic embedding model," in Advances in Neural Information Processing Systems, 2013, pp. 2121–2129.
[20] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.
[21] C. Lynch, K. Aryafar, and J. Attenberg, "Images don't lie: Transferring deep visual semantic features to large-scale multimodal learning to rank," arXiv preprint arXiv:1511.06746, 2015.
[22] P. D. Howell, L. D. Martin, H. Salehian, C. Lee, K. M. Eastman, and J. Kim, "Analyzing taste preferences from crowdsourced food entries," in Proceedings of the 6th International Conference on Digital Health. ACM, 2016, pp. 131–140.
[23] A. Srivastava, I. Jermyn, and S. Joshi, "Riemannian analysis of probability density functions with applications in vision," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8.
[24] J. Lee, Riemannian Manifolds: An Introduction to Curvature, no. 176 in Graduate Texts in Mathematics, 1997.
[25] M. Moakher, "A differential geometric approach to the geometric mean of symmetric positive-definite matrices," SIAM Journal on Matrix Analysis and Applications, vol. 26, no. 3, pp. 735–747, 2005.
[26] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu, "Learning fine-grained image similarity with deep ranking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1386–1393.
[27] R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Unifying visual-semantic embeddings with multimodal neural language models," arXiv preprint arXiv:1411.2539, 2014.
[28] C. Lynch, K. Aryafar, and J. Attenberg, "Images don't lie: Transferring deep visual semantic features to large-scale multimodal learning to rank," arXiv preprint arXiv:1511.06746, 2015.
[29] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[30] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," in Proceedings of Workshop at ICLR, 2013.
[31] F. Chollet, "Keras: Theano-based deep learning library," 2015. Code: https://github.com/fchollet. Documentation: http://keras.io.
[32] Y. Wang, L. Wang, Y. Li, D. He, and T.-Y. Liu, "A theoretical analysis of NDCG type ranking measures," in Conference on Learning Theory, 2013, pp. 25–54.