DrillDown: Interactive Retrieval of Complex Scenes Using Natural Language Queries

When we’d like to retrieve an image of a complex scene

Difficult to describe the whole scene in one sentence

Image Search Engine

Single sentence as query

No refinement (no interaction)

Find a specific image in our gallery album

or online image collection

Image Retrieval with Multiple Rounds of Queries

Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries. Fuwen Tan, Paola Cascante-Bonilla, Xiaoxiao Guo, Hui Wu, Song Feng, Vicente Ordonez. Conf. on Neural Information Processing Systems (NeurIPS 2019). Vancouver, Canada. December 2019.

Previous efforts on Image-Text Matching

Two women sitting on the sofa

Woman in white shirt holding a dog

Woman in yellow shirt holding a cat

CNN RNN

1D Feature Space

[1] DeViSE: A Deep Visual-Semantic Embedding Model. Andrea Frome, Greg S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, Tomas Mikolov. NIPS 2013.

[2] Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. Andrej Karpathy, Armand Joulin, Li Fei-Fei. NIPS 2014.
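As a rough illustration of retrieval in a single joint feature space, the sketch below ranks gallery images by cosine similarity to one query vector. The vectors are hand-made toy values standing in for CNN image features and RNN sentence features; `rank_images` and the file names are hypothetical and not taken from the cited papers.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_images(query_vec, gallery):
    """Return image ids sorted from most to least similar to the query vector."""
    return sorted(gallery, key=lambda image_id: cosine(query_vec, gallery[image_id]), reverse=True)

# Toy image embeddings (assumed values, one vector per image).
gallery = {
    "sofa.jpg": [0.9, 0.1, 0.0],
    "beach.jpg": [0.0, 0.2, 0.9],
    "kitchen.jpg": [0.3, 0.9, 0.1],
}
# Toy sentence embedding for a single-sentence query.
query = [1.0, 0.2, 0.0]
ranking = rank_images(query, gallery)
```

With one vector per sentence and one per image, every detail of the query has to be squeezed into the same single point of the space.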

Previous efforts on Image-Text Matching

[3] Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations. Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, Wei-Ying Ma. CVPR 2019.

Observations

Feature channels

Spatial dimensions

2D image representation can help distinguish instances sharing the same feature subspace

Observations

Feature channels

Spatial dimensions

Two women sitting on the sofa

Woman in white shirt holding a dog

Woman in yellow shirt holding a cat

1D sentence representation can NOT distinguish instances sharing the same feature subspace

Observations

Feature channels

Spatial dimensions

Two women sitting on the sofa

Woman in white shirt holding a dog

Woman in yellow shirt holding a cat

2D sentence representation

“person” subspace

“dog” subspace

“cat” subspace

Instance1

Instance2

Instance3
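A toy example can make the 1D versus 2D contrast concrete. In the sketch below, the five feature channels (person, dog, cat, white, yellow) and the binary encodings are assumptions made up for illustration. Pooling the sentence vectors into one 1D vector gives the same result for two different scenes, while keeping one vector per instance keeps them apart.

```python
# Hypothetical channels: [person, dog, cat, white, yellow]
TWO_WOMEN_ON_SOFA = [1, 0, 0, 0, 0]

# Scene A: the descriptions from the slide.
scene_a = [
    TWO_WOMEN_ON_SOFA,
    [1, 1, 0, 1, 0],  # woman in white shirt holding a dog
    [1, 0, 1, 0, 1],  # woman in yellow shirt holding a cat
]

# Scene B: same words, but the shirt colors and animals are swapped.
scene_b = [
    TWO_WOMEN_ON_SOFA,
    [1, 0, 1, 1, 0],  # woman in white shirt holding a cat
    [1, 1, 0, 0, 1],  # woman in yellow shirt holding a dog
]

def pool_1d(vectors):
    """Collapse a list of vectors into one vector by averaging each channel."""
    count = len(vectors)
    return [sum(channel) / count for channel in zip(*vectors)]

pooled_a = pool_1d(scene_a)
pooled_b = pool_1d(scene_b)

same_when_pooled = pooled_a == pooled_b      # the 1D vectors coincide
same_when_kept_separate = scene_a == scene_b  # the per-instance vectors do not
```

The pooled vectors match because averaging discards which attributes belong to which instance; the per-instance layout preserves that binding.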

We still want compact representations

Especially, if it is for retrieval applications

Sentence 1: Feature vector 1

Sentence 2: Feature vector 2

Sentence 3: Feature vector 3

...

Text input

Pre-allocated state vectors

Text feature

Action: which state vector to update

Update the state vector
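The select-then-update steps can be sketched as follows. The slides do not give the selection rule or the update function, so both are stand-ins here: the slot is chosen by a simple similarity heuristic (a hypothetical `threshold` parameter) and the update is a linear blend (a hypothetical `alpha` parameter). The `StateMemory` class and its method names are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

class StateMemory:
    """A fixed number of pre-allocated state vectors (num_slots must be at least 1)."""

    def __init__(self, num_slots, dim):
        self.slots = [[0.0] * dim for _ in range(num_slots)]
        self.used = [False] * num_slots

    def choose_slot(self, text_feat, threshold=0.5):
        """Stand-in action: reuse the most similar used slot if it is similar
        enough, otherwise take the first unused slot, otherwise fall back to
        the most similar used slot."""
        best, best_sim = None, -2.0
        for i, slot in enumerate(self.slots):
            if self.used[i]:
                sim = cosine(slot, text_feat)
                if sim > best_sim:
                    best, best_sim = i, sim
        if best is not None and best_sim >= threshold:
            return best
        for i, is_used in enumerate(self.used):
            if not is_used:
                return i
        return best

    def update(self, text_feat, alpha=0.5):
        """Write the text feature into the chosen slot and return its index."""
        i = self.choose_slot(text_feat)
        if self.used[i]:
            # Stand-in update: blend the old state with the new text feature.
            self.slots[i] = [(1 - alpha) * old + alpha * new
                             for old, new in zip(self.slots[i], text_feat)]
        else:
            self.slots[i] = list(text_feat)
            self.used[i] = True
        return i

memory = StateMemory(num_slots=3, dim=3)
first = memory.update([1.0, 0.0, 0.0])   # new content, goes to an empty slot
second = memory.update([0.0, 1.0, 0.0])  # unrelated content, goes to another slot
third = memory.update([0.9, 0.1, 0.0])   # close to the first query, refines that slot
```

Only one state vector changes per query, so the memory size stays fixed no matter how many rounds of queries arrive.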

Pairwise alignment between state vectors and image regions
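One plausible way to turn pairwise alignment into an image score is to match each state vector to its best region and average those matches. This scoring rule is an assumption for illustration; the slides do not spell out the exact formula. `alignment_score` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def alignment_score(states, regions):
    """Average, over state vectors, of the best cosine match among image regions."""
    if not states or not regions:
        return 0.0
    best_matches = [max(cosine(state, region) for region in regions) for state in states]
    return sum(best_matches) / len(states)

# Two state vectors, each describing a different part of the scene (toy values).
states = [[1.0, 0.0], [0.0, 1.0]]

# Image A has a region matching each state; image B only matches the first one.
regions_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
regions_b = [[1.0, 0.0], [1.0, 0.1]]

score_a = alignment_score(states, regions_a)
score_b = alignment_score(states, regions_b)
```

Under this rule an image ranks highly only if every state vector finds a well-matching region, which is what lets later queries narrow the results.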

Simulated queries through region-phrase annotations at training time
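A minimal sketch of simulating a multi-round query sequence from region-phrase annotations: sample a few annotated phrases for the target image and issue them one per round. The annotation layout (a box tuple plus a phrase), the sampling strategy, and the `simulate_queries` helper are assumptions for illustration, not the exact training procedure.

```python
import random

def simulate_queries(region_phrases, num_rounds, seed=0):
    """Pick up to num_rounds distinct annotated phrases, in random order.

    region_phrases: list of (box, phrase) pairs, where box is (x, y, w, h).
    """
    rng = random.Random(seed)
    count = min(num_rounds, len(region_phrases))
    return [phrase for _box, phrase in rng.sample(region_phrases, count)]

# Hypothetical annotations for one image.
annotations = [
    ((10, 20, 200, 120), "two women sitting on the sofa"),
    ((30, 40, 80, 100), "woman in white shirt holding a dog"),
    ((150, 45, 80, 100), "woman in yellow shirt holding a cat"),
]

queries = simulate_queries(annotations, num_rounds=2, seed=0)
repeat = simulate_queries(annotations, num_rounds=2, seed=0)
```

Fixing the seed makes the simulated dialog reproducible, which is convenient when comparing models on the same query sequences.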

Human queries

Quantitative evaluation on a test set of 10,000 images
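The slide text does not name the metric. Assuming the common retrieval measure recall@K (the fraction of queries whose target image appears in the top K results), a minimal computation could look like this. `recall_at_k` and the toy rankings are hypothetical.

```python
def recall_at_k(rankings, targets, k):
    """Fraction of queries whose target id appears among the top-k ranked ids."""
    if not targets:
        return 0.0
    hits = sum(1 for ranking, target in zip(rankings, targets) if target in ranking[:k])
    return hits / len(targets)

# Toy rankings for three queries that all look for image "a".
rankings = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "b", "a"],
]
targets = ["a", "a", "a"]

r_at_1 = recall_at_k(rankings, targets, 1)
r_at_2 = recall_at_k(rankings, targets, 2)
r_at_3 = recall_at_k(rankings, targets, 3)
```

In a multi-round setting the same measure can be reported after each query round to see how quickly the target rises in the ranking.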

Although the more state vectors, the better, we could have an even more compact representation

[Figure panels, each labeled: Target]

Future work: instance-aware text encoder for dialog-based applications?

Potential challenges:

● Named entity detection

● Coreference resolution

● Negation

● ...

Q&A
