
From Captions to Visual Concepts and Back

Hao Fang1,∗, Saurabh Gupta2,∗, Forrest Iandola2,∗, Rupesh K. Srivastava3,∗, Li Deng4, Piotr Dollár5

Jianfeng Gao4, Xiaodong He4, Margaret Mitchell4, John C. Platt6, C. Lawrence Zitnick4, Geoffrey Zweig4

∗Equal Contribution, 1U. of Washington, 2UC Berkeley, 3IDSIA, USI-SUPSI, 4Microsoft Research, 5Facebook AI Research, 6Google

Figure 1: An illustrative example of our pipeline.

When does a machine “understand” an image? One definition is when it can generate a novel caption that summarizes the salient content within an image. This content may include objects that are present, their attributes, or relations between objects. Determining the salient content requires not only knowing the contents of an image, but also deducing which aspects of the scene may be interesting or novel through commonsense knowledge.

This paper describes a novel approach for generating image captions from samples of captioned images. Previous approaches to generating image captions relied on object, attribute, and relation detectors learned from separate hand-labeled training data [22, 47]. Instead, we train our caption generator from a dataset of images and corresponding image descriptions.

Figure 1 shows an overview of our approach. We first use Multiple Instance Learning to learn detectors for words that occur in image captions. We then train a maximum entropy language model conditioned on the set of detected words and use it to generate candidate captions for the image. Finally, we learn to re-rank these generated sentences and output the highest-scoring sentence.
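
As a rough illustration of how these three stages fit together, the following Python sketch wires up hypothetical components; the detector, language model, and re-ranker interfaces are placeholders for exposition, not the authors' released code.

```python
# A minimal sketch of the three-stage pipeline described above. The component
# interfaces (word_detectors, language_model, reranker) are hypothetical
# placeholders for illustration; they do not correspond to a released implementation.

def caption_image(image, word_detectors, language_model, reranker, beam_size=20):
    """Detect words, generate candidate captions, then re-rank them."""
    # Stage 1: Multiple Instance Learning detectors score each caption word
    # against regions of the image; a word is "detected" if its best region
    # score exceeds that detector's threshold.
    detected_words = {
        word for word, detector in word_detectors.items()
        if detector.max_region_score(image) > detector.threshold
    }

    # Stage 2: a maximum entropy language model, conditioned on the set of
    # detected words, proposes candidate sentences (e.g., via beam search).
    candidates = language_model.beam_search(detected_words, beam_size=beam_size)

    # Stage 3: a re-ranker using global sentence- and image-level features
    # scores each candidate; the highest-scoring sentence is returned.
    return max(candidates, key=lambda sentence: reranker.score(image, sentence))
```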

Our evaluation was performed on the challenging MS COCO dataset [4, 28], which contains complex images with multiple objects. Example results are shown in Figure 2. We use multiple automatic metrics and better/worse/equal comparisons by human subjects on Amazon’s Mechanical Turk to evaluate the quality of our automatic captions on a subset of the validation set.
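
To illustrate how a multi-reference metric such as BLEU compares a generated caption against several human references, here is a minimal sketch using NLTK's sentence-level BLEU; the captions are made-up examples, and the official COCO evaluation server uses its own corpus-level implementations of BLEU, METEOR, ROUGE-L, and CIDEr rather than this function.

```python
# Minimal sketch of scoring one generated caption against multiple human
# references with BLEU, assuming NLTK is installed. Illustration only;
# not the COCO evaluation server's implementation.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a man riding a wave on top of a surfboard".split(),
    "a surfer rides a large wave in the ocean".split(),
]
candidate = "a man riding a wave on a surfboard".split()

smooth = SmoothingFunction().method1
bleu1 = sentence_bleu(references, candidate,
                      weights=(1.0, 0, 0, 0), smoothing_function=smooth)
bleu4 = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.3f}  BLEU-4: {bleu4:.3f}")
```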

We also submitted generated captions for the test set to the official COCO evaluation server (results in Figure 3). Surprisingly, our generated captions match or outperform humans on 12 out of 14 official metrics. We also outperform other public results on all official metrics. When evaluated by human subjects, our captions were judged to be of the same or better quality than human captions 34% of the time. Our results demonstrate the utility of training both visual detectors and language models directly on image captions, as well as using a global semantic model for re-ranking the caption candidates.

This is an extended abstract. Full paper available at the Computer Vision Foundation page.

Figure 2: Qualitative results for several randomly chosen images on the MS COCO dataset, with our generated caption (black) and a human caption (blue) for each image. Top rows also show MIL-based localization. More examples can be found on the project website: http://research.microsoft.com/image_captioning.

                CIDEr        BLEU-4       BLEU-1       ROUGE-L      METEOR
5 references    .912 (.854)  .291 (.217)  .695 (.663)  .519 (.484)  .247 (.252)
40 references   .925 (.910)  .567 (.471)  .880 (.880)  .662 (.626)  .331 (.335)

Figure 3: COCO evaluation server results on the test set (40,775 images). The first row shows results using 5 reference captions; the second row, 40 references. Human results are reported in parentheses.
