
Weakly Supervised Dense Video Captioning

Zhiqiang Shen†, Jianguo Li‡, Zhou Su‡, Minjun Li†, Yurong Chen‡, Yu-Gang Jiang†, Xiangyang Xue†
†Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University   ‡Intel Labs China

!"# !$# !%#

&!'()" &!'()$ &!'()%

*̂#

!"" !$" !%" *̂"

max-informative

maximize diversity

!"2 !$2 !%2 *̂2

maximize diversity

…… …

LM

LM

LM

coherence

coherence

coherence

iteration

Contribution

(1) To the best of our knowledge, this is the first work for dense video captioning with only video-level sentence annotations.

(2) We propose a novel dense video captioning approach, which models visual cues with Lexical-FCN, discovers region-sequences with submodular maximization, and decodes language outputs with sequence-to-sequence learning. Although trained with only a weakly supervised signal, it produces informative and diverse captions.

(3) We evaluate dense captioning results by measuring the performance gap to oracle results and the diversity of the dense captions. The best single caption outperforms the state-of-the-art results on the MSR-VTT challenge significantly.

Lexical FCN Model

We define the loss function for a bag of instances. As each bag has multiple word labels, we adopt the cross-entropy loss to measure the multi-label errors:

  L(X, y; \theta) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \cdot \log \hat{p}_i + (1 - y_i) \cdot \log(1 - \hat{p}_i) \right]

where θ is the model parameters, N is the number of bags, y_i is the label vector for bag X_i, and p̂_i is the corresponding probability vector.

We use a noisy-OR formulation to combine the probabilities that the individual instances in the bag are negative:

  \hat{p}_i^w = P(y_i^w = 1 \mid X_i; \theta) = 1 - \prod_{x_{ij} \in X_i} \left( 1 - P(y_i^w = 1 \mid x_{ij}; \theta) \right)

where p̂_i^w is the probability that word w in the i-th bag is positive.
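The bag-level probability and the MIML loss can be computed directly from per-region word probabilities. Below is a minimal NumPy sketch of the noisy-OR combination and the multi-label cross-entropy; the array shapes and function names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def noisy_or_bag_probs(region_word_probs):
    """Noisy-OR over the instances (regions) of one bag.
    region_word_probs: (num_regions, vocab_size) array of P(y_w = 1 | x_ij)."""
    return 1.0 - np.prod(1.0 - region_word_probs, axis=0)      # (vocab_size,)

def miml_cross_entropy(bag_probs, bag_labels, eps=1e-8):
    """Multi-label cross-entropy over N bags.
    bag_probs, bag_labels: (N, vocab_size) arrays of p-hat_i and y_i."""
    p = np.clip(bag_probs, eps, 1.0 - eps)
    per_bag = np.sum(bag_labels * np.log(p) + (1.0 - bag_labels) * np.log(1.0 - p), axis=1)
    return -np.mean(per_bag)
```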

Region-Sequence Generation

Informativeness of a region-sequence is defined as the sum of each region's informativeness:

  f_{inf}(x_v, A_t) = \sum_{w} p_w, \qquad p_w = \max_{i \in A_t} p_i^w

Coherence aims to ensure the temporal coherence of the region-sequence. We select regions with the smallest changes temporally, and define the coherence component as:

  f_{coh} = \sum_{r_s \in A_{t-1}} \langle x_{r_t}, x_{r_s} \rangle

Diversity measures the degree of difference between a candidate region-sequence and all the existing region-sequences, and is defined with the Kullback-Leibler divergence as:

  f_{div} = \sum_{i=1}^{N} \int_{w} p_i^w \log \frac{p_i^w}{q_w} \, dw
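Region-sequences are grown greedily under these three terms. The sketch below is an illustrative NumPy version of such a greedy selection; the trade-off weights alpha and beta, the per-frame candidate structure, and the exact KL normalization are assumptions rather than the authors' published procedure.

```python
import numpy as np

def greedy_region_sequence(word_probs, feats, prev_seq_profiles, alpha=1.0, beta=1.0):
    """Greedily pick one region per frame to maximize
    f_inf + alpha * f_coh + beta * f_div (illustrative weights).
    word_probs: (T, R, V) per-region word probabilities from the Lexical-FCN.
    feats:      (T, R, D) per-region features used for temporal coherence.
    prev_seq_profiles: list of (V,) word-probability profiles of existing sequences."""
    T, R, V = word_probs.shape
    chosen = []                       # region index picked at each frame
    running_max = np.zeros(V)         # max_i p_i^w over regions selected so far
    for t in range(T):
        scores = np.empty(R)
        for r in range(R):
            f_inf = np.maximum(running_max, word_probs[t, r]).sum()
            f_coh = sum(float(feats[t, r] @ feats[s, chosen[s]]) for s in range(t))
            p = np.clip(word_probs[t, r], 1e-8, 1.0)
            f_div = sum(float(np.sum(p * np.log(p / np.clip(q, 1e-8, None))))
                        for q in prev_seq_profiles)
            scores[r] = f_inf + alpha * f_coh + beta * f_div
        best = int(np.argmax(scores))
        chosen.append(best)
        running_max = np.maximum(running_max, word_probs[t, best])
    return chosen, running_max
```

Because f_inf is a sum over words of a running maximum, it is monotone submodular, which is what justifies a greedy selection procedure of this kind.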

Overview

Overview of our Dense Video Captioning framework. In the language model, <BOS> denotes the begin-of-sentence tag and <EOS> denotes the end-of-sentence tag. We use zeros as <pad> when there is no input at the time step.

Visualization

Visualization of learned response maps from the last CNN layer (left), and the corresponding natural sentences (right). The blue areas in the response maps are of high attention, and the region-sequences are highlighted in white bounding-boxes.

Results on MSR-VTT 2016 dataset

Model           METEOR   BLEU@4   ROUGE-L   CIDEr
ruc-uva         26.9     38.7     58.7      45.9
VideoLAB        27.7     39.1     60.6      44.1
Aalto           26.9     39.8     59.8      45.7
V2t_navigator   28.2     40.8     60.9      44.8
Ours            28.3     41.4     61.1      48.9

Diversity of Dense Captions

Left: Diversity score of clustered ground-truth captions under different cluster numbers. Right: Diversity score comparison of our automatic method (middle) and the ground-truth (legend: DenseVidCap, 5-cluster of ground-truth, 20 sentences of ground-truth).

The diversity is calculated as:

  D_{div} = \frac{1}{n} \sum_{s_i, s_j \in S;\, i \neq j} \left( 1 - \langle s_i, s_j \rangle \right)

where S is the sentence set with cardinality n, and ⟨s_i, s_j⟩ denotes the cosine similarity between s_i and s_j.
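A minimal NumPy sketch of this diversity score is given below; how the sentences are vectorized (e.g. tf-idf or embedding vectors) is an assumption here, and the 1/n normalization follows the formula as written.

```python
import numpy as np

def diversity_score(sentence_vectors):
    """D_div = (1/n) * sum over ordered pairs i != j of (1 - cosine similarity).
    sentence_vectors: (n, d) array of sentence representations."""
    X = np.asarray(sentence_vectors, dtype=float)
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)   # unit-normalize rows
    sims = X @ X.T                                                # pairwise cosine similarities
    n = X.shape[0]
    off_diag = sims[~np.eye(n, dtype=bool)]                       # drop self-similarities
    return float(np.sum(1.0 - off_diag)) / n
```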

(Framework figure, input) Video frames (3 × W × H) are processed frame-by-frame by a CNN into anchor features (R × C × X × Y).

Video ground-truth descriptions (20 video-level sentences; examples):
1. 'a woman is taking a picture of children.'
2. 'a man involving three children.'
3. 'a group of people are looking at and taking pictures of a horse.'
4. 'a short clip showcasing a champion horse.'
5. 'a woman in a red blouse takes a picture.'
6. 'kids are in playful mood.'
7. 'kids are posing for a picture and being interviewed.'
8. 'lady taking pictures of horse.'
...
20. 'three man is describing a car.'

(Framework figure, model) Features of the selected regions (1 × C × X × Y) feed an LSTM encoder-decoder: the encoder consumes region features, the decoder emits the sentence between <BOS> and <EOS>, and <pad> fills the steps with no input. A WTA-based sentence-to-region-sequence association links lexical labels ("woman", "taking", ..., "children") to region-sequences. Training uses two losses: the Lexical Model Loss (MIML learning) and the Language Model Loss (Bi-S2VT learning).
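For the language model, an S2VT-style stacked-LSTM encoder-decoder is a natural fit for the <pad>/<BOS>/<EOS> scheme described above. The PyTorch sketch below is a generic S2VT-style model, not the authors' exact Bi-S2VT implementation; layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class S2VTCaptioner(nn.Module):
    """S2VT-style two-layer LSTM: a visual stream reads region features, a language
    stream reads words; zeros act as <pad> inputs when a stream has nothing to read."""
    def __init__(self, feat_dim=500, vocab_size=10000, hidden=512):
        super().__init__()
        self.feat_fc = nn.Linear(feat_dim, hidden)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm_visual = nn.LSTM(hidden, hidden, batch_first=True)
        self.lstm_language = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, words):
        # feats: (B, T_v, feat_dim) selected-region features; words: (B, T_w) token ids
        B, T_v, _ = feats.shape
        T_w = words.shape[1]
        v = self.feat_fc(feats)
        v = torch.cat([v, v.new_zeros(B, T_w, v.shape[-1])], dim=1)   # <pad> features while decoding
        h1, _ = self.lstm_visual(v)
        w = self.embed(words)
        w = torch.cat([w.new_zeros(B, T_v, w.shape[-1]), w], dim=1)   # <pad> words while encoding
        h2, _ = self.lstm_language(torch.cat([h1, w], dim=-1))
        return self.out(h2[:, T_v:, :])                               # word logits at decoding steps
```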

(Example figure, video442) Sample dense captions: 'a woman is showing the audience how to bake cookies', 'the woman holds the child', 'golden brown cookies test tasting'.


Training Lexical-FCN

(Illustration) A CNN over 4 × 4 grid regions maps each region to the lexical vocabulary (e.g. "woman", "takes", "blouse", "red", "photo", "people"), giving the instance-level word probabilities used by the MIML loss.
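A fully-convolutional lexical head can produce those per-region word probabilities directly from the feature map. The sketch below is a minimal PyTorch illustration under assumed channel and vocabulary sizes; it is not the authors' architecture.

```python
import torch
import torch.nn as nn

class LexicalFCNHead(nn.Module):
    """Minimal sketch of a fully-convolutional lexical head: a 1x1 convolution maps
    a C x X x Y feature map to per-region word probabilities (one sigmoid per word)."""
    def __init__(self, in_channels=512, vocab_size=10000):
        super().__init__()
        self.word_conv = nn.Conv2d(in_channels, vocab_size, kernel_size=1)

    def forward(self, feat_map):
        # feat_map: (B, C, X, Y) anchor features from a CNN backbone
        word_probs = torch.sigmoid(self.word_conv(feat_map))                # (B, V, X, Y)
        # each spatial cell (e.g. a 4x4 grid region) is one instance of the bag;
        # the noisy-OR rule over cells gives the bag-level word probability
        bag_probs = 1.0 - torch.prod(1.0 - word_probs.flatten(2), dim=2)    # (B, V)
        return word_probs, bag_probs
```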


