
Weakly Supervised Dense Video Captioning

Zhiqiang Shen†, Jianguo Li‡, Zhou Su‡, Minjun Li†, Yurong Chen‡, Yu-Gang Jiang†, Xiangyang Xue†
†Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University   ‡Intel Labs China

!"# !$# !%#

&!'()" &!'()$ &!'()%

*̂#

!"" !$" !%" *̂"

max-informative

maximize diversity

!"2 !$2 !%2 *̂2

maximize diversity

…… …

LM

LM

LM

coherence

coherence

coherence

iteration

Contribution

(1) To the best of our knowledge, this is the first work for dense video captioning with only video-level sentence annotations.

(2) We propose a novel dense video captioning approach, which models visual cues with Lexical-FCN, discovers region-sequences with submodular maximization, and decodes language outputs with sequence-to-sequence learning. Although trained with only a weakly supervised signal, it produces informative and diverse captions.

(3) We evaluate dense captioning results by measuring the performance gap to oracle results and the diversity of the dense captions. The best single caption outperforms the state-of-the-art results on the MSR-VTT challenge significantly.

Lexical FCN Model

We define the loss function for a bag of instances. As each bag has multiple word labels, we adopt the cross-entropy loss to measure the multi-label errors:

  L(X, y; \theta) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \cdot \log \hat{p}_i + (1 - y_i) \cdot \log(1 - \hat{p}_i) \right]

where θ is the model parameters, N is the number of bags, y_i is the label vector for bag X_i, and p̂_i is the corresponding probability vector.

We use a noisy-OR formulation to combine the probabilities that the individual instances in the bag are negative:

  \hat{p}_i^w = P(y_i^w = 1 \mid X_i; \theta) = 1 - \prod_{x_{ij} \in X_i} \left( 1 - P(y_i^w = 1 \mid x_{ij}; \theta) \right)

where p̂_i^w is the probability that word w in the i-th bag is positive.
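The bag-level probability and the MIML loss can be computed directly from per-region word probabilities. Below is a minimal NumPy sketch of the noisy-OR combination and the multi-label cross-entropy; the array shapes and function names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def noisy_or_bag_probs(region_word_probs):
    """Noisy-OR over the instances (regions) of one bag.
    region_word_probs: (num_regions, vocab_size) array of P(y_w = 1 | x_ij)."""
    return 1.0 - np.prod(1.0 - region_word_probs, axis=0)      # (vocab_size,)

def miml_cross_entropy(bag_probs, bag_labels, eps=1e-8):
    """Multi-label cross-entropy over N bags.
    bag_probs, bag_labels: (N, vocab_size) arrays of p-hat_i and y_i."""
    p = np.clip(bag_probs, eps, 1.0 - eps)
    per_bag = np.sum(bag_labels * np.log(p) + (1.0 - bag_labels) * np.log(1.0 - p), axis=1)
    return -np.mean(per_bag)
```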

Region-Sequence Generation

Informativeness of a region-sequence is defined as the sum of each region's informativeness:

  f_{inf}(x_v, A_t) = \sum_{w} p_w, \qquad p_w = \max_{i \in A_t} p_i^w

Coherence aims to ensure the temporal coherence of the region-sequence. We select regions with the smallest changes temporally, and define the coherence component as:

  f_{coh} = \sum_{r_s \in A_{t-1}} \langle x_{r_t}, x_{r_s} \rangle

Diversity measures the degree of difference between a candidate region-sequence and all the existing region-sequences, and is defined with the Kullback-Leibler divergence as:

  f_{div} = \sum_{i=1}^{N} \int_{w} p_i^w \log \frac{p_i^w}{q_w} \, dw
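Region-sequences are grown greedily under these three terms. The sketch below is an illustrative NumPy version of such a greedy selection; the trade-off weights alpha and beta, the per-frame candidate structure, and the exact KL normalization are assumptions rather than the authors' published procedure.

```python
import numpy as np

def greedy_region_sequence(word_probs, feats, prev_seq_profiles, alpha=1.0, beta=1.0):
    """Greedily pick one region per frame to maximize
    f_inf + alpha * f_coh + beta * f_div (illustrative weights).
    word_probs: (T, R, V) per-region word probabilities from the Lexical-FCN.
    feats:      (T, R, D) per-region features used for temporal coherence.
    prev_seq_profiles: list of (V,) word-probability profiles of existing sequences."""
    T, R, V = word_probs.shape
    chosen = []                       # region index picked at each frame
    running_max = np.zeros(V)         # max_i p_i^w over regions selected so far
    for t in range(T):
        scores = np.empty(R)
        for r in range(R):
            f_inf = np.maximum(running_max, word_probs[t, r]).sum()
            f_coh = sum(float(feats[t, r] @ feats[s, chosen[s]]) for s in range(t))
            p = np.clip(word_probs[t, r], 1e-8, 1.0)
            f_div = sum(float(np.sum(p * np.log(p / np.clip(q, 1e-8, None))))
                        for q in prev_seq_profiles)
            scores[r] = f_inf + alpha * f_coh + beta * f_div
        best = int(np.argmax(scores))
        chosen.append(best)
        running_max = np.maximum(running_max, word_probs[t, best])
    return chosen, running_max
```

Because f_inf is a sum over words of a running maximum, it is monotone submodular, which is what justifies a greedy selection procedure of this kind.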

Overview

Overview of our Dense Video Captioning framework. In the language model, <BOS> denotes the begin-of-sentence tag and <EOS> denotes the end-of-sentence tag. We use zeros as <pad> when there is no input at the time step.

Visualization

Visualization of learned response maps from the last CNN layer (left), and the corresponding natural sentences (right). The blue areas in the response maps are of high attention, and the region-sequences are highlighted in white bounding-boxes.

Results on MSR-VTT 2016 dataset

Model           METEOR   BLEU@4   ROUGE-L   CIDEr
ruc-uva         26.9     38.7     58.7      45.9
VideoLAB        27.7     39.1     60.6      44.1
Aalto           26.9     39.8     59.8      45.7
V2t_navigator   28.2     40.8     60.9      44.8
Ours            28.3     41.4     61.1      48.9

Diversity of Dense Captions

Left: Diversity score of clustered ground-truth captions under different cluster numbers. Right: Diversity score comparison of our automatic method (middle) and the ground-truth (legend: DenseVidCap, 5-cluster of ground-truth, 20 sentences of ground-truth).

The diversity is calculated as:

  D_{div} = \frac{1}{n} \sum_{s_i, s_j \in S;\, i \neq j} \left( 1 - \langle s_i, s_j \rangle \right)

where S is the sentence set with cardinality n, and ⟨s_i, s_j⟩ denotes the cosine similarity between s_i and s_j.
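A minimal NumPy sketch of this diversity score is given below; how the sentences are vectorized (e.g. tf-idf or embedding vectors) is an assumption here, and the 1/n normalization follows the formula as written.

```python
import numpy as np

def diversity_score(sentence_vectors):
    """D_div = (1/n) * sum over ordered pairs i != j of (1 - cosine similarity).
    sentence_vectors: (n, d) array of sentence representations."""
    X = np.asarray(sentence_vectors, dtype=float)
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)   # unit-normalize rows
    sims = X @ X.T                                                # pairwise cosine similarities
    n = X.shape[0]
    off_diag = sims[~np.eye(n, dtype=bool)]                       # drop self-similarities
    return float(np.sum(1.0 - off_diag)) / n
```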

(Framework figure, input) Video frames (3 × W × H) are processed frame-by-frame by a CNN into anchor features (R × C × X × Y).

Video ground-truth descriptions (20 video-level sentences; examples):
1. 'a woman is taking a picture of children.'
2. 'a man involving three children.'
3. 'a group of people are looking at and taking pictures of a horse.'
4. 'a short clip showcasing a champion horse.'
5. 'a woman in a red blouse takes a picture.'
6. 'kids are in playful mood.'
7. 'kids are posing for a picture and being interviewed.'
8. 'lady taking pictures of horse.'
...
20. 'three man is describing a car.'

(Framework figure, model) Features of the selected regions (1 × C × X × Y) feed an LSTM encoder-decoder: the encoder consumes region features, the decoder emits the sentence between <BOS> and <EOS>, and <pad> fills the steps with no input. A WTA-based sentence-to-region-sequence association links lexical labels ("woman", "taking", ..., "children") to region-sequences. Training uses two losses: the Lexical Model Loss (MIML learning) and the Language Model Loss (Bi-S2VT learning).
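For the language model, an S2VT-style stacked-LSTM encoder-decoder is a natural fit for the <pad>/<BOS>/<EOS> scheme described above. The PyTorch sketch below is a generic S2VT-style model, not the authors' exact Bi-S2VT implementation; layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class S2VTCaptioner(nn.Module):
    """S2VT-style two-layer LSTM: a visual stream reads region features, a language
    stream reads words; zeros act as <pad> inputs when a stream has nothing to read."""
    def __init__(self, feat_dim=500, vocab_size=10000, hidden=512):
        super().__init__()
        self.feat_fc = nn.Linear(feat_dim, hidden)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm_visual = nn.LSTM(hidden, hidden, batch_first=True)
        self.lstm_language = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, words):
        # feats: (B, T_v, feat_dim) selected-region features; words: (B, T_w) token ids
        B, T_v, _ = feats.shape
        T_w = words.shape[1]
        v = self.feat_fc(feats)
        v = torch.cat([v, v.new_zeros(B, T_w, v.shape[-1])], dim=1)   # <pad> features while decoding
        h1, _ = self.lstm_visual(v)
        w = self.embed(words)
        w = torch.cat([w.new_zeros(B, T_v, w.shape[-1]), w], dim=1)   # <pad> words while encoding
        h2, _ = self.lstm_language(torch.cat([h1, w], dim=-1))
        return self.out(h2[:, T_v:, :])                               # word logits at decoding steps
```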

(Example figure, video442) Sample dense captions: 'a woman is showing the audience how to bake cookies', 'the woman holds the child', 'golden brown cookies test tasting'.


Training Lexical-FCN

(Illustration) A CNN over 4 × 4 grid regions maps each region to the lexical vocabulary (e.g. "woman", "takes", "blouse", "red", "photo", "people"), giving the instance-level word probabilities used by the MIML loss.
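A fully-convolutional lexical head can produce those per-region word probabilities directly from the feature map. The sketch below is a minimal PyTorch illustration under assumed channel and vocabulary sizes; it is not the authors' architecture.

```python
import torch
import torch.nn as nn

class LexicalFCNHead(nn.Module):
    """Minimal sketch of a fully-convolutional lexical head: a 1x1 convolution maps
    a C x X x Y feature map to per-region word probabilities (one sigmoid per word)."""
    def __init__(self, in_channels=512, vocab_size=10000):
        super().__init__()
        self.word_conv = nn.Conv2d(in_channels, vocab_size, kernel_size=1)

    def forward(self, feat_map):
        # feat_map: (B, C, X, Y) anchor features from a CNN backbone
        word_probs = torch.sigmoid(self.word_conv(feat_map))                # (B, V, X, Y)
        # each spatial cell (e.g. a 4x4 grid region) is one instance of the bag;
        # the noisy-OR rule over cells gives the bag-level word probability
        bag_probs = 1.0 - torch.prod(1.0 - word_probs.flatten(2), dim=2)    # (B, V)
        return word_probs, bag_probs
```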


