Upload
others
View
14
Download
0
Embed Size (px)
Citation preview
…
!"# !$# !%#
&!'()" &!'()$ &!'()%
*̂#
!"" !$" !%" *̂"
max-informative
maximize diversity
!"2 !$2 !%2 *̂2
maximize diversity
…
…
…… …
…
LM
LM
LM
coherence
coherence
coherence
iteration
(1) Tothebestofourknowledge,thisisthefirstworkfordensevideo captioningwithonlyvideo-levelsentenceannotations.
(2) Weproposeanoveldensevideocaptioningapproach,whichmodelsvisualcueswithLexical-FCN,discoversregion-sequencewithsubmodularmaximization,anddecodeslanguageoutputswithsequence-to-sequencelearning.Althoughtrainedwithweaklysupervisedsignal,it can produceinformativeanddiversecaptions.
(3) Weevaluatedensecaptioningresultsbymeasuringtheperformancegaptooracleresults,anddiversityofthedensecaptions.Thebestsinglecaptionoutperformsthestate-of-the-artresultsontheMSR-VTTchallenge significantly.
Contribution
We useanoisy-ORformulationtocombinetheprobabilitiesthattheindividualinstancesinthebagarenegative:
LexicalFCNModel
Overview
Region-SequenceGeneration
DiversityofDenseCaptions
Thediversityiscalculatedas:
Visualization
Results onMSR-VTT 2016 dataset
finf (xv ,At ) = pww∑ ; pw = max
i∈Atpiw
fcoh = 〈xrt ,xrs 〉rs∈At−1∑
fdiv = piw
w∫i=1
N
∑ log piw
qwdw
Informativeness ofaregion-sequenceisdefinedasthesumofeachregion’sinformativeness:
Coherence aimstoensurethetemporalcoherenceoftheregion-sequence,wetrytoselectregionswiththesmallestchangestemporally,anddefinethecoherencecomponentas:
Diversitymeasuresthedegreeofdifferencebetweenacandidateregion-sequenceandalltheexistingregion-sequences, which isdefinedwiththeKullback-Leibler divergenceas:
OverviewofourDenseVideoCaptioningframework.Inthelanguagemodel,<BOS>denotesthebegin-of-sentencetagand<EOS>denotestheend-of-sentencetag.Weusezerosas<pad>whenthereisnoinputatthetimestep.
p̂iw = P(yi
w = 1|Xi;θ ) = 1− (1− P(yiw = 1| xij;θ ))
xij∈Xi
∏
Wedefinethelossfunctionforabagofinstances.Aseachbaghasmultiplewordlabels,weadoptthecross- entropylosstomeasurethemulti-labelerrors:
Left: Diversityscoreofclusteredground-truthcaptionsunderdifferentclusternumbers. Right: Diversityscorecomparisonofour automaticmethod(middle)andtheground-truth.
VisualizationoflearnedresponsemapsfromthelastCNNlayer(left),andthecorrespondingnaturalsentences(right).Theblueareasintheresponsemapsareofhighattention,andtheregion-sequencesarehighlightedinwhitebounding-boxes.
Model METEOR BLEU@4 ROUGE-L CIDEr
ruc-uva 26.9 38.7 58.7 45.9
VideoLAB 27.7 39.1 60.6 44.1
Aalto 26.9 39.8 59.8 45.7
V2t_navigator 28.2 40.8 60.9 44.8
Ours 28.3 41.4 61.1 48.9
Ddiv =1n
(1− 〈si , s j∈S; i≠ j∑ si ,s j 〉)
where S isthesentencesetwithcardinalityn, and ⟨si, sj⟩ denotesthecosinesimilaritybetween si and sj.
…CNN CNN CNN
Frames3 x W x H
… … …AnchorfeaturesR x C x X x Y
Video Ground-truth Descriptions:1. 'a woman is taking a picture of children.'2. 'a man involving three children.’3. 'a group of people are looking at and taking pictures of a horse.'4. 'a short clip showcasing a champion horse.'5. 'a woman in a red blouse takes a picture.'6. 'kids are in playful mood.'7. 'kids are posing for a picture and being interviewed.'8. 'lady taking pictures of horse.'
20. 'three man is describing a car.'
Featureofselectedregions1 x C x X x Y
LSTM LSTM LSTM… LSTM LSTM LSTM
A woman is
LSTM
taking
LSTM
<EOS>
<BOS>
Encoder
…
…
Lexical labels
…
…WTAbasedsentence-to-region-sequenceassociation
"woman""taking"!
"children"
⎡
⎣
⎢⎢⎢⎢
⎤
⎦
⎥⎥⎥⎥
LSTM LSTM LSTM… LSTM LSTM LSTM LSTM LSTM…<pad><pad><pad>
<pad> <pad> <pad> <pad> <pad>
…
Region-sequencegeneration
Lexical Model Loss(MIML Learning)
Language Model Loss(Bi-S2VT Learning)
Decoder
Lexical-FCNmodel
Languagemodel
awoman isshowing theaudience howtobakecookies
thewoman holds thechild
Video
goldenbrowncookiestesttasting
video442
����� ����� ����
�������������������
20 sentences of ground-truth
DenseVidCap 5-cluster of ground-truth
where θ isthemodelparameters, N isthenumberofbags, yi isthelabelvectorforbag Xi, and isthecorrespondingprobabilityvector.
where istheprobabilitywhenword w inthei-th bagispositive.
p̂i
p̂wi
TrainingLexical-FCN
LexicalFCN
woman
takes
blouse
red
photo
People
…
…
4x4gridregions
Lexicalvocabulary
CNN
IllustrationofDenseVidCap
��"�# �'��'�# �#� �&#!����*%
��"�# �'��'�# �#� �&#!����$))!�
��"�# �"���(*�) �'�)�! �#� (#��#$)��' "�# �"���(*�)
*���#���
����$# ��&*�#��( � ��#(������% �'$*#��)'*)����������������������������������(+#�!�"��&���&�"��"��� �#�# ���������������������������������(+#�!�"��&��(� ��"���"�##&'����������������������������������&���('��&��'�#+"��'�(+#�!�"���*������'�)''�#"�����������������������������������!�"�+�(�����#(( ��� �"�'�(���� �''�#���"#(��&��"���#(��(�������&�"����������������������������������(+#�!�"��&��(� ��"����#)(�'#!�(��"���"���&�"��"��'#!�(��"����������������������������������
Zhiqiang Shen†,Jianguo Li‡,ZhouSu‡,Minjun Li†, Yurong Chen‡,Yu-GangJiang†,Xiangyang Xue††ShanghaiKeyLaboratoryofIntelligentInformationProcessing, SchoolofComputerScience,Fudan University ‡IntelLabsChina
WeaklySupervisedDenseVideoCaptioning
L(X,y;θ ) = − 1N
yi ⋅log p̂i+(1−yi )⋅log(1−p̂i )[ ]i=1
N∑