Sequence to Sequence – Video to Text
Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko
ICCV 2015
M2 Soichiro Murakami
10/14/16
Introduction
Video → Text
Example: “A monkey is pulling a dog’s tail and is chased by the dog.”
Main contribution
• Propose a novel model that learns to directly map a sequence of frames to a sequence of words.
General seq2seq model
a. handle a variable number of frames
b. learn and use the temporal structure of the video
c. learn a language model to generate natural and grammatical sentences
Fig. 1
Related work 1/2
• Image captioning [8, 40]
1. generate a fixed-length vector representation of an image
2. decode this vector into a sequence of words
• FGM [36]
1. identify the semantic content (subject, verb, object, scene)
2. combine them with confidences from a language model using a factor graph to infer the most likely tuple in the video
3. generate a sentence based on a template
• Mean Pool [39]
• LSTMs are used to generate video descriptions by pooling the representations of individual frames (see the sketch below).
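A tiny illustration of the Mean Pool idea (hypothetical PyTorch code; the shapes are toy values and the CNN feature extraction is omitted):

```python
import torch

# 10 per-frame CNN features (random stand-ins for fc7 activations)
frame_feats = torch.rand(10, 4096)
# Mean Pool [39]: average the per-frame features into one video vector,
# which then conditions the LSTM sentence decoder
video_vec = frame_feats.mean(dim=0)   # shape: (4096,)
```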
Related work 2/2
• Temporal-Attention [43] (ICCV 2015)
• employs a 3-D convnet that incorporates spatiotemporal motion features built on dense trajectory descriptors (HoG, HoF, MBH)
• uses an attention mechanism that learns to weight the frame features (see the sketch below)
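A minimal sketch of that soft-attention step (hypothetical PyTorch code; the linear scorer is only a placeholder for the learned relevance function of [43]):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feats = torch.rand(10, 4096)               # 10 per-frame features (toy data)
scorer = nn.Linear(4096, 1)                # placeholder relevance scorer
weights = F.softmax(scorer(feats), dim=0)  # (10, 1) attention weights over frames
context = (weights * feats).sum(dim=0)     # weighted average fed to the decoder
```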
Approach 1/2
• 3.1 LSTM for sequence modeling
• 3.2 Sequence to sequence video to text
Model the conditional probability p(y1, ..., ym | x1, ..., xn), where x1, ..., xn is the sequence of video frames and y1, ..., ym is the sequence of words.
Fig. 2: the first LSTM layer’s hidden state is concatenated with the word embedding and fed to the second layer; zt denotes the output of the second LSTM layer (see the sketch below).
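A minimal sketch of the stacked two-layer LSTM of Fig. 2 (hypothetical PyTorch code, not the authors' Caffe implementation; layer sizes, names, and the batch-of-one loops are illustrative):

```python
import torch
import torch.nn as nn

class S2VT(nn.Module):
    """Sketch of S2VT: encode frames, then decode words, in one unrolled pass."""
    def __init__(self, feat_dim=4096, hid=1000, vocab=20000, emb=500):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)     # word id -> 500-d vector
        self.lstm1 = nn.LSTMCell(feat_dim, hid)   # first layer: frame features
        self.lstm2 = nn.LSTMCell(hid + emb, hid)  # second layer: [h1 ; word emb]
        self.out = nn.Linear(hid, vocab)          # z_t -> scores over words

    def forward(self, frames, prev_words):
        # frames: (n, feat_dim) float; prev_words: (m,) long (previous word ids)
        h1 = c1 = h2 = c2 = torch.zeros(1, self.lstm1.hidden_size)
        pad_word = torch.zeros(1, self.embed.embedding_dim)
        pad_frame = torch.zeros(1, self.lstm1.input_size)
        logits = []
        # Encoding stage: layer 1 reads frames; layer 2 sees [h1 ; <pad>].
        for x in frames:
            h1, c1 = self.lstm1(x.unsqueeze(0), (h1, c1))
            h2, c2 = self.lstm2(torch.cat([h1, pad_word], dim=1), (h2, c2))
        # Decoding stage: layer 1 reads zero-padded frame inputs; layer 2
        # concatenates h1 with the previous word's embedding; z_t = h2.
        for w in prev_words:
            h1, c1 = self.lstm1(pad_frame, (h1, c1))
            emb_w = self.embed(w.unsqueeze(0))
            h2, c2 = self.lstm2(torch.cat([h1, emb_w], dim=1), (h2, c2))
            logits.append(self.out(h2))
        return torch.stack(logits)  # (m, 1, vocab): one distribution per step
```

Training then maximizes the log-likelihood of each target word given zt, factoring p(y1, ..., ym | x1, ..., xn) over the decoding steps.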
Approach 2/2
• 3.3 Video and text representation
• RGB frames
• apply a pre-trained CNN (AlexNet or the 16-layer VGG model) to the input frames and provide the output of the top layer as input to the LSTM units
• Optical flow
• first extract classical variational optical flow features [2]
• then create flow images and apply a pre-trained CNN
• Text
• embed words into a lower, 500-dimensional space by applying a linear transformation to the input data
(a feature-extraction sketch follows this list)
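A hedged sketch of these representations (the paper extracted fc7 features with Caffe AlexNet/VGG-16; this torchvision approximation assumes a recent torchvision for the `weights` argument):

```python
import torch
import torch.nn as nn
import torchvision.models as models

# fc7-style features: run a pre-trained VGG-16 and keep the 4096-d
# activation after the second fully connected layer
vgg = models.vgg16(weights="IMAGENET1K_V1").eval()
fc7 = nn.Sequential(vgg.features, vgg.avgpool, nn.Flatten(),
                    vgg.classifier[:5])       # stop after fc7 + ReLU

with torch.no_grad():
    frames = torch.rand(10, 3, 224, 224)      # 10 RGB (or flow-image) frames
    feats = fc7(frames)                       # (10, 4096) per-frame LSTM inputs

# text: a 500-d word embedding, i.e. a linear map applied to one-hot vectors
embed = nn.Embedding(num_embeddings=20000, embedding_dim=500)
word_vecs = embed(torch.tensor([1, 7, 42]))   # (3, 500)
```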
Experimental Setup (1/3)
• Video description datasets
• Microsoft Video Description Corpus (MSVD)
• a collection of YouTube clips & single-sentence descriptions from annotators
• MPII Movie Description Dataset (MPII-MD)
• Hollywood movies & movie scripts and audio description data
• Montreal Video Annotation Dataset (M-VAD)
• Hollywood movies & audio description data for the visually impaired
➢ They used a single sentence as the target sentence for each video.
Experimental Setup (2/3)
Table 1: Corpus statistics
Example of MPII-MD (“A Dataset for Movie Description”, Anna Rohrbach, Marcus Rohrbach, Niket Tandon, Bernt Schiele, CVPR 2015)
Experimental Setup (3/3)
• Evaluation Metrics
• METEOR [7]
• METEOR compares exact token matches, stemmed tokens, and paraphrase matches, as well as semantically similar matches using WordNet synonyms (see the example below).
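The paper reports scores from the METEOR tool of [7]; as a quick illustrative stand-in, nltk ships a reimplementation (recent nltk versions expect pre-tokenized input and require the WordNet data via `nltk.download('wordnet')`):

```python
from nltk.translate.meteor_score import meteor_score

reference = "a man is playing a guitar".split()
hypothesis = "a man plays the guitar".split()
# rewards exact, stemmed, and WordNet-synonym matches; higher is better
print(meteor_score([reference], hypothesis))
```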
• Experimental details of the models
• unroll the LSTM to a fixed 80 time steps during training (a truncate/pad sketch follows this list)
• for longer videos, truncate the number of frames
• for shorter videos, pad the remaining inputs with zeros
• mini-batch size: up to 8 for AlexNet, up to 3 for the flow model
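A small sketch of the truncate/pad scheme just described (hypothetical helper; in the paper the 80 unrolled steps cover both the frame inputs and the generated words):

```python
import numpy as np

def fit_frames(feats, num_steps=80, feat_dim=4096):
    """Truncate longer videos and zero-pad shorter ones to num_steps rows."""
    out = np.zeros((num_steps, feat_dim), dtype=np.float32)
    n = min(len(feats), num_steps)
    out[:n] = feats[:n]          # copy what fits; the rest stays zero padding
    return out

padded = fit_frames(np.random.rand(120, 4096))   # a long video gets truncated
```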
Results and Discussion – MSVD dataset –
• The S2VT AlexNet model on RGB video frames achieves 27.9% METEOR.
• The flow model alone performs poorly.
• Polysemous words (e.g., “playing”):
• playing a guitar
• playing golf
Results and Discussion – Movie description datasets –
• It was best to use dropout at the inputs and outputs of both LSTM layers (see the sketch below).
• SMT [28]
• translates holistic video representations to a single sentence
• Visual-Labels [27]
• an LSTM-based approach that uses no temporal encoding, but more diverse visual features, namely object detectors as well as activity and scene classifiers
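A minimal sketch of that dropout placement (hypothetical PyTorch code; the rate and layer sizes are illustrative):

```python
import torch.nn as nn

class TwoLayerLSTMWithDropout(nn.Module):
    def __init__(self, in_dim=4096, hid=1000, p=0.5):
        super().__init__()
        self.drop = nn.Dropout(p)
        self.lstm1 = nn.LSTM(in_dim, hid, batch_first=True)
        self.lstm2 = nn.LSTM(hid, hid, batch_first=True)

    def forward(self, x):
        h1, _ = self.lstm1(self.drop(x))  # dropout on layer-1 inputs
        h1 = self.drop(h1)                # dropout on layer-1 outputs (= layer-2 inputs)
        h2, _ = self.lstm2(h1)
        return self.drop(h2)              # dropout on layer-2 outputs
```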
Conclusion
• They construct descriptions using a sequence to sequence model, where frames are first read sequentially and then words are generated sequentially.
• Their model achieves state-of-the-art performance on the MSVD dataset.
• For further information...
• https://www.cs.utexas.edu/~vsub/s2vt.html