Sequence to Sequence – Video to Text
Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko
ICCV 2015
M2 Soichiro Murakami
10/14/16
Introduction
Video → Text
Example: “A monkey is pulling a dog’s tail and is chased by the dog.”
Main contribution
• Propose a novel model that learns to directly map a sequence of frames to a sequence of words.
General seq2seq model
a. handle a variable number of frames
b. learn and use the temporal structure of the video
c. learn a language model to generate natural and grammatical sentences
Fig. 1
Related work 1/2
• Image captioning [8, 40]
1. generate a fixed-length vector representation of an image
2. decode this vector into a sequence of words
• FGM [36]
1. identify the semantic content (subject, verb, object, scene)
2. combine them with confidences from a language model using a factor graph to infer the most likely tuple in the video
3. generate a sentence based on a template
• Mean Pool [39]
• LSTMs are used to generate video descriptions by pooling the representations of individual frames (see the sketch below).
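A tiny illustration of the Mean Pool idea (hypothetical PyTorch code; the shapes are toy values and the CNN feature extraction is omitted):

```python
import torch

# 10 per-frame CNN features (random stand-ins for fc7 activations)
frame_feats = torch.rand(10, 4096)
# Mean Pool [39]: average the per-frame features into one video vector,
# which then conditions the LSTM sentence decoder
video_vec = frame_feats.mean(dim=0)   # shape: (4096,)
```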
Related work 2/2
• Temporal-Attention [43] (ICCV 2015)
• employs a 3-D convnet that incorporates spatiotemporal motion features built on dense trajectory descriptors (HoG, HoF, MBH)
• uses an attention mechanism that learns to weight the frame features (see the sketch below)
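A minimal sketch of that soft-attention step (hypothetical PyTorch code; the linear scorer is only a placeholder for the learned relevance function of [43]):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feats = torch.rand(10, 4096)               # 10 per-frame features (toy data)
scorer = nn.Linear(4096, 1)                # placeholder relevance scorer
weights = F.softmax(scorer(feats), dim=0)  # (10, 1) attention weights over frames
context = (weights * feats).sum(dim=0)     # weighted average fed to the decoder
```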
Approach 1/2
• 3.1 LSTM for sequence modeling
• 3.2 Sequence to sequence video to text
Model the conditional probability p(y1, ..., ym | x1, ..., xn), where x1, ..., xn is the sequence of video frames and y1, ..., ym is the sequence of words.
Fig. 2: the first LSTM layer’s hidden state is concatenated with the word embedding and fed to the second layer; zt denotes the output of the second LSTM layer (see the sketch below).
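A minimal sketch of the stacked two-layer LSTM of Fig. 2 (hypothetical PyTorch code, not the authors' Caffe implementation; layer sizes, names, and the batch-of-one loops are illustrative):

```python
import torch
import torch.nn as nn

class S2VT(nn.Module):
    """Sketch of S2VT: encode frames, then decode words, in one unrolled pass."""
    def __init__(self, feat_dim=4096, hid=1000, vocab=20000, emb=500):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)     # word id -> 500-d vector
        self.lstm1 = nn.LSTMCell(feat_dim, hid)   # first layer: frame features
        self.lstm2 = nn.LSTMCell(hid + emb, hid)  # second layer: [h1 ; word emb]
        self.out = nn.Linear(hid, vocab)          # z_t -> scores over words

    def forward(self, frames, prev_words):
        # frames: (n, feat_dim) float; prev_words: (m,) long (previous word ids)
        h1 = c1 = h2 = c2 = torch.zeros(1, self.lstm1.hidden_size)
        pad_word = torch.zeros(1, self.embed.embedding_dim)
        pad_frame = torch.zeros(1, self.lstm1.input_size)
        logits = []
        # Encoding stage: layer 1 reads frames; layer 2 sees [h1 ; <pad>].
        for x in frames:
            h1, c1 = self.lstm1(x.unsqueeze(0), (h1, c1))
            h2, c2 = self.lstm2(torch.cat([h1, pad_word], dim=1), (h2, c2))
        # Decoding stage: layer 1 reads zero-padded frame inputs; layer 2
        # concatenates h1 with the previous word's embedding; z_t = h2.
        for w in prev_words:
            h1, c1 = self.lstm1(pad_frame, (h1, c1))
            emb_w = self.embed(w.unsqueeze(0))
            h2, c2 = self.lstm2(torch.cat([h1, emb_w], dim=1), (h2, c2))
            logits.append(self.out(h2))
        return torch.stack(logits)  # (m, 1, vocab): one distribution per step
```

Training then maximizes the log-likelihood of each target word given zt, factoring p(y1, ..., ym | x1, ..., xn) over the decoding steps.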
Approach 2/2
• 3.3 Video and text representation
• RGB frames
• apply a pre-trained CNN (AlexNet or the 16-layer VGG model) to the input frames and provide the output of the top layer as input to the LSTM units
• Optical flow
• first extract classical variational optical flow features [2]
• then create flow images and apply a pre-trained CNN
• Text
• embed words into a lower, 500-dimensional space by applying a linear transformation to the input data
(a feature-extraction sketch follows this list)
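A hedged sketch of these representations (the paper extracted fc7 features with Caffe AlexNet/VGG-16; this torchvision approximation assumes a recent torchvision for the `weights` argument):

```python
import torch
import torch.nn as nn
import torchvision.models as models

# fc7-style features: run a pre-trained VGG-16 and keep the 4096-d
# activation after the second fully connected layer
vgg = models.vgg16(weights="IMAGENET1K_V1").eval()
fc7 = nn.Sequential(vgg.features, vgg.avgpool, nn.Flatten(),
                    vgg.classifier[:5])       # stop after fc7 + ReLU

with torch.no_grad():
    frames = torch.rand(10, 3, 224, 224)      # 10 RGB (or flow-image) frames
    feats = fc7(frames)                       # (10, 4096) per-frame LSTM inputs

# text: a 500-d word embedding, i.e. a linear map applied to one-hot vectors
embed = nn.Embedding(num_embeddings=20000, embedding_dim=500)
word_vecs = embed(torch.tensor([1, 7, 42]))   # (3, 500)
```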
Experimental Setup (1/3)
• Video description datasets
• Microsoft Video Description Corpus (MSVD)
• a collection of YouTube clips & single-sentence descriptions from annotators
• MPII Movie Description Dataset (MPII-MD)
• Hollywood movies & movie scripts and audio description data
• Montreal Video Annotation Dataset (M-VAD)
• Hollywood movies & audio description data for the visually impaired
➢ They used a single sentence as the target sentence for each video.
Experimental Setup (2/3)
Table 1: Corpus statistics
Example of MPII-MD (“A Dataset for Movie Description”, Anna Rohrbach, Marcus Rohrbach, Niket Tandon, Bernt Schiele, CVPR 2015)
Experimental Setup (3/3)
• Evaluation Metrics
• METEOR [7]
• METEOR compares exact token matches, stemmed tokens, and paraphrase matches, as well as semantically similar matches using WordNet synonyms (see the example below).
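The paper reports scores from the METEOR tool of [7]; as a quick illustrative stand-in, nltk ships a reimplementation (recent nltk versions expect pre-tokenized input and require the WordNet data via `nltk.download('wordnet')`):

```python
from nltk.translate.meteor_score import meteor_score

reference = "a man is playing a guitar".split()
hypothesis = "a man plays the guitar".split()
# rewards exact, stemmed, and WordNet-synonym matches; higher is better
print(meteor_score([reference], hypothesis))
```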
• Experimental details of the models
• unroll the LSTM to a fixed 80 time steps during training (a truncate/pad sketch follows this list)
• for longer videos, truncate the number of frames
• for shorter videos, pad the remaining inputs with zeros
• mini-batch size: up to 8 for AlexNet, up to 3 for the flow model
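A small sketch of the truncate/pad scheme just described (hypothetical helper; in the paper the 80 unrolled steps cover both the frame inputs and the generated words):

```python
import numpy as np

def fit_frames(feats, num_steps=80, feat_dim=4096):
    """Truncate longer videos and zero-pad shorter ones to num_steps rows."""
    out = np.zeros((num_steps, feat_dim), dtype=np.float32)
    n = min(len(feats), num_steps)
    out[:n] = feats[:n]          # copy what fits; the rest stays zero padding
    return out

padded = fit_frames(np.random.rand(120, 4096))   # a long video gets truncated
```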
Results and Discussion – MSVD dataset –
• The S2VT AlexNet model on RGB video frames achieves 27.9% METEOR.
• The flow model alone performs poorly.
• Polysemous words (e.g., “playing”):
• playing a guitar
• playing golf
Results and Discussion – Movie description datasets –
• It was best to use dropout at the inputs and outputs of both LSTM layers (see the sketch below).
• SMT [28]
• translates holistic video representations to a single sentence
• Visual-Labels [27]
• an LSTM-based approach that uses no temporal encoding, but more diverse visual features, namely object detectors as well as activity and scene classifiers
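A minimal sketch of that dropout placement (hypothetical PyTorch code; the rate and layer sizes are illustrative):

```python
import torch.nn as nn

class TwoLayerLSTMWithDropout(nn.Module):
    def __init__(self, in_dim=4096, hid=1000, p=0.5):
        super().__init__()
        self.drop = nn.Dropout(p)
        self.lstm1 = nn.LSTM(in_dim, hid, batch_first=True)
        self.lstm2 = nn.LSTM(hid, hid, batch_first=True)

    def forward(self, x):
        h1, _ = self.lstm1(self.drop(x))  # dropout on layer-1 inputs
        h1 = self.drop(h1)                # dropout on layer-1 outputs (= layer-2 inputs)
        h2, _ = self.lstm2(h1)
        return self.drop(h2)              # dropout on layer-2 outputs
```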
Conclusion
• They construct descriptions using a sequence to sequence model, where frames are first read sequentially and then words are generated sequentially.
• Their model achieves state-of-the-art performance on the MSVD dataset.
• For further information...
• https://www.cs.utexas.edu/~vsub/s2vt.html