
Show, Attend, and Tell Neural Image Caption Generation

with Visual Attention

Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville,

Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio

University of Montreal and University of Toronto

Presented by: Hannah Li, Sivaraman K S

1

Introduction

We can easily:

Segment, localize, and categorize

However,

Interpreting the image is more difficult

Goal of this work: generate captions for images using an attention mechanism

2

Related Work - Generating Image Captions

- Recurrent neural networks (Cho et al., 2014; Bahdanau et al., 2014; Sutskever et al., 2014)

- LSTM for videos and images (Vinyals et al., 2014, Donahue et al., 2014)

- Joint CNN-RNN with object detection (Karpathy & Li, 2014, Fang et al., 2014)

- Attention (Larochelle & Hinton 2010)

3

Model Overview

4

Generates a caption y as a sequence of encoded words

Encoder: Convolutional Features

Goal: take a raw image as input and produce a set of feature vectors (annotation vectors)

Produces L vectors (each a D-dimensional representation corresponding to a part of the image)

5
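The shape of the encoder output can be sketched as follows. The paper extracts features from a convolutional layer of a VGG-style CNN, giving a 14x14x512 map (so L = 196, D = 512); the helper name below is hypothetical:

```python
import numpy as np

def to_annotation_vectors(feature_map):
    """Flatten an (H, W, D) convolutional feature map into
    L = H * W annotation vectors a_1..a_L, each D-dimensional."""
    h, w, d = feature_map.shape
    return feature_map.reshape(h * w, d)  # shape (L, D)

# Illustrative stand-in for a real CNN feature map of one image.
feature_map = np.random.rand(14, 14, 512)
a = to_annotation_vectors(feature_map)
print(a.shape)  # (196, 512)
```

Each row a_i summarizes one spatial region of the image, which is what the attention mechanism later weighs.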

Decoder: Long Short-Term Memory Network

6

Input, forget, memory, output and hidden state

W, U, Z: weight matrices
b: biases
E: an embedding matrix
z_t: representation of the relevant part of the image at time t
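One decoder step can be sketched in numpy. The gate stacking order and all dimensions below are illustrative choices, not taken from the paper; the point is that the previous word embedding Ey, the previous hidden state, and the context vector z all feed the gates:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(Ey, h_prev, c_prev, z, W, U, Z, b):
    """One attention-conditioned LSTM step.
    Ey: embedded previous word, z: context vector from attention.
    W, U, Z stack the weights for the i, f, o, g gates (4m rows)."""
    m = h_prev.shape[0]
    pre = W @ Ey + U @ h_prev + Z @ z + b  # all gate pre-activations, (4m,)
    i = sigmoid(pre[0*m:1*m])              # input gate
    f = sigmoid(pre[1*m:2*m])              # forget gate
    o = sigmoid(pre[2*m:3*m])              # output gate
    g = np.tanh(pre[3*m:4*m])              # candidate memory
    c = f * c_prev + i * g                 # new memory cell
    h = o * np.tanh(c)                     # new hidden state
    return h, c

rng = np.random.default_rng(0)
m, e, d = 8, 5, 6  # hidden, embedding, and feature dims (illustrative)
W = rng.normal(size=(4*m, e))
U = rng.normal(size=(4*m, m))
Z = rng.normal(size=(4*m, d))
b = np.zeros(4*m)
h, c = lstm_step(rng.normal(size=e), np.zeros(m), np.zeros(m),
                 rng.normal(size=d), W, U, Z, b)
print(h.shape, c.shape)
```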

Decoder: Long Short-Term Memory Network

7

Logistic sigmoid activation

Deep output layer to compute the output word probability

Stochastic attention: the probability that location i is the correct place to focus on for producing the next word

Deterministic attention: the relative importance to give to location i in blending the a_i's together
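In both cases the attention weights alpha_i come from a softmax over per-location scores f_att(a_i, h_{t-1}). The one-layer scoring MLP below is a hedged stand-in for the paper's f_att, with made-up dimensions:

```python
import numpy as np

def attention_weights(a, h_prev, Wa, Wh, v):
    """Score each annotation vector against the previous hidden state,
    then normalize the scores with a softmax so they sum to 1."""
    scores = np.tanh(a @ Wa + h_prev @ Wh) @ v   # one score per location, (L,)
    e = np.exp(scores - scores.max())            # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(1)
L, D, m, k = 4, 6, 5, 7  # locations, feature dim, hidden dim, score dim
a = rng.normal(size=(L, D))
alpha = attention_weights(a, rng.normal(size=m),
                          rng.normal(size=(D, k)),
                          rng.normal(size=(m, k)),
                          rng.normal(size=k))
print(alpha.sum())  # 1.0 (up to floating point)
```

Hard attention then samples one location from these alphas; soft attention uses them directly as blending weights.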

Hard Attention

Hard attention: learning to maximize the context vector z, formed from a one-hot encoded variable s_{t,i} and the extracted features a_i

Trained using a sampling method

s_t: where the model decides to focus attention when generating the t-th word

Stochastic: s_t is drawn from a multinoulli distribution

8
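The forward pass of hard attention can be sketched as below: sample a single focus location from the multinoulli defined by the alphas and use that location's feature as the context. (The gradient side, a sampling-based estimator, is not shown; all values here are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.normal(size=(4, 6))              # L=4 annotation vectors, D=6
alpha = np.array([0.1, 0.6, 0.2, 0.1])   # attention weights, sum to 1

# Sample one location s_t ~ Multinoulli(alpha): a hard, one-hot choice.
s_t = rng.choice(len(alpha), p=alpha)
z_t = a[s_t]                             # context is exactly one a_i

print(s_t, z_t.shape)
```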

Soft Attention

Learning by maximizing the expectation of the context vector

Trained End-to-End

Deterministic: the whole distribution is optimized rather than single choices (s_t is not sampled from a distribution)

9
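Soft attention replaces the sampled choice with its expectation: the context vector is simply the alpha-weighted sum of the annotation vectors, which is differentiable end-to-end. A minimal sketch with made-up numbers:

```python
import numpy as np

a = np.arange(12, dtype=float).reshape(4, 3)  # L=4 annotation vectors, D=3
alpha = np.array([0.1, 0.6, 0.2, 0.1])        # attention weights, sum to 1

# E[z_t] = sum_i alpha_i * a_i  -- a smooth blend over all locations.
z_t = alpha @ a
print(z_t)  # [3.9 4.9 5.9]
```

Because every a_i contributes, gradients flow to all locations, which is why this variant trains with plain backpropagation.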

Training

The attention framework learns latent alignments from scratch instead of explicitly using object detectors.

Allows the model to go beyond "objectness" and learn to attend to abstract concepts.

10

Datasets

Flickr8k and Flickr30k datasets

5 reference captions per image

MS COCO dataset

Discarded captions in excess of 5 per image

Applied basic tokenization

Fixed vocabulary size of 10K

11
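The preprocessing above can be sketched as basic whitespace tokenization plus a fixed-size vocabulary with an out-of-vocabulary token. The captions and the tiny size cap below are illustrative (the paper uses 10K):

```python
from collections import Counter

captions = ["a dog runs on the grass", "a dog plays with a ball"]
VOCAB_SIZE = 10  # stand-in for the paper's fixed 10K vocabulary

# Count tokens across all captions, keep only the most frequent ones.
counts = Counter(tok for c in captions for tok in c.lower().split())
vocab = {"<unk>": 0}  # index 0 reserved for out-of-vocabulary words
for tok, _ in counts.most_common(VOCAB_SIZE - 1):
    vocab[tok] = len(vocab)

# Encode each caption as a sequence of word indices.
encoded = [[vocab.get(tok, 0) for tok in c.lower().split()] for c in captions]
print(len(vocab), encoded[0])
```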

Results

1. Significantly improves state-of-the-art METEOR performance on MS COCO
2. More flexibility: can attend to salient non-object regions

12

13

Source: http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/12/Screen-Shot-2015-12-30-at-1.42.58-PM.png

Analysis of learning to attend

14

Mistakes

15
