Deep Learning for Computer Vision: Attention Models (UPC 2016)

[course site]

Attention Models

Day 4 Lecture 6

Amaia Salvador

Attention Models: Motivation

Image: H x W x 3

The whole input volume is used to predict the output...

Image: H x W x 3

The whole input volume is used to predict the output...

...despite the fact that not all pixels are equally important

Attention models canrelieve computational burden

Helpful when processing big images !

Attention models canrelieve computational burden

Helpful when processing big images !

Encoder & Decoder

Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)

From previous lecture...

The whole input sentence is used to produce the translation

Attention Models

Bahdanau et al. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)

Attention Models

A bird flying over a body of water

Idea: Focus in different parts of the input as you make/refine predictions in time

E.g.: Image Captioning

LSTM Decoder

LSTMLSTM LSTMCNN LSTM

A bird flying

The LSTM decoder “sees” the input only at the beginning !

Features: D

Attention for Image Captioning

Image: H x W x 3

Features: L x D

Slide Credit: CS231n 10

Image: H x W x 3

Features: L x D

Slide Credit: CS231n

Attention weights (LxD)

Image: H x W x 3

Features: L x D

Weighted combination of features

First word

Weighted features: D

predicted word

Image: H x W x 3

Features: L x D

Weighted combination of features

First word

z2 y2Weighted

features: D

predicted word

Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 201514

Soft Attention

Image: H x W x 3

Grid of features(Each

D-dimensional)

Distribution over grid locations

pa + pb + pc + pc = 1

Soft attention:Summarize ALL locationsz = paa+ pbb + pcc + pdd

Derivative dz/dp is nice!Train with gradient descent

Context vector z

(D-dimensional)From RNN:

Soft Attention

Image: H x W x 3

Grid of features(Each

D-dimensional)

Distribution over grid locations

pa + pb + pc + pc = 1

Soft attention:Summarize ALL locationsz = paa+ pbb + pcc + pdd

Differentiable functionTrain with gradient descent

Context vector z

(D-dimensional)From RNN:

● Still uses the whole input !● Constrained to fix grid

Hard attention

Input image:H x W x 3

Box Coordinates:(xc, yc, w, h)

Cropped and rescaled image:

X x Y x 3

Not a differentiable function !

Can’t train with backprop :(

Hard attention:Sample a subset

of the input

need reinforcement learning

Gradient is 0 almost everywhereGradient is undefined at x = 0

Hard attention

Gregor et al. DRAW: A Recurrent Neural Network For Image Generation. ICML 2015

Generate images by attending to arbitrary regions of the output

Classify images by attending to arbitrary regions of the input

Hard attention

Gregor et al. DRAW: A Recurrent Neural Network For Image Generation. ICML 2015 21

Hard attention

22Graves. Generating Sequences with Recurrent Neural Networks. arXiv 2013

Read text, generate handwriting using an RNN that attends at different arbitrary regions over time

GENERATED

Hard attention

X x Y x 3

N bird

Can’t train with backprop :( 23

Spatial Transformer Networks

X x Y x 3

N bird

Jaderberg et al. Spatial Transformer Networks. NIPS 2015

Can’t train with backprop :(

Make it differentiable

Train with backprop :) 24

X x Y x 3

Can we make this function differentiable?

Idea: Function mapping pixel coordinates (xt, yt) of output to pixel coordinates (xs, ys) of input

Repeat for all pixels in output to get a sampling grid

Then use bilinear interpolation to compute output

Network attends to input by predicting

Easy to incorporate in any network, anywhere !

Differentiable module

Insert spatial transformers into a classification network and it learns to attend and transform the input

Fine-grained classification

Visual Attention

Zhu et al. Visual7w: Grounded Question Answering in Images. arXiv 2016

Visual Question Answering

Visual Attention

Sharma et al. Action Recognition Using Visual Attention. arXiv 2016Kuen et al. Recurrent Attentional Networks for Saliency Detection. CVPR 2016

Salient Object DetectionAction Recognition in Videos

Other examples

Chen et al. Attention to Scale: Scale-aware Semantic Image Segmentation. CVPR 2016You et al. Image Captioning with Semantic Attention. CVPR 2016

Attention to scale for semantic segmentation

Semantic attentionFor image captioning

Resources

● CS231n Lecture @ Stanford [slides][video]● More on Reinforcement Learning● Soft vs Hard attention● Handwriting generation demo● Spatial Transformer Networks - Slides & Video by Victor Campos● Attention implementations:

○ Seq2seq in Keras○ DRAW & Spatial Transformers in Keras○ DRAW in Lasagne○ DRAW in Tensorflow

Deep Learning for Computer Vision: Attention Models (UPC 2016)

Science

Deep Attention Model for the Hierarchical Diagnosis of ...openaccess.thecvf.com/content_CVPRW_2019/papers/ISIC/Barata_Deep... · Deep Attention Model for the Hierarchical Diagnosis

Attention-based Models (DLAI D8L 2017 UPC Deep Learning for Artificial Intelligence)

Deep Learning for Computer Vision: Deep Networks (UPC 2016)

Methodology (DLAI D6L2 2017 UPC Deep Learning for Artificial Intelligence)

SADeepSense: Self-Attention Deep Learning Framework for

Deep Generative Models II (DLAI D10L1 2017 UPC Deep Learning for Artificial Intelligence)

Deep Learning for Computer Vision: Backward Propagation (UPC 2016)

Deep Spatial-Semantic Attention for Fine-Grained Sketch ...homepages.inf.ed.ac.uk/thospeda/papers/song2017fgSBIR.pdf · Deep Spatial-Semantic Attention for Fine-Grained Sketch-Based

Deep Learning for Computer Vision: Object Detection (UPC 2016)

Attention-Guided Image Compression by Deep Reconstruction

Deep Learning for Computer Vision: Medical Imaging (UPC 2016)

Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)

Deep Learning for Computer Vision: Image Classification (UPC 2016)

Deep Learning for Computer Vision: Recurrent Neural Networks (UPC 2016)

Deep Learning for Computer Vision: Training (UPC 2016)

Deep Learning for Computer Vision: Face Recognition (UPC 2016)

Deep Learning for Computer Vision: Closing (UPC 2016)

Deep Generative Models I (DLAI D9L2 2017 UPC Deep Learning for Artificial Intelligence)

SA-NET: SHUFFLE ATTENTION FOR DEEP CONVOLUTIONAL …

Deep Attention Neural Tensor Network for Visual Question ...openaccess.thecvf.com/.../Yalong_Bai_Deep_Attention...Deep Attention Neural Tensor Network for Visual QuestionAnswering