Top-down Neural Attention

Selective Attention from a Deep Neural Net

General's Family by Octavio Ocampo

Background

Understanding Artificial Neural Networks

© Jianming Zhang, derivative work. Original image credit: soul wind / stock.adobe.com

Problem Definition

Deep CNN

• animal
• elephant
• zebra
• grass
• africa

Top-down Signal: elephant

Top-down Attention Map

Probabilistic Winner-Take-All

[1] Tsotsos et al. “Modeling Visual Attention via Selective Tuning.” Artificial Intelligence, 1995.

Winner-Take-All [1]

Marginal Winning Probability (MWP): Equivalent to an Absorbing Markov Chain process.
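
One way to write the MWP down, assuming a layered network in which each neuron passes its winning probability to the neurons one layer below it. The symbols $a_i$, $a_j$ and $\mathcal{P}_j$ (the set of parent neurons of $a_j$, one layer closer to the output) are illustrative notation, not taken from the slides:

```latex
P(a_j) = \sum_{a_i \in \mathcal{P}_j} P(a_j \mid a_i)\, P(a_i)
```

Here $P(a_i)$ starts out as the top-down signal on the output layer, and $P(a_j \mid a_i)$ is the probability that child $a_j$ wins given that parent $a_i$ has won. Reading these conditional probabilities as the transition probabilities of a random walk from the output layer toward the input is what gives the absorbing Markov chain interpretation.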

output layer

Probabilistic WTA

Excitation Backprop

Assumptions:
§ The response of the activation neuron is non-negative.
§ An activation neuron is tuned to detect certain visual features. Its response is positively correlated to its confidence of the detection.

Activation Layer N

Activation Layer N-1

Connection signs: + + + −

Inhibitory Neuron

Excitatory Neuron
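
The two assumptions suggest a simple propagation rule: a parent neuron in layer N distributes its winning probability among its children in layer N-1 in proportion to (child response × positive connection weight), and inhibitory (negative) connections receive nothing. The sketch below is a minimal NumPy illustration of that rule for one fully connected layer. The function name, argument names and toy numbers are hypothetical, and the exact normalization is an assumption based on the probabilistic WTA description rather than something stated on the slides.

```python
import numpy as np

def excitation_backprop_layer(weights, child_act, parent_prob):
    """Propagate winning probabilities from layer N down to layer N-1.

    weights:     (n_parent, n_child); weights[i, j] connects child j to parent i
    child_act:   (n_child,) non-negative responses of layer N-1
    parent_prob: (n_parent,) marginal winning probabilities of layer N
    """
    weights = np.asarray(weights, dtype=float)
    child_act = np.asarray(child_act, dtype=float)
    parent_prob = np.asarray(parent_prob, dtype=float)
    if np.any(child_act < 0):
        raise ValueError("activation responses are assumed to be non-negative")

    # Keep excitatory (positive) connections only; inhibitory ones get zero.
    excitation = np.clip(weights, 0.0, None) * child_act[None, :]

    # Normalize per parent so each row is a conditional distribution over children.
    norm = excitation.sum(axis=1, keepdims=True)
    cond = np.divide(excitation, norm,
                     out=np.zeros_like(excitation), where=norm > 0)

    # Marginalize over parents: P(child j) = sum_i P(child j | parent i) P(parent i).
    return parent_prob @ cond

# Toy example (hypothetical numbers): 2 parents, 3 children.
weights = np.array([[1.0, -2.0, 3.0],
                    [0.5, 0.5, -1.0]])
child_act = np.array([2.0, 1.0, 1.0])
top_down = np.array([1.0, 0.0])  # all probability mass on parent 0

child_prob = excitation_backprop_layer(weights, child_act, top_down)
# child_prob is approximately [0.4, 0.0, 0.6]; child 1 is inhibitory for parent 0.
```

Because each row of the conditional matrix sums to one, the total probability mass is preserved from layer to layer, which is what keeps the resulting maps normalized.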

A Common Issue: Insensitivity to Top-down Signals

zebra elephant

Dominant neurons always win

Contrastive Attention

zebra elephant

elephant zebra

Negating the Output Layer for Contrastive Signals

zebra classifier

non-zebra classifier

zebra map, non-zebra map

Thanks to our Excitation Backprop formulation:
§ The contrastive attention map can be computed in a single pass.
§ The pair of maps is well normalized by our probabilistic framework.
§ The pair of maps is positive-valued.
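
A minimal sketch of why a single pass is enough, under stated assumptions: the class map and its dual (from the output layer with negated weights) share every layer below the output, and propagation is linear in the incoming probabilities, so their difference can be propagated once instead of propagating both maps separately. The two-layer toy network, the helper names, and the final step of subtracting the maps and clipping negatives to zero are illustrative assumptions, not details given on the slides.

```python
import numpy as np

def ebp_step(weights, child_act, parent_prob):
    """One excitation-backprop step through a linear layer (positive weights only)."""
    excitation = np.clip(weights, 0.0, None) * child_act[None, :]
    norm = excitation.sum(axis=1, keepdims=True)
    cond = np.divide(excitation, norm,
                     out=np.zeros_like(excitation), where=norm > 0)
    return parent_prob @ cond

def contrastive_map(w_out, w_hidden, hidden_act, input_act, class_idx):
    """Return (contrastive map, class map, dual map) for a toy two-layer net."""
    signal = np.zeros(w_out.shape[0])
    signal[class_idx] = 1.0  # one-hot top-down signal

    # Output layer: original classifier vs. its negated ("non-class") dual.
    pos_hidden = ebp_step(w_out, hidden_act, signal)
    neg_hidden = ebp_step(-w_out, hidden_act, signal)

    # Single pass: propagate the difference through the shared lower layer.
    diff = ebp_step(w_hidden, input_act, pos_hidden - neg_hidden)
    contrast = np.clip(diff, 0.0, None)  # assumed: keep only positive evidence

    # Two-pass reference, computed here only to compare against the single pass.
    pos_map = ebp_step(w_hidden, input_act, pos_hidden)
    neg_map = ebp_step(w_hidden, input_act, neg_hidden)
    return contrast, pos_map, neg_map

# Toy network (hypothetical numbers): 2 classes, 3 hidden units, 4 inputs.
w_out = np.array([[2.0, 1.0, -1.0],
                  [2.0, -1.0, 1.0]])
hidden_act = np.array([3.0, 1.0, 1.0])
w_hidden = np.array([[1.0, 0.5, -1.0, 0.0],
                     [0.0, 1.0, 1.0, -0.5],
                     [-1.0, 0.0, 0.5, 2.0]])
input_act = np.array([1.0, 2.0, 1.0, 1.0])

contrast, pos_map, neg_map = contrastive_map(
    w_out, w_hidden, hidden_act, input_act, class_idx=0)
```

In this toy setup both `pos_map` and `neg_map` sum to one, so the subtraction compares two maps on the same scale, and `contrast` matches the clipped difference of the two separately propagated maps.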

Evaluation: The Pointing Game

§ Task:
› Given an image and an object category, point to the targets.
§ Metric:
› Pointing accuracy.
› Pointing anywhere on the targets is fine.
§ Dataset:
› VOC07 (20 categories)
› COCO (80 categories)
§ CNN Models:
› CNN-S [Chatfield et al. BMVC’14]
› VGG16 [Simonyan et al. ICLR’15]
› GoogLeNet [Szegedy et al. CVPR’15]
§ Model training:
› Multi-label cross-entropy loss
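
The metric above can be sketched as follows. This assumes that "pointing" means taking the location of the maximum of the attention map, that a hit is scored when that location falls on the target mask, and that accuracy is averaged per category before taking the mean. The function names and toy data are hypothetical, and any localization tolerance used in the actual evaluation is not modeled.

```python
import numpy as np

def pointing_hit(attention_map, target_mask):
    """Return True if the maximum point of the map lies on the target."""
    row, col = np.unravel_index(np.argmax(attention_map), attention_map.shape)
    return bool(target_mask[row, col])

def mean_pointing_accuracy(results):
    """results: dict mapping category name -> list of booleans (hit or miss)."""
    per_category = [np.mean(hits) for hits in results.values() if len(hits) > 0]
    return float(np.mean(per_category))

# Toy example (hypothetical data): a 3x3 map whose maximum is at (1, 2).
attention = np.array([[0.0, 0.1, 0.0],
                      [0.1, 0.2, 0.5],
                      [0.0, 0.1, 0.0]])
mask = np.zeros((3, 3), dtype=bool)
mask[1, 2] = True  # the target covers the maximum point

hit = pointing_hit(attention, mask)
accuracy = mean_pointing_accuracy({
    "zebra": [True, False],     # 1 hit, 1 miss -> 0.5
    "elephant": [True, True],   # 2 hits        -> 1.0
})
```

Averaging per category first keeps frequent categories from dominating the score, which matches reporting mean accuracy over object categories.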

credit: elena milevska / stock.adobe.com

Results

Mean Accuracy over Object Categories in the Pointing Game

Qualitative Comparison

Text-to-Region Association

§ Visualizing the top-down attention of a CNN classifier for ~18K tags.
