Jin-Hwa Kim
BI Lab, Seoul National University
Multimodal Residual Learning for Visual Question-Answering
Table of Contents
1. VQA: Visual Question Answering
2. Vision Part
3. Question Part
4. Multimodal Residual Learning
5. Results
6. Discussion
7. Recent Works
8. Q&A
1. VQA: Visual Question Answering
VQA is a new dataset containing open-ended questions about images; answering them requires an understanding of vision, language, and commonsense knowledge.
VQA Challenge
Antol et al., ICCV 2015
1. VQA: Visual Question Answering
Examples of human answers given the image and question, or the question only
VQA Challenge
Antol et al., ICCV 2015
1. VQA: Visual Question Answering
Numbers
# images: 204,721 (MS COCO)
# questions: 760K (3 per image)
# answers: 10M (10 per question + ɑ)

             Images  Questions  Answers
Training       80K     240K      2.4M
Validation     40K     120K      1.2M
Test           80K     240K       -

Antol et al., ICCV 2015
1. VQA: Visual Question Answering
Test Dataset
80K test images / four splits of 20K images each
Test-dev (development): debugging and validation; 10 submissions/day to the evaluation server.
Test-standard (publications): used to score entries for the Public Leaderboard.
Test-challenge (competitions): used to rank challenge participants.
Test-reserve (check overfitting): used to estimate overfitting. Scores on this set are never released.
Slide adapted from: MSCOCO Detection/Segmentation Challenge, ICCV 2015
2. Vision Part
ResNet: The Standout among Convolutional Neural Networks
He et al., CVPR 2016
1st place on the ImageNet 2015 classification task
1st place on the ImageNet 2015 detection task
1st place on the ImageNet 2015 localization task
1st place on the COCO object detection task
1st place on the COCO object segmentation task
2. Vision Part
ResNet-152
He et al., CVPR 2016
152-layered convolutional neural network (final stage):
Input (3x224x224) -> ... -> Conv 1x1, 512 -> Conv 3x3, 512 -> Conv 1x1, 2048 -> (7x7x2048)
-> AveragePooling -> (1x1x2048) -> Linear -> Softmax
[Figure: ResNet-152 architecture applied to an example image]
2. Vision Part
ResNet-152 as a visual feature extractor
He et al., CVPR 2016
[Figure: the same architecture; the 1x1x2048 activation after average pooling serves as the visual feature.]
2. Vision Part
ResNet-152
He et al., CVPR 2016
Pre-trained models are available!
For Torch, TensorFlow, Theano, Caffe, Lasagne, Neon, and MatConvNet
https://github.com/KaimingHe/deep-residual-networks
3. Question Part
Word-Embedding
What color are her eyes?
preprocessing: what color are her eyes ?
indexing: 53 7 44 127 2 6177
lookup: w53 | w7 | w44 | w127 | w2 | w6177
[Figure: lookup table of row vectors w1^T, w2^T, ..., w7^T, ..., w44^T, ...]
{wi} are learnable parameters updated by the back-propagation algorithm.
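The lookup table can be sketched in a few lines of NumPy. The vocabulary size and embedding dimension below are illustrative assumptions (only the word indices come from the slide), not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 6,178-word vocabulary embedded in 300 dimensions.
vocab_size, embed_dim = 6178, 300
lookup_table = rng.normal(scale=0.1, size=(vocab_size, embed_dim))

# "what color are her eyes ?" after preprocessing and indexing.
indices = [53, 7, 44, 127, 2, 6177]

# The lookup is plain row selection; the rows {w_i} are the learnable
# parameters updated by back-propagation during training.
question_vectors = lookup_table[indices]
print(question_vectors.shape)  # (6, 300)
```

Because the lookup is just indexing, gradients flow only into the rows that were actually selected.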
3. Question Part
Question-Embedding
Step 0: RNN consumes w53 (what): h0 -> h1
Step 1: RNN consumes w7 (color): h1 -> h2
...
Step 5: RNN consumes w6177 (?): h5 -> h6 (use this final state as the question embedding)
3. Question Part
Choice of RNN: Gated Recurrent Units (GRU)
Cho et al., EMNLP 2014; Chung et al., arXiv 2014

z = σ(x_t U^z + s_{t-1} W^z)
r = σ(x_t U^r + s_{t-1} W^r)
h = tanh(x_t U^h + (s_{t-1} ⚬ r) W^h)
s_t = (1 - z) ⚬ h + z ⚬ s_{t-1}
3. Question Part
Skip-Thought Vectors
Pre-trained model for word-embedding and question-embedding
Kiros et al., NIPS 2015
I got back home.
I could see the cat on the steps.
This was strange.
Given the middle sentence, the model tries to reconstruct the previous sentence and the next sentence.
Trained on the BookCorpus dataset (Zhu et al., arXiv 2015).
3. Question Part
Skip-Thought Vectors
Pre-trained model for word-embedding and question-embedding; its encoder serves as a Sent2Vec model.
Kiros et al., NIPS 2015
[Figure: a lookup table of word vectors feeding pre-trained Gated Recurrent Units]
3. Question Part
Skip-Thought Vectors
Pre-trained model (Theano) and porting code (Torch) are available!
https://github.com/ryankiros/skip-thoughts
https://github.com/HyeonwooNoh/DPPnet/tree/master/003_skipthoughts_porting
Kiros et al., NIPS 2015; Noh et al., CVPR 2016
4. Multimodal Residual Learning
Idea 1: Deep Residual Learning
Extend the idea of deep residual learning to multimodal learning
He et al., CVPR 2016
x -> weight layer -> relu -> weight layer -> F(x), then output relu(F(x) + x), with an identity shortcut carrying x around the two layers.
Figure 2. Residual learning: a building block.
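The building block can be written in a few lines; the width and random weights below are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                    # illustrative width

W1 = rng.normal(scale=0.1, size=(d, d))
W2 = rng.normal(scale=0.1, size=(d, d))

def residual_block(x):
    """relu(F(x) + x): two weight layers with a ReLU in between,
    plus the identity shortcut carrying x around them."""
    f = np.maximum(0.0, x @ W1) @ W2      # F(x)
    return np.maximum(0.0, f + x)         # add identity shortcut, then relu

x = rng.normal(size=d)
y = residual_block(x)
print(y.shape)  # (64,)
```

The shortcut means the block only has to learn the residual F(x) = H(x) - x, which is what makes very deep stacks like ResNet-152 trainable.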
4. Multimodal Residual Learning
Idea 2: Hadamard Product for Joint Residual Mapping
One modality is directly involved in the gradient with respect to the other modality:

∂(σ(x) ⊙ σ(y))/∂x = diag(σ′(x) ⊙ σ(y))

[Figure: vQ -> tanh and vI -> tanh, joined by ⊙, followed by softmax]
Scaling problem? (Wu et al., NIPS 2016)
https://github.com/VT-vision-lab/VQA_LSTM_CNN
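The Jacobian identity can be checked numerically; the 4-dimensional random vectors here are arbitrary test inputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), rng.normal(size=4)

# Analytic Jacobian of sigma(x) (*) sigma(y) w.r.t. x: diag(sigma'(x) (*) sigma(y)).
# The other modality y directly scales the gradient element-wise.
s = sigmoid(x)
analytic = np.diag(s * (1.0 - s) * sigmoid(y))

# Central finite differences as a numerical check.
eps = 1e-6
numeric = np.zeros((4, 4))
for i in range(4):
    dx = np.zeros(4)
    dx[i] = eps
    numeric[:, i] = (sigmoid(x + dx) - sigmoid(x - dx)) * sigmoid(y) / (2.0 * eps)

print(np.max(np.abs(analytic - numeric)))
```

The diagonal Jacobian shows why the Hadamard product ties the modalities together: the gradient for the question path is rescaled, element by element, by the vision path (and vice versa).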
4. Multimodal Residual Learning
Multimodal Residual Networks
Kim et al., NIPS 2016
[Figure: the question Q ("What kind of animals are these?") goes through word embedding and an RNN; the image V goes through a CNN; the Multimodal Residual Networks, with question shortcuts and Hadamard products, produce a softmax answer ("sheep").]
Components: word2vec (Mikolov et al., 2013), skip-thought vectors (Kiros et al., 2015), ResNet (He et al., 2015)
4. Multimodal Residual Learning
Multimodal Residual Networks
Kim et al., NIPS 2016
[Figure: three stacked learning blocks H1, H2, H3. In each block, the question path (Linear-Tanh) and the vision path (Linear-Tanh-Linear-Tanh) are merged by a Hadamard product (⊙), then added (⊕) to a linear question shortcut. The image feature V re-enters every block; the final output feeds Linear-Softmax over the answers A.]
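A minimal NumPy sketch of the stacked learning blocks in the model (b) style: a Hadamard joint mapping plus a learned linear question shortcut. The dimension 1200 follows the L = 3 row of Table 3; the tanh nonlinearities match the figure, while the random weights and the omitted softmax classifier are illustrative simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1200                                    # common embedding dim for L = 3

def lin():
    return rng.normal(scale=0.02, size=(d, d))

def make_block():
    Wq, W1, W2, Ws = lin(), lin(), lin(), lin()
    def block(q, v):
        """H(q, v) = Ws q + tanh(Wq q) * tanh(W2 tanh(W1 v)):
        Hadamard joint residual plus a linear question shortcut."""
        joint = np.tanh(q @ Wq) * np.tanh(np.tanh(v @ W1) @ W2)
        return q @ Ws + joint
    return block

blocks = [make_block() for _ in range(3)]   # L = 3 stacked learning blocks

q = rng.normal(size=d)                      # question embedding (e.g. GRU state)
v = rng.normal(size=d)                      # visual feature (e.g. projected ResNet)
h = q
for block in blocks:
    h = block(h, v)                         # V re-enters every block
print(h.shape)  # (1200,)
```

The question representation is refined block by block, while the same visual feature conditions each refinement.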
5. Results
[Figure: alternative learning-block designs (a)-(e), varying the shortcut (identity vs. linear mapping, applied only at l=1 or at every block), extra question embeddings, and where the image feature enters.]

Table 1: The results of alternative models (a)-(e) on the test-dev.

         Open-Ended
      All    Y/N    Num.   Other
(a)  60.17  81.83  38.32  46.61
(b)  60.53  82.53  38.34  46.78
(c)  60.19  81.91  37.87  46.70
(d)  59.69  81.67  37.23  46.00
(e)  60.20  81.98  38.25  46.57

Table 2: The effect of the visual features and # of target answers on the test-dev results. Vgg for VGG-19, and Res for ResNet-152 features described in Section 4.

             Open-Ended
         All    Y/N    Num.   Other
Vgg, 1k  60.53  82.53  38.34  46.78
Vgg, 2k  60.79  82.13  38.87  47.52
Vgg, 3k  60.68  82.40  38.69  47.10
Res, 1k  61.45  82.36  38.40  48.81
Res, 2k  61.68  82.28  38.82  49.25
Res, 3k  61.47  82.28  39.09  48.76
5 Results

The VQA Challenge, which released the VQA dataset, provides evaluation servers for the test-dev and test-standard splits. For test-dev, the evaluation server permits unlimited submissions for validation, while test-standard permits limited submissions for the competition. We report accuracies in percentage.

Alternative Models The test-dev results for the Open-Ended task on the alternative models are shown in Table 1. (a) shows a significant improvement over SAN; however, (b) is marginally better than (a). Compared to (b), (c) deteriorates the performance: an extra embedding for a question vector may easily cause overfitting, leading to overall degradation. Likewise, the identity shortcuts in (d) cause the degradation problem; the extra parameters of the linear mappings may effectively support the task. (e) shows reasonable performance, but the extra shortcut is not essential. The empirical results seem to support this idea: the question-only model (50.39%) achieves a result competitive with the joint models (57.75%), while the image-only model gets a poor accuracy (28.13%) (see Table 2 in [2]). Eventually, we chose model (b) for its best performance and relative simplicity.

Number of Learning Blocks To confirm the effectiveness of the number of learning blocks selected via a pilot test (L = 3), we explore this on the chosen model (b) again. As the depth increases, the overall accuracies are 58.85 (L = 1), 59.44 (L = 2), 60.53 (L = 3), and 60.42 (L = 4).

Visual Features The ResNet-152 visual features are significantly better than the VGG-19 features for the Other type in Table 2, even though the dimension of the ResNet features (2,048) is half that of the VGG features (4,096). The ResNet visual features are also used in the previous work [9]; however, our model achieves a remarkably better performance with a large margin (see Table 3).

Number of Target Answers The number of target answers slightly affects the overall accuracies, with a trade-off among answer types, so the decision on the number of target answers is difficult to make. We simply chose Res, 2k in Table 2 based on the overall accuracy.

Comparisons with State-of-the-art Our chosen model significantly outperforms other state-of-the-art methods for both Open-Ended and Multiple-Choice tasks in Table 3. However, the performance on the Number and Other types is still unsatisfying compared to human performance, though the advances in recent works were mainly for Other-type answers. This fact motivates the study of a counting mechanism in future work. The model comparison is performed on the test-standard results.
A Appendix

A.1 VQA test-dev Results

Table 1: The effects of various options for VQA test-dev. Here, the model of Figure 3a is used, since these experiments are preliminarily conducted. VGG-19 features and 1k target answers are used. s stands for the usage of Skip-Thought Vectors [6] to initialize the question embedding model of GRU, b stands for the usage of Bayesian Dropout [3], and c stands for the usage of postprocessing using an image captioning model [5].

                 Open-Ended                   Multiple-Choice
          All    Y/N    Num.   Other   All    Y/N    Num.   Other
baseline 58.97  81.11  37.63  44.90   63.53  81.13  38.91  54.06
s        59.38  80.65  38.30  45.98   63.71  80.68  39.73  54.65
s,b      59.74  81.75  38.13  45.84   64.15  81.77  39.54  54.67
s,b,c    59.91  81.75  38.13  46.19   64.18  81.77  39.51  54.72

Table 2: The results for VQA test-dev. The precision of some accuracies [11, 2, 10] is one digit less than others, so they are zero-filled to match.

                        Open-Ended                   Multiple-Choice
                 All    Y/N    Num.   Other   All    Y/N    Num.   Other
Question [1]    48.09  75.66  36.70  27.14   53.68  75.71  37.05  38.64
Image [1]       28.13  64.01  00.42  03.77   30.53  69.87  00.45  03.76
Q+I [1]         52.64  75.55  33.67  37.37   58.97  75.59  34.35  50.33
LSTM Q [1]      48.76  78.20  35.68  26.59   54.75  78.22  36.82  38.78
LSTM Q+I [1]    53.74  78.94  35.24  36.42   57.17  78.95  35.80  43.41
Deep Q+I [7]    58.02  80.87  36.46  43.40   62.86  80.88  37.78  53.14
DPPnet [8]      57.22  80.71  37.24  41.69   62.48  80.79  38.94  52.16
D-NMN [2]       57.90  80.50  37.40  43.10   -      -      -      -
SAN [11]        58.70  79.30  36.60  46.10   -      -      -      -
ACK [9]         59.17  81.01  38.42  45.23   -      -      -      -
FDA [4]         59.24  81.14  36.16  45.77   64.01  81.50  39.00  54.72
DMN+ [10]       60.30  80.50  36.80  48.30   -      -      -      -
Vgg, 1k         60.53  82.53  38.34  46.78   64.79  82.55  39.93  55.23
Vgg, 2k         60.77  82.10  39.11  47.46   65.27  82.12  40.84  56.39
Vgg, 3k         60.68  82.40  38.69  47.10   65.09  82.42  40.13  55.93
Res, 1k         61.45  82.36  38.40  48.81   65.62  82.39  39.65  57.15
Res, 2k         61.68  82.28  38.82  49.25   66.15  82.30  40.45  58.16
Res, 3k         61.47  82.28  39.09  48.76   66.33  82.41  39.57  58.40
5. Results

Table 3: The effects of shortcut connections of MRN for VQA test-dev. ResNet-152 features and 2k target answers are used. MN stands for Multimodal Networks without residual learning, which does not have any shortcut connections. Dim. stands for the common embedding vector's dimension. The number of parameters for word embedding (9.3M) and question embedding (21.8M) is subtracted from the total number of parameters in this table.

                           Open-Ended
      L  Dim.  #params   All    Y/N    Num.   Other
MN    1  4604  33.9M    60.33  82.50  36.04  46.89
MN    2  2350  33.9M    60.90  81.96  37.16  48.28
MN    3  1559  33.9M    59.87  80.55  37.53  47.25
MRN   1  3355  33.9M    60.09  81.78  37.09  46.78
MRN   2  1766  33.9M    61.05  81.81  38.43  48.43
MRN   3  1200  33.9M    61.68  82.28  38.82  49.25
MRN   4   851  33.9M    61.02  82.06  39.02  48.04

A.2 More Examples

Figure 1: More examples of Figure 4 in Section 5.2.
5. Results

Table 3: The VQA test-standard results. The precision of some accuracies [30, 1] is one digit less than others, so they are zero-filled to match.

                       Open-Ended                   Multiple-Choice
                All    Y/N    Num.   Other   All    Y/N    Num.   Other
DPPnet [21]    57.36  80.28  36.92  42.24   62.69  80.35  38.79  52.79
D-NMN [1]      58.00  -      -      -       -      -      -      -
Deep Q+I [11]  58.16  80.56  36.53  43.73   63.09  80.59  37.70  53.64
SAN [30]       58.90  -      -      -       -      -      -      -
ACK [27]       59.44  81.07  37.12  45.83   -      -      -      -
FDA [9]        59.54  81.34  35.67  46.10   64.18  81.25  38.30  55.20
DMN+ [28]      60.36  80.43  36.82  48.33   -      -      -      -
MRN            61.84  82.39  38.23  49.41   66.33  82.41  39.57  58.40
Human [2]      83.30  95.77  83.39  72.67   -      -      -      -
5.1 Visualization

In Equation 3, the left term σ(W_q q) can be seen as a masking (attention) vector that selects a part of the visual information. We assume that the difference between the right term V = σ(W_2 σ(W_1 v)) and the masked vector F(q, v) indicates an attention effect caused by the masking vector. Then, the attention effect L_att = (1/2)(V − F)^2 is visualized on the image by calculating the gradient of L_att with respect to a given image I:

∂L_att/∂I = (∂V/∂I)(V − F)   (6)

This technique can be applied to each learning block in a similar way. Since we use preprocessed visual features, the pretrained CNN is augmented to reach the input image only for this visualization. Note that model (b) in Table 1 is used for this visualization, and the pretrained VGG-19 is used for preprocessing and augmentation. The model is trained using the training set of the VQA dataset, and visualized using the validation set. Examples are shown in Figure 4. Unlike other works [30, 28] that use explicit attention parameters, MRN does not use any explicit attentional mechanism. Still, we observe the interpretability of element-wise multiplication as an information masking, which leads to a novel method to visualize the attention effect of this operation. Since MRN does not depend on a few attention parameters (e.g. 14 × 14), our visualization method shows a higher resolution than others [30, 28]. Based on this, we argue that MRN is an implicit attention model without an explicit attention mechanism.
6 Conclusions

The idea of deep residual learning is applied to visual question-answering tasks. Based on two observations from previous works, various alternative models are suggested and validated to propose the three-block layered MRN. Our model achieves state-of-the-art results on the VQA dataset for both Open-Ended and Multiple-Choice tasks. Moreover, we introduce a novel method to visualize the spatial attention from the collapsed visual features using back-propagation.

Acknowledgments

This work was supported by Naver Corp. and the Korea government (IITP-R0126-16-1072-SW.StarLab, KEIT-10044009-HRI.MESSI, KEIT-10060086-RISF).
6. Discussions
Visualization
[Figure: the three-block MRN (Q and V in, Linear-Tanh paths merged by ⊙ and ⊕, final Linear-Softmax over answers A), with the pretrained model attached; the attention-effect visualization is computed at each of the three learning blocks (1st, 2nd, and 3rd visualization).]
6. Discussions
Examples (a)-(h):
What kind of animals are these? sheep
What animal is the picture? elephant
What is this animal? zebra
What game is this person playing? tennis
How many cats are here? 2
What color is the bird? yellow
What sport is this? surfing
Is the horse jumping? yes
Acknowledgments
This work was supported by Naver Corp. and partly by the Korea government (IITP-R0126-16-1072-SW.StarLab, KEIT-10044009-HRI.MESSI, KEIT-10060086-RISF, ADD-UD130070ID-BMRR).
Recent Works
Low-rank Bilinear Pooling
Kim et al., arXiv 2016

Bilinear model:
f_i = Σ_{j=1}^{N} Σ_{k=1}^{M} w_ijk x_j y_k + b_i = x^T W_i y + b_i

Low-rank restriction, factoring W_i = U_i V_i^T (U_i is N x D, V_i^T is D x M; rank(W_i) <= min(N, M)):
f_i = x^T W_i y + b_i = x^T U_i V_i^T y + b_i = 1^T (U_i^T x ⊙ V_i^T y) + b_i
f = P^T (U^T x ⊙ V^T y) + b
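The equivalence of the full bilinear form and the Hadamard-product form can be checked numerically; the sizes below are toy assumptions, not the model's.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, D = 8, 6, 3                          # toy sizes; D bounds the rank

x, y = rng.normal(size=N), rng.normal(size=M)
U_i, V_i = rng.normal(size=(N, D)), rng.normal(size=(M, D))

# Full bilinear form with the factored weight W_i = U_i V_i^T ...
W_i = U_i @ V_i.T
full = x @ W_i @ y

# ... equals the Hadamard-product form 1^T (U_i^T x (*) V_i^T y),
# which never materializes the N x M matrix W_i.
low_rank = np.sum((U_i.T @ x) * (V_i.T @ y))

print(abs(full - low_rank))
```

The Hadamard form is what makes the pooling cheap: two projections of size D and an element-wise product replace an N x M bilinear weight per output.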
Low-rank Bilinear Pooling
Kim et al., arXiv 2016
A single linear projection W^T x yields weighted sums Σ w_i x_i. Projecting both modalities and taking the Hadamard product of (W_x^T x) and (W_y^T y), each output element becomes a sum of bilinear terms Σ Σ w_i w_l x_i y_l. In other words, the Hadamard product (element-wise multiplication) realizes bilinear interactions between the two inputs.
Recent Works
Multimodal Low-rank Bilinear Attention Networks (MLB)
MLB Attention Networks
Kim et al., arXiv 2016
[Figure: the question Q (Linear-Tanh, replicated over spatial locations) and the image V (Conv-Tanh) are merged by low-rank bilinear pooling; Conv-Softmax produces attention weights over locations, the attended visual feature is pooled, and a second low-rank bilinear pooling with Linear-Softmax predicts the answer A.]
MLB Attention Networks (MLB)
Kim et al., arXiv 2016
Under review as a conference paper at ICLR 2017

Table 2: The VQA test-standard results to compare with state-of-the-art. Notice that these results are trained on the provided VQA train and validation splits, without any data augmentation.

                                                   Open-Ended              MC
MODEL                                         ALL    Y/N    NUM    ETC    ALL
iBOWIMG (Zhou et al., 2015)                  55.89  76.76  34.98  42.62  61.97
DPPnet (Noh et al., 2015)                    57.36  80.28  36.92  42.24  62.69
Deeper LSTM+Normalized CNN (Lu et al., 2015) 58.16  80.56  36.53  43.73  63.09
SMem (Xu & Saenko, 2016)                     58.24  80.80  37.53  43.48  -
Ask Your Neuron (Malinowski et al., 2016)    58.43  78.24  36.27  46.32  -
SAN (Yang et al., 2015)                      58.85  79.11  36.41  46.42  -
D-NMN (Andreas et al., 2016)                 59.44  80.98  37.48  45.81  -
ACK (Wu et al., 2016b)                       59.44  81.07  37.12  45.83  -
FDA (Ilievski et al., 2016)                  59.54  81.34  35.67  46.10  64.18
HYBRID (Kafle & Kanan, 2016b)                60.06  80.34  37.82  47.56  -
DMN+ (Xiong et al., 2016)                    60.36  80.43  36.82  48.33  -
MRN (Kim et al., 2016b)                      61.84  82.39  38.23  49.41  66.33
HieCoAtt (Lu et al., 2016)                   62.06  79.95  38.22  51.95  66.07
RAU (Noh & Han, 2016)                        63.2   81.7   38.2   52.8   67.3
MLB (ours)                                   65.07  84.02  37.90  54.77  68.89
The rate of the divided answers is approximately 16.40%, and only 0.23% of questions have more than two divided answers in the VQA dataset. We assume that this eases the difficulty of convergence without severe degradation of performance.

Shortcut Connection The performance contribution of shortcut connections for residual learning is explored. This experiment is conducted based on the observation of the competitive performance of the single-block layered model, since the usefulness of shortcut connections is linked to the network depth (He et al., 2015).

Data Augmentation Data augmentation with Visual Genome (Krishna et al., 2016) question answer annotations is explored. Visual Genome originally provides 1.7 million visual question answer annotations. After aligning to VQA, the valid number of question-answering pairs for training is 837,298, for 99,280 distinct images.
6 RESULTS

Six experiments are conducted sequentially to narrow down architectural choices. Each experiment determines the experimental variables one by one. Refer to Table 1, which has six sectors divided by mid-rules.

6.1 SIX EXPERIMENT RESULTS

Number of Learning Blocks Though MRN (Kim et al., 2016b) has a three-block layered architecture, MARN shows the best performance with two-block layered models (63.92%). For the multiple-glimpse models in the next experiment, we choose the one-block layered model for its simplicity to extend and competitive performance (63.79%).

Number of Glimpses Compared with the results of Fukui et al. (2016), four-glimpse MARN (64.61%) is better than other comparative models. However, for a parsimonious choice, two-glimpse MARN (64.53%) is chosen for later experiments. We speculate that multiple glimpses are one of the key factors in the competitive performance of MCB (Fukui et al., 2016), based on a large margin in accuracy compared to one-glimpse MARN (63.79%).
MLB Attention Networks (MLB)
Kim et al., arXiv 2016
Under review as a conference paper at ICLR 2017

Non-Linearity The results confirm that activation functions are useful for improving performance. Surprisingly, there is no empirical difference between the two options, before-Hadamard product and after-Hadamard product. This result may build a bridge to studies on multiplicative integration with recurrent neural networks (Wu et al., 2016c).

Answer Sampling Sampled answers (64.80%) yield better performance than mode answers (64.53%). This confirms that the distribution of answers from annotators can be used to improve performance. However, the number of multiple answers is usually limited due to the cost of data collection.

Shortcut Connection Though MRN (Kim et al., 2016b) effectively uses shortcut connections to improve model performance, the one-block layered MARN shows better performance without the shortcut connection. In other words, residual learning is not used in our proposed model, MLB. It seems that there is a trade-off between introducing an attention mechanism and residual learning. We leave a careful study of this trade-off for future work.

Data Augmentation Data augmentation using Visual Genome (Krishna et al., 2016) question answer annotations significantly improves performance by 0.76% in accuracy for the VQA test-dev split. Especially, the accuracy of others (ETC)-type answers is notably improved by the data augmentation.

6.2 COMPARISON WITH STATE-OF-THE-ART

The comparison with other single models on VQA test-standard is shown in Table 2. The overall accuracy of our model is approximately 1.9% above the next best model (Noh & Han, 2016) on the Open-Ended task of VQA. The major improvements are from yes-or-no (Y/N) and others (ETC)-type answers. In Table 3, we also report the accuracy of our ensemble model to compare with other ensemble models on VQA test-standard, which won 1st to 5th places in VQA Challenge 2016. We beat the previous state-of-the-art by a margin of 0.42%.
Table 3: The VQA test-standard results for ensemble models to compare with state-of-the-art. For unpublished entries, their team names are used instead of their model names. Some of their figures are updated after the challenge.

                               Open-Ended              MC
MODEL                     ALL    Y/N    NUM    ETC    ALL
RAU (Noh & Han, 2016)    64.12  83.33  38.02  53.37  67.34
MRN (Kim et al., 2016b)  63.18  83.16  39.14  51.33  67.54
DLAIT (not published)    64.83  83.23  40.80  54.32  68.30
Naver Labs (not published) 64.79 83.31 38.70  54.79  69.26
MCB (Fukui et al., 2016) 66.47  83.24  39.47  58.00  70.10
MLB (ours)               66.89  84.61  39.07  57.79  70.29
Human (Antol et al., 2015) 83.30 95.77 83.39  72.67  91.54
7 RELATED WORKS

7.1 COMPACT BILINEAR POOLING

Compact bilinear pooling (Gao et al., 2015) approximates full bilinear pooling using a sampling-based computation, Tensor Sketch Projection (Charikar et al., 2002; Pham & Pagh, 2013):

Ψ(x ⊗ y, h, s) = Ψ(x, h, s) ∗ Ψ(y, h, s)                       (15)
             = FFT⁻¹(FFT(Ψ(x, h, s)) ⊙ FFT(Ψ(y, h, s)))        (16)

Footnote 2: http://visualqa.org/challenge.html
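The Tensor Sketch computation can be sketched in NumPy: a Count Sketch per input, then a circular convolution of the two sketches done in the frequency domain. The input and sketch dimensions here are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 16, 32                              # toy input and sketch dimensions

def make_count_sketch(n, D):
    """Count Sketch with a fixed random hash h and signs s (Charikar et al.)."""
    h = rng.integers(0, D, size=n)
    s = rng.choice([-1.0, 1.0], size=n)
    def psi(x):
        out = np.zeros(D)
        np.add.at(out, h, s * x)           # scatter-add the signed entries
        return out
    return psi

psi_x, psi_y = make_count_sketch(n, D), make_count_sketch(n, D)
x, y = rng.normal(size=n), rng.normal(size=n)
a, b = psi_x(x), psi_y(y)

# Tensor Sketch: the sketch of the outer product x (x) y is the circular
# convolution of the two individual sketches, computed via FFT.
sketch = np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=D)

# Same result by direct circular convolution, as a correctness check.
direct = np.array([sum(a[j] * b[(k - j) % D] for j in range(D)) for k in range(D)])
print(np.max(np.abs(sketch - direct)))
```

The FFT route costs O(D log D) instead of O(D^2) for the explicit convolution, which is the point of the approximation: the n^2-dimensional outer product is never formed.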
Recent Works: DEMO
DEMO
Q: Wait, what is this? (Korean: 아니 이게 뭐야?)
A: It is a refrigerator. (Korean: 냉장고 입니다.)
Q&A
Thank You