Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Top-Down Networks: A coarse-to-fine reimagination of CNNsSupplementary material
Ioannis Lelekas Nergis Tomen Silvia L. Pintea Jan C. van Gemert
Computer Vision LabDelft University of Technology, Netherlands
1. Exp 2: Adversarial robustnessThe entire set of results for the adversarial robustness ex-
periment is provided in figure 1. “ShiftsAttack” is a variantof Spatial attack [2], introducing only spatial shifts. TDnetworks exhibit enhanced robustness against attacks in-troducing correlated/uncorrelated noise, as well as againstblurring attacks.
Figure 2 presents the robustness results for theCIFAR10-augmented and the ResNet32 architecture vari-ants. Clearly, TDuni and TDrev variants exhibit enhancedrobustness against spatial attacks, however, they also havesimilar to the BU behaviour against other attacks. This canbe attributed to the increased number of filters at greaterdepth of the network, or equivalently increased scale of fea-ture maps, thus greater contribution of the finer scales to thefinal output. However, finer scales are much more vulner-able against attacks. All in all, the reversal of the BU net-work for the extraction of the TD variant is not solely effi-ciency driven, keeping a roughly fixed computational com-plexity across layers, but also contributes to the network’srobustness as well. Finally, we need to mention that the re-spective figure for the non-augmented CIFAR10 case tellsthe same story.
Next, figure 3 presents the respective results for reintro-ducing the perturbation to two of the inputs of the network.Clearly, the highest and medium scale inputs are the mostvulnerable ones, except for the simpler case of the MNISTdataset. The absence or scarce information in the high fre-quency region, yields the medium and lowest scale inputsas the ones with the highest impact.
2. Exp 3.(a): Explainability2.1. Imagenette training
Imagenette [5] is a 10-class sub-problem of ImageNet[1], allowing experimentation with a more realistic task,without the high training times and computational costs re-quired for training on a large scale dataset. A set of exam-
ples, along with their corresponding labels are provided infigure 4. The datasets contains a total of 9469, 3925 train-ing and validation samples respectively. Training sampleswere resized to 156 × 156, from where random 128 × 128crops were extracted; validation samples were resized to128× 128.
We utilized a lighter version1 of the ResNet18 architec-ture introduced in [3] for Imagenette training, as this is a 10-class sub-problem, incorporating the pre-activation unit of[4]. Additionally, the stride s and the kernel extent k of thefirst convolution for depth initialization were set to s = 1and k = 3 respectively. Regarding training, a 128 × 128crop is extracted from the original image, or its horizontalflip, while subtracting the per-pixel mean [6]; the color aug-mentation of [6] is also used. For the BU network a batchsize of 128 is used and the network is trained for a total of 50epochs with a starting learning rate of 0.1. As for the TD,increased memory footprint led to the reduction of the batchsize to 64 and the adaptation of the starting learning rateand the total epochs to 0.05 and 80. We trained with SGDwith momentum of 0.9 and a weight decay of 0.001; wealso adopted a 3-stage learning rate decay scheme, wherethe learning rate is divided by 10 at 50% and 80% of the to-tal epochs. Regarding performance, BU outperformed theTD variant by roughly 4%. Grad-CAM is finally utilizedfor generating class-discriminate localization maps of themost informative features.
2.2. Grad-CAM heatmap visualizations
Figure 5 displays some additional Grad-CAM visualiza-tions. The visualizations are obtained by using a ResNet18architecture for the BU networks and its corresponding TDvariant. The original images are taken from the Imagenettedataset [5].
The TD model provides localized activations, focusingon certain informative aspects of the image, while the BUmodel focuses on large connected areas. Because of this
1dividing the filters of the original architecture by 2.
1
0.0 0.2 0.4 0.6 0.8 1.00
20
40
60
80
100
Sin
gle
Pix
elA
ttack
MNIST
LeNetFC
LeNetFC_TD
NIN-light
NIN-light_TD
0.0 0.2 0.4 0.6 0.8 1.0
Fashion-MNIST
LeNetFC
LeNetFC_TD
NIN-light
NIN-light_TD
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75
CIFAR10
ResNet32
ResNet32_TD
NIN
NIN_TD
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75
CIFAR10_aug
ResNet32
ResNet32_TD
NIN
NIN_TD
0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.00
20
40
60
80
100
Salt
&Pep
perN
ois
eA
ttack
0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 0 5 10 15 20 25 30 0 5 10 15 20 25 30
0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.00
20
40
60
80
100
Poin
twis
eA
ttack
0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 0 2 4 6 8 10 12 14 0 5 10 15 20
0 2 4 6 8 10 120
20
40
60
80
100
Ad
dit
iveG
auss
ianN
ois
eA
ttack
0 2 4 6 8 10 0 5 10 15 20 0 5 10 15 20
0 2 4 6 8 10 120
20
40
60
80
100
Ad
dit
iveU
niform
Nois
eA
ttack
0 2 4 6 8 10 12 0 5 10 15 20 0 5 10 15 20
0 2 4 6 8 10 120
20
40
60
80
100
Ble
nd
ed
Uniform
Nois
eA
ttack
0 2 4 6 8 10 12 14 0 5 10 15 20 25 0 5 10 15 20 25
0 2 4 6 8 100
20
40
60
80
100
Gauss
ianB
lurA
ttack
0 2 4 6 8 10 12 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 0 5 10 15 20
0 2 4 6 80
20
40
60
80
100
Contr
ast
Red
uct
ionA
ttack
0 2 4 6 8 10 12 0 5 10 15 20 25 0 5 10 15 20 25
0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.50
20
40
60
80
100
Sp
ati
alA
ttack
0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 0 5 10 15 20 25 30 35 40 0 5 10 15 20 25 30 35
0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5
L2 bound
0
20
40
60
80
100
Shifts
Att
ack
0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5
L2 bound0 5 10 15 20 25 30 35 40
L2 bound0 5 10 15 20 25 30 35
L2 bound
corr
ela
ted
unco
rrela
ted
blu
rrin
gsp
ati
al
Figure 1. Exp:2 Complete set of results for the second experiment. Plots of test accuracy loss versus the L2 distance between origi-nal and perturbed input, where each column corresponds to a different task. TD networks exhibit enhanced robustness against corre-lated/uncorrelated noise and blurring attacks.
2
0.0 0.5 1.0 1.50
20
40
60
80
100
Test
acc
urac
y lo
ss (%
)
SinglePixelAttackResNet32ResNet32_TDResNet32_TD_uniResNet32_TD_rev
0 10 20 30
Salt&PepperNoiseAttack
0 5 10 15
PointwiseAttack
0 5 10 15
AdditiveGaussianNoiseAttack
0 5 10 15 20
AdditiveUniformNoiseAttack
0 10 20L2 bound
0
20
40
60
80
100
Test
acc
urac
y lo
ss (%
)
BlendedUniformNoiseAttack
0 5 10 15 20L2 bound
GaussianBlurAttack
0 10 20L2 bound
ContrastReductionAttack
0 10 20 30 40L2 bound
SpatialAttack
0 10 20 30L2 bound
ShiftsAttack
Figure 2. Exp 2: Test accuracy loss versus the L2 distance between original and perturbed input, for the CIFAR10-augmented and theResNet32 architectures. Robustness is enhanced for the spatial attacks, but in general TDuni and TDrev variants exhibit similar behaviourto the BU baseline, which can be attributed to the increased filters at deeper layers.
high-medium high-low medium-low0
20
40
60
80
100
Test
acc
uray
(%)
NIN-light_TD MNIST
OriginalSinglePixelAttackSalt&PepperNoiseAttackPointwiseAttackAdditiveGaussianNoiseAttackAdditiveUniformNoiseAttackBlendedUniformNoiseAttackGaussianBlurAttackContrastReductionAttackSpatialAttackShiftsAttack
high-medium high-low medium-low
NIN-light_TD Fashion-MNIST
high-medium high-low medium-low
NIN_TD CIFAR10_aug
Figure 3. Exp 2: Reintroducing perturbations to two of the inputs of TD model when using a NIN-light backbone for MNIST and Fashion-MNIST, and the NIN backbone for CIFAR10. Clearly, perturbing the two highest scale inputs, “high-medium” has the highest impact.Regarding the case of the simpler MNIST and the information gathered in the low to mid frequency region, the medium and the lowestscale input have the highest impact instead.
difference we believe the TD model may be more precisethan the BU model for tasks such as weakly-supervised ob-ject detection.
References
[1] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and LiFei-Fei. Imagenet: A large-scale hierarchical image database.In Conference on Computer Vision and Pattern Recognition,2009.
[2] Logan Engstrom, Brandon Tran, Dimitris Tsipras, LudwigSchmidt, and Aleksander Madry. Exploring the landscape ofspatial robustness. CoRR, 2017.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.Deep residual learning for image recognition. In Proceed-ings of the IEEE conference on Computer Vision and PatternRecognition, pages 770–778, 2016.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.Identity mappings in deep residual networks. In Europeanconference on Computer Vision, pages 630–645, 2016.
[5] FastAI Jeremy Howard. The imagenette dataset. https://github.com/fastai/imagenette.
[6] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Im-agenet classification with deep convolutional neural networks.In Advances in neural information processing systems, 2012.
3
0
20
40
60
80
100
120
French horn parachute cassette player' golf ball'
0
20
40
60
80
100
120
English springer golf ball' church garbage truck
0
20
40
60
80
100
120
gas pump cassette player' chain saw tench
0 50 100
0
20
40
60
80
100
120
French horn
0 50 100
gas pump
0 50 100
tench
0 50 100
French horn
Figure 4. Exp 3.(a): Validation samples from the Imagenette dataset [5], along with their corresponding ground truth labels. Samples areresized to 128× 128.
4
Input
BU
TD
Input
BU
TD
Figure 5. Exp 3.(a): Grad-CAM heatmaps visualization on validation images from the Imagenette dataset [5], using a ResNet18 architec-ture for BU and its corresponding TD variant. All images are correctly classified. Rows 1 and 4 show the original Imagenette images;rows 2 and 5 show the BU heatmaps, while rows 3 and 6 visualize the TD heatmaps. Focusing on local information rather than globalinformation, may help the TD to be more precise for object detection than the BU model.
5