5
Top-Down Networks: A coarse-to-fine reimagination of CNNs Supplementary material Ioannis Lelekas Nergis Tomen Silvia L. Pintea Jan C. van Gemert Computer Vision Lab Delft University of Technology, Netherlands 1. Exp 2: Adversarial robustness The entire set of results for the adversarial robustness ex- periment is provided in figure 1. “ShiftsAttack” is a variant of Spatial attack [2], introducing only spatial shifts. TD networks exhibit enhanced robustness against attacks in- troducing correlated/uncorrelated noise, as well as against blurring attacks. Figure 2 presents the robustness results for the CIFAR10-augmented and the ResNet32 architecture vari- ants. Clearly, TD uni and TD rev variants exhibit enhanced robustness against spatial attacks, however, they also have similar to the BU behaviour against other attacks. This can be attributed to the increased number of filters at greater depth of the network, or equivalently increased scale of fea- ture maps, thus greater contribution of the finer scales to the final output. However, finer scales are much more vulner- able against attacks. All in all, the reversal of the BU net- work for the extraction of the TD variant is not solely effi- ciency driven, keeping a roughly fixed computational com- plexity across layers, but also contributes to the network’s robustness as well. Finally, we need to mention that the re- spective figure for the non-augmented CIFAR10 case tells the same story. Next, figure 3 presents the respective results for reintro- ducing the perturbation to two of the inputs of the network. Clearly, the highest and medium scale inputs are the most vulnerable ones, except for the simpler case of the MNIST dataset. The absence or scarce information in the high fre- quency region, yields the medium and lowest scale inputs as the ones with the highest impact. 2. Exp 3.(a): Explainability 2.1. Imagenette training Imagenette [5] is a 10-class sub-problem of ImageNet [1], allowing experimentation with a more realistic task, without the high training times and computational costs re- quired for training on a large scale dataset. A set of exam- ples, along with their corresponding labels are provided in figure 4. The datasets contains a total of 9469, 3925 train- ing and validation samples respectively. Training samples were resized to 156 × 156, from where random 128 × 128 crops were extracted; validation samples were resized to 128 × 128. We utilized a lighter version 1 of the ResNet18 architec- ture introduced in [3] for Imagenette training, as this is a 10- class sub-problem, incorporating the pre-activation unit of [4]. Additionally, the stride s and the kernel extent k of the first convolution for depth initialization were set to s =1 and k =3 respectively. Regarding training, a 128 × 128 crop is extracted from the original image, or its horizontal flip, while subtracting the per-pixel mean [6]; the color aug- mentation of [6] is also used. For the BU network a batch size of 128 is used and the network is trained for a total of 50 epochs with a starting learning rate of 0.1. As for the TD, increased memory footprint led to the reduction of the batch size to 64 and the adaptation of the starting learning rate and the total epochs to 0.05 and 80. We trained with SGD with momentum of 0.9 and a weight decay of 0.001; we also adopted a 3-stage learning rate decay scheme, where the learning rate is divided by 10 at 50% and 80% of the to- tal epochs. Regarding performance, BU outperformed the TD variant by roughly 4%. Grad-CAM is finally utilized for generating class-discriminate localization maps of the most informative features. 2.2. Grad-CAM heatmap visualizations Figure 5 displays some additional Grad-CAM visualiza- tions. The visualizations are obtained by using a ResNet18 architecture for the BU networks and its corresponding TD variant. The original images are taken from the Imagenette dataset [5]. The TD model provides localized activations, focusing on certain informative aspects of the image, while the BU model focuses on large connected areas. Because of this 1 dividing the filters of the original architecture by 2. 1

Top-Down Networks: A coarse-to-fine reimagination of CNNs … · 2020-06-11 · Delft University of Technology, Netherlands 1. Exp 2: Adversarial robustness The entire set of results

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Top-Down Networks: A coarse-to-fine reimagination of CNNs … · 2020-06-11 · Delft University of Technology, Netherlands 1. Exp 2: Adversarial robustness The entire set of results

Top-Down Networks: A coarse-to-fine reimagination of CNNsSupplementary material

Ioannis Lelekas Nergis Tomen Silvia L. Pintea Jan C. van Gemert

Computer Vision LabDelft University of Technology, Netherlands

1. Exp 2: Adversarial robustnessThe entire set of results for the adversarial robustness ex-

periment is provided in figure 1. “ShiftsAttack” is a variantof Spatial attack [2], introducing only spatial shifts. TDnetworks exhibit enhanced robustness against attacks in-troducing correlated/uncorrelated noise, as well as againstblurring attacks.

Figure 2 presents the robustness results for theCIFAR10-augmented and the ResNet32 architecture vari-ants. Clearly, TDuni and TDrev variants exhibit enhancedrobustness against spatial attacks, however, they also havesimilar to the BU behaviour against other attacks. This canbe attributed to the increased number of filters at greaterdepth of the network, or equivalently increased scale of fea-ture maps, thus greater contribution of the finer scales to thefinal output. However, finer scales are much more vulner-able against attacks. All in all, the reversal of the BU net-work for the extraction of the TD variant is not solely effi-ciency driven, keeping a roughly fixed computational com-plexity across layers, but also contributes to the network’srobustness as well. Finally, we need to mention that the re-spective figure for the non-augmented CIFAR10 case tellsthe same story.

Next, figure 3 presents the respective results for reintro-ducing the perturbation to two of the inputs of the network.Clearly, the highest and medium scale inputs are the mostvulnerable ones, except for the simpler case of the MNISTdataset. The absence or scarce information in the high fre-quency region, yields the medium and lowest scale inputsas the ones with the highest impact.

2. Exp 3.(a): Explainability2.1. Imagenette training

Imagenette [5] is a 10-class sub-problem of ImageNet[1], allowing experimentation with a more realistic task,without the high training times and computational costs re-quired for training on a large scale dataset. A set of exam-

ples, along with their corresponding labels are provided infigure 4. The datasets contains a total of 9469, 3925 train-ing and validation samples respectively. Training sampleswere resized to 156 × 156, from where random 128 × 128crops were extracted; validation samples were resized to128× 128.

We utilized a lighter version1 of the ResNet18 architec-ture introduced in [3] for Imagenette training, as this is a 10-class sub-problem, incorporating the pre-activation unit of[4]. Additionally, the stride s and the kernel extent k of thefirst convolution for depth initialization were set to s = 1and k = 3 respectively. Regarding training, a 128 × 128crop is extracted from the original image, or its horizontalflip, while subtracting the per-pixel mean [6]; the color aug-mentation of [6] is also used. For the BU network a batchsize of 128 is used and the network is trained for a total of 50epochs with a starting learning rate of 0.1. As for the TD,increased memory footprint led to the reduction of the batchsize to 64 and the adaptation of the starting learning rateand the total epochs to 0.05 and 80. We trained with SGDwith momentum of 0.9 and a weight decay of 0.001; wealso adopted a 3-stage learning rate decay scheme, wherethe learning rate is divided by 10 at 50% and 80% of the to-tal epochs. Regarding performance, BU outperformed theTD variant by roughly 4%. Grad-CAM is finally utilizedfor generating class-discriminate localization maps of themost informative features.

2.2. Grad-CAM heatmap visualizations

Figure 5 displays some additional Grad-CAM visualiza-tions. The visualizations are obtained by using a ResNet18architecture for the BU networks and its corresponding TDvariant. The original images are taken from the Imagenettedataset [5].

The TD model provides localized activations, focusingon certain informative aspects of the image, while the BUmodel focuses on large connected areas. Because of this

1dividing the filters of the original architecture by 2.

1

Page 2: Top-Down Networks: A coarse-to-fine reimagination of CNNs … · 2020-06-11 · Delft University of Technology, Netherlands 1. Exp 2: Adversarial robustness The entire set of results

0.0 0.2 0.4 0.6 0.8 1.00

20

40

60

80

100

Sin

gle

Pix

elA

ttack

MNIST

LeNetFC

LeNetFC_TD

NIN-light

NIN-light_TD

0.0 0.2 0.4 0.6 0.8 1.0

Fashion-MNIST

LeNetFC

LeNetFC_TD

NIN-light

NIN-light_TD

0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75

CIFAR10

ResNet32

ResNet32_TD

NIN

NIN_TD

0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75

CIFAR10_aug

ResNet32

ResNet32_TD

NIN

NIN_TD

0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.00

20

40

60

80

100

Salt

&Pep

perN

ois

eA

ttack

0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 0 5 10 15 20 25 30 0 5 10 15 20 25 30

0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.00

20

40

60

80

100

Poin

twis

eA

ttack

0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 0 2 4 6 8 10 12 14 0 5 10 15 20

0 2 4 6 8 10 120

20

40

60

80

100

Ad

dit

iveG

auss

ianN

ois

eA

ttack

0 2 4 6 8 10 0 5 10 15 20 0 5 10 15 20

0 2 4 6 8 10 120

20

40

60

80

100

Ad

dit

iveU

niform

Nois

eA

ttack

0 2 4 6 8 10 12 0 5 10 15 20 0 5 10 15 20

0 2 4 6 8 10 120

20

40

60

80

100

Ble

nd

ed

Uniform

Nois

eA

ttack

0 2 4 6 8 10 12 14 0 5 10 15 20 25 0 5 10 15 20 25

0 2 4 6 8 100

20

40

60

80

100

Gauss

ianB

lurA

ttack

0 2 4 6 8 10 12 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 0 5 10 15 20

0 2 4 6 80

20

40

60

80

100

Contr

ast

Red

uct

ionA

ttack

0 2 4 6 8 10 12 0 5 10 15 20 25 0 5 10 15 20 25

0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.50

20

40

60

80

100

Sp

ati

alA

ttack

0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 0 5 10 15 20 25 30 35 40 0 5 10 15 20 25 30 35

0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5

L2 bound

0

20

40

60

80

100

Shifts

Att

ack

0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5

L2 bound0 5 10 15 20 25 30 35 40

L2 bound0 5 10 15 20 25 30 35

L2 bound

corr

ela

ted

unco

rrela

ted

blu

rrin

gsp

ati

al

Figure 1. Exp:2 Complete set of results for the second experiment. Plots of test accuracy loss versus the L2 distance between origi-nal and perturbed input, where each column corresponds to a different task. TD networks exhibit enhanced robustness against corre-lated/uncorrelated noise and blurring attacks.

2

Page 3: Top-Down Networks: A coarse-to-fine reimagination of CNNs … · 2020-06-11 · Delft University of Technology, Netherlands 1. Exp 2: Adversarial robustness The entire set of results

0.0 0.5 1.0 1.50

20

40

60

80

100

Test

acc

urac

y lo

ss (%

)

SinglePixelAttackResNet32ResNet32_TDResNet32_TD_uniResNet32_TD_rev

0 10 20 30

Salt&PepperNoiseAttack

0 5 10 15

PointwiseAttack

0 5 10 15

AdditiveGaussianNoiseAttack

0 5 10 15 20

AdditiveUniformNoiseAttack

0 10 20L2 bound

0

20

40

60

80

100

Test

acc

urac

y lo

ss (%

)

BlendedUniformNoiseAttack

0 5 10 15 20L2 bound

GaussianBlurAttack

0 10 20L2 bound

ContrastReductionAttack

0 10 20 30 40L2 bound

SpatialAttack

0 10 20 30L2 bound

ShiftsAttack

Figure 2. Exp 2: Test accuracy loss versus the L2 distance between original and perturbed input, for the CIFAR10-augmented and theResNet32 architectures. Robustness is enhanced for the spatial attacks, but in general TDuni and TDrev variants exhibit similar behaviourto the BU baseline, which can be attributed to the increased filters at deeper layers.

high-medium high-low medium-low0

20

40

60

80

100

Test

acc

uray

(%)

NIN-light_TD MNIST

OriginalSinglePixelAttackSalt&PepperNoiseAttackPointwiseAttackAdditiveGaussianNoiseAttackAdditiveUniformNoiseAttackBlendedUniformNoiseAttackGaussianBlurAttackContrastReductionAttackSpatialAttackShiftsAttack

high-medium high-low medium-low

NIN-light_TD Fashion-MNIST

high-medium high-low medium-low

NIN_TD CIFAR10_aug

Figure 3. Exp 2: Reintroducing perturbations to two of the inputs of TD model when using a NIN-light backbone for MNIST and Fashion-MNIST, and the NIN backbone for CIFAR10. Clearly, perturbing the two highest scale inputs, “high-medium” has the highest impact.Regarding the case of the simpler MNIST and the information gathered in the low to mid frequency region, the medium and the lowestscale input have the highest impact instead.

difference we believe the TD model may be more precisethan the BU model for tasks such as weakly-supervised ob-ject detection.

References

[1] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and LiFei-Fei. Imagenet: A large-scale hierarchical image database.In Conference on Computer Vision and Pattern Recognition,2009.

[2] Logan Engstrom, Brandon Tran, Dimitris Tsipras, LudwigSchmidt, and Aleksander Madry. Exploring the landscape ofspatial robustness. CoRR, 2017.

[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.Deep residual learning for image recognition. In Proceed-ings of the IEEE conference on Computer Vision and PatternRecognition, pages 770–778, 2016.

[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.Identity mappings in deep residual networks. In Europeanconference on Computer Vision, pages 630–645, 2016.

[5] FastAI Jeremy Howard. The imagenette dataset. https://github.com/fastai/imagenette.

[6] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Im-agenet classification with deep convolutional neural networks.In Advances in neural information processing systems, 2012.

3

Page 4: Top-Down Networks: A coarse-to-fine reimagination of CNNs … · 2020-06-11 · Delft University of Technology, Netherlands 1. Exp 2: Adversarial robustness The entire set of results

0

20

40

60

80

100

120

French horn parachute cassette player' golf ball'

0

20

40

60

80

100

120

English springer golf ball' church garbage truck

0

20

40

60

80

100

120

gas pump cassette player' chain saw tench

0 50 100

0

20

40

60

80

100

120

French horn

0 50 100

gas pump

0 50 100

tench

0 50 100

French horn

Figure 4. Exp 3.(a): Validation samples from the Imagenette dataset [5], along with their corresponding ground truth labels. Samples areresized to 128× 128.

4

Page 5: Top-Down Networks: A coarse-to-fine reimagination of CNNs … · 2020-06-11 · Delft University of Technology, Netherlands 1. Exp 2: Adversarial robustness The entire set of results

Input

BU

TD

Input

BU

TD

Figure 5. Exp 3.(a): Grad-CAM heatmaps visualization on validation images from the Imagenette dataset [5], using a ResNet18 architec-ture for BU and its corresponding TD variant. All images are correctly classified. Rows 1 and 4 show the original Imagenette images;rows 2 and 5 show the BU heatmaps, while rows 3 and 6 visualize the TD heatmaps. Focusing on local information rather than globalinformation, may help the TD to be more precise for object detection than the BU model.

5