On the Connection Between Adversarial Robustness and Saliency Map Interpretability

Christian Etmann *1 2   Sebastian Lunz *3   Peter Maass 1   Carola-Bibiane Schönlieb 3

Abstract

Recent studies on the adversarial vulnerability of neural networks have shown that models trained to be more robust to adversarial attacks exhibit more interpretable saliency maps than their non-robust counterparts. We aim to quantify this behavior by considering the alignment between input image and saliency map. We hypothesize that as the distance to the decision boundary grows, so does the alignment. This connection is strictly true in the case of linear models. We confirm these theoretical findings with experiments based on models trained with a local Lipschitz regularization and identify where the non-linear nature of neural networks weakens the relation.

*Equal contribution  1Center for Industrial Mathematics, University of Bremen, Bremen, Germany  2Work done at DAMTP, Cambridge.  3DAMTP, University of Cambridge, Cambridge, United Kingdom. Correspondence to: Christian Etmann <[email protected]>, Sebastian Lunz <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Figure 1. An image of a dog (left), the saliency maps of a highly non-adversarially-robust neural network (middle) and of a more robust network (right). We observe that the robust network gives a much clearer indication of what the classifier deems to be discriminative features. Details about saliency and the robustification are given in section 4. Most figures are best viewed on a screen.

1. Introduction

Despite impressive results in a variety of classification tasks (LeCun et al., 2015), even highly accurate neural network classifiers are plagued by a vulnerability to so-called adversarial perturbations (Szegedy et al., 2014). These adversarial perturbations are small, often visually imperceptible perturbations to the network's input, which however result in the network's classification decision being changed. Such vulnerabilities may pose a threat to real-world deployments of automated recognition systems, especially in security-critical applications such as autonomous driving or banking. This has sparked a large number of publications related to both the creation of adversarial attacks (Goodfellow et al., 2014; Kurakin et al., 2016; Moosavi-Dezfooli et al., 2016) as well as defenses against these (see (Schott et al., 2018) for an overview). Apart from the application-focused viewpoint, the observed adversarial vulnerability offers non-obvious insights into the inner workings of neural networks.

One particular method of defense is adversarial training (Madry et al., 2018), which aims to minimize a modified training objective. While this method – like all known approaches of defense – decreases the accuracy of the classifier, it is also successful in increasing the robustness to adversarial attacks, i.e. the perturbations need to be larger on average in order to change the classification decision.

(Tsipras et al., 2019) also notice that networks that are robustified in this way show interesting phenomena, which so far could not be explained. Neural networks usually exhibit very unstructured saliency maps (gradients of a classifier score with respect to the network's input (Simonyan et al., 2013)) which barely relate to the input image. On the other hand, saliency maps of robustified classifiers tend to be far more interpretable, in that structures in the input image also emerge in the corresponding saliency map, as exemplified in Figure 1. (Tsipras et al., 2019) describe this as an 'unexpected benefit' of adversarial robustness. In order to obtain a semantically meaningful visualization of the network's classification decision in non-robustified networks, the saliency map has to be aggregated over many different points in the vicinity of the input image. This can be achieved either via averaging saliency maps of noisy versions of the image (Smilkov et al., 2017) or by integrating along a path (Sundararajan et al., 2017). Other approaches typically employ modified backpropagation schemes in order to highlight the discriminative portions of the image. Examples of this include guided backpropagation (Springenberg et al., 2015)


and deep Taylor decomposition (Montavon et al., 2017). In this paper, we show that the interpretability of the saliency maps of a robustified neural network is not only a side-effect of adversarial training, but a general property enjoyed by networks with a high degree of robustness to adversarial perturbations. We first demonstrate this principle for the case of a linear, binary classifier and show that the 'interpretability' is due to the image vector and the respective image gradient aligning. For the more general, non-linear case we empirically show that while this relationship is true on average, the linear theory and the non-linear reality do not always agree. We empirically demonstrate that the more linear the model is, the stronger the connection between robustness and alignment becomes.

2. Adversarial Robustness and Saliency Maps

Since adversarial perturbations are small perturbations that change the predicted class of a neural network, it makes sense to define the robustness towards adversarial perturbations via the distance of the unperturbed image to its nearest perturbed image for which the classification is changed.

Definition 1. Let F : X → C (with C finite) be a classifier over the normed vector space (X, ‖·‖). We call

ρ(x) = inf_{e∈X} { ‖e‖ : F(x + e) ≠ F(x) }    (1)

the (adversarial) robustness of F in the point x. We call E_{x∼D}[ρ(x)] the (adversarial) robustness of F over the distribution D.

Put differently, the robustness of a classifier in a point is nothing but the distance to its closest decision boundary. Margin classifiers like support vector machines (Cortes & Vapnik, 1995) seek to keep this distance large for the training set, usually in order to avoid overfitting. (Sokolic et al., 2017) and (Elsayed et al., 2018) also apply this principle to neural networks via regularization schemes. We point out that our definition of adversarial robustness does not depend on the ground truth class label and – given feasible computability – can approximately be calculated even on unlabelled data.

In the following, we will always assume X to be a real, finite-dimensional vector space with the Euclidean norm. The proofs for the following theoretical statements are found in the appendix.

2.1. A Motivating Toy Example

We consider the toy case of a linear binary classifier F(x) = sgn(Ψ_z(x)) with the so-called score function Ψ_z(x) = 〈x, z〉 and fixed z ≠ 0, where 〈·, ·〉 denotes the standard inner product on R^m. A straightforward calculation (see appendix) shows that the adversarial robustness of F is given by

ρ(x) = |〈x, z〉| / ‖z‖ = |〈x, ∇Ψ_z(x)〉| / ‖∇Ψ_z(x)‖.    (2)

Unless stated otherwise, we will always denote with ∇ the gradient with respect to x. Note that ρ(x) = ‖x‖ · |cos(δ)|, where δ is the angle between the vectors x and ∇Ψ_z(x). This implies that ρ(x) grows with the alignment of x and z and is maximized if and only if x and z are collinear. This motivates the following definition.

Definition 2 (Alignment). Let the binary classifier F : X → {−1, 1} be defined a.e. by F(x) = sgn(Ψ(x)), where Ψ : X → R is differentiable in x. We then call ∇Ψ the saliency map of F with respect to Ψ in x and

α(x) := |〈x, ∇Ψ(x)〉| / ‖∇Ψ(x)‖    (3)

the alignment with respect to Ψ in x.

The alignment is a measure of how similar the input image x and the saliency map ∇Ψ(x) are. If ‖x‖ = 1, and x and ∇Ψ(x) are zero-centered, this coincides with the absolute value of their Pearson correlation. For a linear binary classifier, the alignment trivially increases with the robustness of the classifier.
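The identity between robustness and alignment in the linear case can be checked directly. The following minimal NumPy sketch (our own illustration, not code from the paper; z and x are arbitrary random stand-ins for a weight vector and an image) evaluates equations (2) and (3) and confirms that they agree:

```python
# Minimal check (ours) that rho(x) = alpha(x) for a linear binary classifier
# F(x) = sgn(<x, z>): the saliency map is simply z.
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=784)   # fixed weight vector of the score Psi_z(x) = <x, z>
x = rng.normal(size=784)   # an arbitrary input "image"

saliency = z                                             # grad_x <x, z> = z
rho = np.abs(x @ z) / np.linalg.norm(z)                  # distance to the decision boundary, eq. (2)
alpha = np.abs(x @ saliency) / np.linalg.norm(saliency)  # alignment, eq. (3)

print(np.isclose(rho, alpha))  # True: robustness and alignment coincide
```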

Generalizing from the linear to the affine case leads to a classifier of the form F(x) = sgn(〈x, z〉 + b), whose robustness in x is

ρ(x) = |〈x, z〉 + b| / ‖z‖.

In this case the robustness and alignment do not coincide anymore. In order to connect these two diverging concepts, we offer two alternative viewpoints. On the one hand, we can trivially bound the robustness via the triangle inequality

ρ(x) ≤ α(x) + |b| / ‖z‖.    (4)

This is particularly meaningful if |b|/‖z‖ is small in comparison to α(x). Alternatively, one can connect the robustness to the alignment at a different point ξ = x + (b/‖z‖) · z/‖z‖, leading to the relation

ρ(x) = α(ξ).    (5)

In the affine case this approach simply amounts to a shift of the data that is uniform over all data points x. We will see how these two viewpoints lead to different bounds in the non-linear case later.
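Relation (5) can likewise be verified numerically. A short NumPy sketch (ours; the values of z, b and x are arbitrary illustrations) shows that the robustness of the affine classifier at x equals the alignment at the shifted point ξ:

```python
# Check (ours) of eq. (5) for an affine classifier F(x) = sgn(<x, z> + b):
# shifting x by (b/||z||) * z/||z|| turns robustness into alignment at xi.
import numpy as np

rng = np.random.default_rng(1)
z, b = rng.normal(size=784), 3.7
x = rng.normal(size=784)

rho = np.abs(x @ z + b) / np.linalg.norm(z)       # robustness at x
xi = x + (b / np.linalg.norm(z)) * z / np.linalg.norm(z)
alpha_at_xi = np.abs(xi @ z) / np.linalg.norm(z)  # alignment at xi; the saliency map is still z

print(np.isclose(rho, alpha_at_xi))  # True: eq. (5)
```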


2.2. The General Case

We now consider the general, n-class case.

Definition 3 (Alignment, Multi-Class Case). Let Ψ = (Ψ_1, . . . , Ψ_n) : X → R^n be differentiable in x. Then for an n-class classifier defined a.e. by

F(x) = arg max_i Ψ_i(x),    (6)

we call ∇Ψ_{F(x)} the saliency map of F. We further call

α(x) := |〈x, ∇Ψ_{F(x)}(x)〉| / ‖∇Ψ_{F(x)}(x)‖    (7)

the alignment with respect to Ψ in x.

2.2.1. LINEARIZED ROBUSTNESS

In general the distance to the decision boundary ρ(x) can be unfeasible to compute. However, for classifiers built on locally affine score functions – such as most neural networks using ReLU or leaky ReLU activations – ρ(x) can easily be computed, provided the locally affine region is sufficiently large. To quantify this, define the radius of the locally affine component of F around x as

l(x) = sup{ r | ∀i : Ψ_i affine in B_r(x) },

where B_r(x) is the open ball of radius r around x with respect to the Euclidean metric.

Lemma 1. Let F be a classifier with locally affine score function Ψ. Assume l(x) ≥ ρ(x). Then

ρ(x) = min_{j≠i*} (Ψ_{i*}(x) − Ψ_j(x)) / ‖∇Ψ_{i*}(x) − ∇Ψ_j(x)‖,    (8)

for i* := F(x) the predicted class at x.

Similar identities were previously also independently derived in (Elsayed et al., 2018) and (Jakubovitz & Giryes, 2018).

Note that while nearly all state-of-the-art classification networks are piecewise affine, the condition l(x) ≥ ρ(x) is typically violated in practice. However, the lemma can still hold approximately as long as the linear approximation to the network's score functions is sufficiently good in the relevant neighbourhood of x. This motivates the definition of the linearized (adversarial) robustness ρ̂.

Definition 4 (Linearized Robustness). Let Ψ(x) be the differentiable score vector for the classifier F in x. We call

ρ̂(x) := min_{j≠i*} (Ψ_{i*}(x) − Ψ_j(x)) / ‖∇Ψ_{i*}(x) − ∇Ψ_j(x)‖    (9)

the linearized robustness in x, where i* := F(x) is the predicted class at point x.

We later show that the two notions lead to very similar results, even if the condition l(x) ≥ ρ(x) is violated.
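Definition 4 translates directly into code: one input gradient per class score suffices. The sketch below is our own illustration (not code from the paper) and assumes a hypothetical Keras-style `model` that maps a single input of shape (1, H, W, C) to a row of n logits:

```python
# Sketch (ours) of the linearized robustness rho_hat(x) from eq. (9).
import tensorflow as tf

def linearized_robustness(model, x):
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        scores = model(x)[0]                       # score vector Psi(x), shape (n,)
    grads = tape.jacobian(scores, x)               # gradient of every logit w.r.t. x
    grads = tf.reshape(grads, (scores.shape[0], -1))
    i_star = int(tf.argmax(scores))                # predicted class i*
    margins = []
    for j in range(int(scores.shape[0])):
        if j == i_star:
            continue
        num = scores[i_star] - scores[j]           # Psi_i*(x) - Psi_j(x)
        den = tf.norm(grads[i_star] - grads[j])    # || grad Psi_i*(x) - grad Psi_j(x) ||
        margins.append(num / den)
    return float(tf.reduce_min(tf.stack(margins)))
```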

2.2.2. REDUCING THE MULTI-CLASS CASE

In this section, we introduce a toolset which helps bridge the gap between the alignment and the linearized robustness of a multi-class classifier. In the following, for fixed x, let i* := F(x) and j* be the minimizer in (9). We can assign F in x a binarized classifier F†_x with

F†_x(y) := sgn(Ψ†_x(y)),    (10)

where Ψ†_x(y) := Ψ_{i*}(y) − Ψ_{j*}(y). Its linearized robustness in y = x is the same as for F. The binarized saliency map, ∇Ψ†_x(x) = ∇_y Ψ†_x(y)|_{y=x}, and the respective alignment,

α†(x) = |〈x, ∇(Ψ_{i*} − Ψ_{j*})(x)〉| / ‖∇(Ψ_{i*} − Ψ_{j*})(x)‖,    (11)

which we call the binarized alignment, offer an alternative, natural perspective on the above considerations. This is because for classifiers as defined in (6), the actual score values do not necessarily carry any information about the classification decision, whereas the score differences do. While, roughly speaking, ∇Ψ_{i*} tells us what F 'thinks' makes x a member of its predicted class, ∇Ψ†_x(x) carries information about what sets x apart from its closest neighboring class (according to the linearization).

In the special case of a linear, multi-class classifier, we have

ρ(x) = ρ̂(x) = α†(x),

and in the linear, binary case Ψ(x) = (〈x, z〉, −〈x, z〉), even

α(x) = α†(x).
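The binarized quantities of this section require only one additional backward pass. The following companion sketch (again ours, under the same hypothetical `model` assumption, with i* and j* taken from the computation of ρ̂(x) above) returns the binarized alignment of equation (11):

```python
# Sketch (ours) of the binarized saliency map and binarized alignment, eq. (11).
import tensorflow as tf

def binarized_alignment(model, x, i_star, j_star):
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        scores = model(x)[0]
        diff = scores[i_star] - scores[j_star]            # Psi_dagger(x) = Psi_i*(x) - Psi_j*(x)
    g_dagger = tf.reshape(tape.gradient(diff, x), (-1,))  # binarized saliency map
    x_flat = tf.reshape(x, (-1,))
    return float(tf.abs(tf.tensordot(x_flat, g_dagger, 1)) / tf.norm(g_dagger))
```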

3. Decompositions and Bounds for Neural Networks

3.1. Homogeneous Decomposition

In the previous section we have seen that in the case of binary classifiers, the robustness and binarized alignment coincide for linear score functions. However, requiring Ψ to be linear is a stronger assumption than necessary to deduce the result: It is in fact sufficient for Ψ to be positive one-homogeneous. Any such function satisfies Ψ(ax) = aΨ(x) for all a > 0 and x.

Lemma 2 (Linearized Robustness of Homogeneous Classifiers). Consider a classifier F with positive one-homogeneous score functions. Then

ρ̂(x) = α†(x).    (12)


In particular, most feedforward neural networks with (leaky) ReLU activations and without biases are positive one-homogeneous. This observation motivates splitting up any classifier built on neural networks into a homogeneous term and the corresponding remainder, leading to the following decomposition result.

Theorem 1 (Homogeneous Decomposition of Neural Networks). Let Ψ^i_{Θ,b} be any logit of a neural network with ReLU activations (of class N in the appendix). Denote by Θ the linear filters and by b the bias terms of the network. Then

Ψ^i_{Θ,b}(x) = 〈x, ∇_x Ψ^i_{Θ,b}(x)〉 + 〈b, ∇_b Ψ^i_{Θ,b}(x)〉
            = 〈x, ∇_x Ψ^i_{Θ,b}(x)〉 + Σ_k b_k ∂_{b_k} Ψ^i_{Θ,b}(x).    (13)

Note that the above vector b includes the running averages of the means for batch normalization. For ReLU networks, the remainder term β^i(x) := 〈b, ∇_b Ψ^i_{Θ,b}(x)〉 is locally constant, because it changes only when x enters another locally linear region. For ease of notation, we will now drop the subscripts Θ and b.
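Theorem 1 can be probed numerically: for a bias-free ReLU network the remainder vanishes, so Ψ^i(x) = 〈x, ∇_x Ψ^i(x)〉, while with biases the difference Ψ^i(x) − 〈x, ∇_x Ψ^i(x)〉 recovers the locally constant term β^i(x). A small TensorFlow sketch (ours, using a hypothetical toy architecture rather than the networks from the experiments):

```python
# Numerical check (ours) of the homogeneous decomposition in Theorem 1.
import tensorflow as tf

def logit_and_remainder(model, x, i):
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        psi_i = model(x)[0, i]                       # i-th logit Psi^i(x)
    g = tape.gradient(psi_i, x)                      # grad_x Psi^i(x)
    homogeneous = tf.reduce_sum(x * g)               # <x, grad_x Psi^i(x)>
    return float(psi_i), float(psi_i - homogeneous)  # (logit, remainder beta^i(x))

# Toy bias-free ReLU network: the remainder should be numerically zero.
net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", use_bias=False, input_shape=(32,)),
    tf.keras.layers.Dense(10, use_bias=False),
])
print(logit_and_remainder(net, tf.random.normal((1, 32)), 3))
```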

3.2. Pointwise Bounds

In section 2.1, we introduced two different viewpoints for affine linear, binary classifiers which connect the robustness to the alignment. In a similar vein to inequality (4) and equality (5), upper bounds to the linearized robustness depending on the alignment can be given for neural networks. In the following, we will write v̄ := v/‖v‖ for v ≠ 0. Again, in the following we fix x and write i* = F(x) and j* for the minimizer in j from equation (9).

Theorem 2. Let g := ∇Ψ_{i*}(x). Furthermore, let g† := ∇(Ψ_{i*} − Ψ_{j*})(x) and β† := β^{i*}(x) − β^{j*}(x). Then

ρ̂(x) ≤ α†(x) + |β†| / ‖g†‖    (14)
      ≤ α(x) + ‖x‖ · ‖ḡ† − ḡ‖ + |β†| / ‖g†‖.    (15)

Distances on the unit sphere (such as ‖ḡ† − ḡ‖) can be converted to angles through the law of cosines. For the above inequalities to be reasonably tight, the angle between g and g† needs to be small and |β†|/‖g†‖ needs to be small in comparison to α†(x). In this case, the alignment should roughly increase with the linearized robustness.

Theorem 3. Let ξ := x + (β†/‖g†‖) · g†/‖g†‖ and γ := ∇Ψ_{i*}(ξ), with g† and β† defined as in the previous theorem. Then

ρ̂(x) ≤ |〈ξ, γ〉| / ‖γ‖ + ‖ξ‖ · ‖ḡ† − γ̄‖,    (16)

and if additionally F(x) = F(ξ), then

ρ̂(x) ≤ α(ξ) + ‖ξ‖ · ‖ḡ† − γ̄‖.

Depending on the sign of β†, the shifted image ξ can either be understood as a gradient ascent or descent iterate for maximizing/minimizing Ψ_{i*} − Ψ_{j*}. This theorem assimilates β†(x) into x, providing an upper bound to ρ̂(x) that depends on α(ξ). Whether this is sensible hinges on ξ being reasonably close to x and on γ having a low angle with g†.

If the error terms in inequalities (14), (15) and (16) are small, these inequalities thus provide a simple illustration of why more robust networks yield more interpretable saliency maps.

Nevertheless, the right-hand side may be much larger than ρ̂(x) if an image and its respective saliency map are almost orthogonal. This is because the Cauchy-Schwarz inequality (see the proofs in the appendix) provides a large upper bound in this case. The inequalities rather serve as an explanation of how the various terms of alignment may deviate from the linearized robustness in the case of a neural network.

3.3. Alignment and Interpretability

The above considerations demonstrate how an increase in robustness may induce an increase in the alignment between an input image and its respective saliency map. The initial observation – which was previously described as an increase in interpretability – may thus be ascribed to this phenomenon. This is especially true in the case of natural images, as exemplified in Figure 1. There, what a human observer would deem an increase in interpretability expresses itself as discriminative portions of the original image reappearing in the saliency map, which naturally implies a stronger alignment. The concepts of alignment and interpretability should however not be conflated completely: In the case of quasi-binary image data like MNIST, 0-regions of the image render the inner product in equation (7) invariant with respect to the saliency map in this region, even if the saliency map e.g. assigns relevance to the absence of a feature in this region. Note however that the saliency map in this region still influences the alignment term through the division by its norm. Additionally, the alignment is also not invariant to the images' representation (color space, shifts, normalization etc.). Still, for most types of image data an increase in alignment in discriminative regions should coincide with an increase in interpretability.

4. Experiments

In order to validate our hypothesis, we trained several models of different adversarial robustness on both MNIST (LeCun et al., 1990) and ImageNet (Deng et al., 2009) using double backpropagation (Drucker & Le Cun, 1992). For a neural network f_θ with a softmax output layer, this amounts


[Figure 2: plot of the median alignment M[α(x)] against the median robustness M[ρ(x)] on ImageNet, with robustness estimated via the Gradient Attack, Projected Gradient Descent, the Carlini-Wagner attack and the linearized robustness.]

Figure 2. The median alignment increases with the median robustness of the model on ImageNet. Furthermore, the more elaborate attacks consistently find smaller adversarial perturbations than the simple gradient attack. The linearized robustness estimator provides a rather realistic estimation of the algorithmically calculated robustness.

to minimizing the modified loss

(1/N) Σ_{i=1}^{N} [ L(f_θ(x^(i)), y^(i)) + λ · ‖∇L(f_θ(x^(i)), y^(i))‖² ]    (17)

over the parameters θ = (Θ, b). Here, {(x^(i), y^(i))}_{i=1,...,N} is the training set and L denotes the negative log-likelihood error function. The hyperparameter λ ≥ 0 determines the strength of the regularization. Note that this penalizes the local Lipschitz constant of the loss. As (Simon-Gabriel et al., 2018) demonstrate, double backpropagation makes neural networks more resilient to adversarial attacks. By varying λ, we can easily create models of different adversarial robustness for the same dataset, whose properties we can then compare. (Anil et al., 2018) previously noted that Lipschitz constrained networks exhibit interpretable saliency maps (without an explanation), which can be regarded as a side-effect of the increase in adversarial robustness.

For the MNIST experiments, we trained each of our 16 models on an NVIDIA 1080Ti GPU with a batch size of 100 for 200 epochs, covering the regularization hyperparameter range from 10 to 180,000, before the models start to degenerate. The used architecture is found in the appendix.

For the experiments on ImageNet, we fine-tuned the pre-trained ResNet50 model from (He et al., 2016) over 35 epochs on 2 NVIDIA P100 GPUs with a total batch size of 32. We used stochastic gradient descent with a learning rate of 0.0001 and momentum of 0.99. The learning rate was divided by 10 whenever the error stopped improving. For the regularization parameter, we chose λ = 10^4, 10^4.5, . . . , 10^7. The experiments were implemented in Tensorflow (Abadi et al., 2015).

[Figure 3: plot of the median alignment M[α(x)] against the median robustness M[ρ(x)] on MNIST, with robustness estimated via the Gradient Attack, Projected Gradient Descent, the Carlini-Wagner attack and the linearized robustness.]

Figure 3. Similar to Figure 2, the median alignment increases with the median robustness of the model on MNIST. Towards the end, some saturation effects are visible.

4.1. Robustness and Alignment

For checking the relation between the alignment and robustness of a neural network, we created 1000 adversarial examples per model on the respective validation set. This was realized using the python library Foolbox (Rauber et al., 2017), which offers pre-defined adversarial attacks, three of which we used in this paper: The GradientAttack performs a line search for the closest adversarial example along the direction of the loss gradient. L2BasicIterativeAttack implements the projected gradient descent attack from (Kurakin et al., 2016) for the Euclidean metric. Similarly, CarliniWagnerL2Attack (CW-attack) is the attack introduced in (Carlini & Wagner, 2017) suited for finding the closest adversarial example in Euclidean metric. Additionally, we calculated the linearized robustness ρ̂(x), which entails calculating n gradients per image for an n-class problem.

In Figures 2 and 3, we investigate how the median alignment depends on the medians of the different conceptions of robustness. We opted in favor of the median (M) instead of the arithmetic mean due to its increased robustness to outliers, which occurred especially when using the gradient attack. In the case of ImageNet (Figure 2), an increase in median alignment with the median robustness is clearly visible for all three estimates of the robustness. On the other hand, the alignment for the MNIST data increases with the robustness as well, but seems to saturate at some point. We will offer an explanation for this phenomenon later.
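Per model, the pointwise quantities are then reduced to the medians shown in Figures 2 and 3. A trivial NumPy sketch (ours; all array names are hypothetical) of this aggregation, with the attack-based robustness proxied by the norm of the smallest adversarial perturbation found:

```python
# Sketch (ours) of the per-model aggregation behind Figures 2 and 3.
import numpy as np

def median_robustness_and_alignment(x_images, x_adversarials, saliency_maps):
    """All arrays have shape (N, d): validation images, their closest adversarial
    examples found by an attack, and the predicted-class saliency maps."""
    perturbation_norms = np.linalg.norm(x_adversarials - x_images, axis=1)  # ~ rho(x)
    alignments = (np.abs(np.sum(x_images * saliency_maps, axis=1))
                  / np.linalg.norm(saliency_maps, axis=1))                  # alpha(x), eq. (7)
    return np.median(perturbation_norms), np.median(alignments)
```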

We now consider the pointwise connection between robustness and alignment. In Figure 4 the two variables are highly correlated for a model trained on MNIST, pointing towards the fact that the network behaves very similarly to a positive one-homogeneous function.


[Figure 4: scatter plots of α(x) against ρ̂(x) for a model trained on ImageNet (left) and MNIST (right).]

Figure 4. The pointwise relationship between ρ̂(x) and α(x), exemplified on a model trained on ImageNet (left) and MNIST (right). While the two properties are well-correlated on MNIST (fitting the 'averaged' view from Figure 3), there is no visible correlation in the case of ImageNet.

[Figure 5: scatter plots of ρ(x) against ρ̂(x) for a model trained on ImageNet (left) and MNIST (right).]

Figure 5. The pointwise relationship between ρ̂(x) and ρ(x), each calculated for 1000 validation points on a model trained on ImageNet (left) and MNIST (right). ρ(x) was approximately calculated using the CW-attack. In both cases, the correlation is high.

There is however no visible correlation between them on the ImageNet model, which is a consistent behavior throughout the whole experiment cohort. We will later analyse the source of this behavior. The increase in median alignment for ImageNet, M[α(x)] = M[|〈x, ḡ〉|], can still be explained by a statistical argument: If M[〈x, ḡ〉] = 0, as approximately true in our ImageNet model, then M[α(x)] is the median absolute deviation of 〈x, ḡ〉. In other words, the graph for ImageNet in Figure 4 depicts the dispersion of 〈x, ḡ〉. The above observations also hold well for the binarized alignment.

In Figure 5 a tight correlation between ρ̂(x) and ρ(x) becomes evident. Here, the latter has been calculated using the CW-attack. The linearized robustness model ρ̂ is hence an adequate approximation of the actual robustness ρ, even for the highly non-linear neural network models used on ImageNet. Finally note that all used attacks lead to the same general behavior of all quantities investigated (see Figures 2 and 3).

4.2. Explaining the Observations

In the last section, we observed some commonalities between the experiments on ImageNet and MNIST, but also some very different behaviors. In particular, two aspects stand out: Why does the median alignment steadily increase for the observed ImageNet experiments, whereas on MNIST this stagnates at some point (Figures 2 and 3)? Furthermore, why are ρ̂(x) and α(x) so highly correlated on MNIST but almost uncorrelated on ImageNet (Figure 4)? We turn to Theorems 2 and 3 for answers.

Theorem 2 states that

ρ̂(x) ≤ α†(x) + |β†| / ‖g†‖,    (18)

where β† is the locally constant term and g† is the saliency map of the binarized classifier and v̄ = v/‖v‖ for v ≠ 0. In Figure 6, we check how strongly the right-hand side of inequality (18) is dominated by α†(x), i.e. how large the influence of the locally linear term is in comparison to the locally constant term. For ImageNet, this ratio increases from below 0.55 to almost 0.85, pointing towards a model increasingly governed by its linearized part. On MNIST, this ratio strongly decreases over the range of robustness values. Note however that in the weakly regularized MNIST models, the right-hand side is extremely dominated by the median alignment in the first place.

A similar analysis can be performed for the second inequality from Theorem 2,

ρ̂(x) ≤ α(x) + ‖x‖ · ‖ḡ† − ḡ‖ + |β†| / ‖g†‖,    (19)

which additionally makes a step from binarized alignment to (conventional) alignment.

This leads to an additional error term, making the bound significantly less tight than in the previous case. In particular, the proportion of the alignment α on the right-hand side diminishes, confirming our prediction from section 3.2. Nevertheless, the qualitative behavior is similar to the previous case, with α(x) taking up an increasing fraction of the right-hand side with increasing robustness. For MNIST data, the ratio varies little compared to the ratio from the last inequality. This indicates that the remainder term ‖ḡ† − ḡ‖ does not change too strongly over the set of MNIST experiments compared to α(x). We thus deduce that the qualitative relationship between robustness and alignment is fully governed by the error term introduced in (18), i.e. the locally constant term of the logit.

We now do the same for the inequality in Theorem 3, which states that

ρ̂(x) ≤ |〈ξ, γ〉| / ‖γ‖ + ‖ξ‖ · ‖ḡ† − γ̄‖    (20)

for ξ = x + (β†/‖g†‖) · g†/‖g†‖ and γ = ∇Ψ_{F(x)}(ξ), which gets rid of the additive term |β†|/‖g†‖ from (18). Again, in the case of ImageNet |〈ξ, γ〉| grows more quickly in comparison to ‖ḡ† − γ̄‖, the distance of the normalized gradients, whereas


[Figure 6: the median ratio M[α†(x) / (|β†|/‖g†‖)] plotted against the median robustness M[ρ(x)], for ImageNet (left) and MNIST (right).]

Figure 6. Comparing the size of the summands of inequality (18) for the various experiments. In the case of ImageNet (left), α†(x) takes up an increasing fraction of the right-hand side of the inequality. For MNIST (right), this portion tends to strongly decrease with the robustness. Note however that in this case, α†(x) starts out vastly dominating the right-hand side.

[Figure 7: the median ratio M[α(x) / (|β†|/‖g†‖ + ‖x‖·‖ḡ† − ḡ‖)] plotted against the median robustness M[ρ(x)], for ImageNet (left) and MNIST (right).]

Figure 7. Comparing the size of the summands of inequality (19) for the various experiments. For the ImageNet experiments (left), the portion of α(x) of the right-hand side of the inequality increases roughly 7-fold. For MNIST (right), this portion stays roughly constant compared to the variation from Figure 6.

[Figure 8: the median ratio M[|〈ξ, γ〉| / (‖γ‖ · ‖ξ‖ · ‖ḡ† − γ̄‖)] plotted against the median robustness M[ρ(x)], for ImageNet (left) and MNIST (right).]

Figure 8. Comparing the size of the summands of inequality (20) for the various experiments. In the case of ImageNet (left), the alignment in ξ takes up an increasingly large portion of the right-hand side of the inequality. For MNIST (right), this portion stays roughly constant.

their ratio is approximately constant for MNIST data.

To conclude, we have seen that the upper bounds from Theorems 2 and 3 provide valuable information on the ways in which both the experiments on ImageNet and MNIST are influenced by the respective terms. In the case of ImageNet, we consistently see the alignment terms growing more quickly than the other terms. This might indicate that the growth in alignment stems not only from the growth in the robustness alone, but also from the model becoming increasingly similar to our idealized toy example. In other words, not only does the robustness make the alignment grow, but the connection between these two properties becomes stronger in the case of ImageNet. This is in agreement with the seemingly superlinear growth of the median alignment in Figure 2.

It is not surprising that a classifier for a problem as complex as ImageNet is highly non-linear, which makes the (pointwise) connection between alignment and robustness rather loose. We hence conjecture that the imposed regularization increasingly restricts the models to be more linear, thereby making them more similar to our initial toy example.

For MNIST, the regularization seems to have the opposite effect: As seen in Figure 6, the binarized alignment initially dwarfs the correction term |β†|/‖g†‖ introduced by the locally constant portion of the binarized logit Ψ†_x(x). As the network becomes more robust, Ψ†_x(x) is apparently not dominated by the linear terms anymore, while the influence of the locally constant terms (i.e. β†) increases. This hypothesis seems sensible, considering MNIST is a very simple problem which we tackled with a comparatively shallow network. This can be expected to yield a model with a low degree of non-linearity. The penalization of the local Lipschitz constant here seems to have the effect of requiring larger locally constant terms |β†|, in contrast to the models trained on ImageNet.

We check the validity of these claims by tracking the median size of |〈x, g†〉| against the median size of |Ψ†_x(x)| in Figure 9. On ImageNet, M[|〈x, g†〉|] starts out at approximately 40% of M[|Ψ†_x(x)|] and at the end rises to almost 100%. Note that this does not indicate that β^i is typically close to 0 for all i, just that β† is, compared to 〈x, g†〉. On MNIST, this ratio is close to 1 up until M[ρ(x)] ≈ 2.4, when it suddenly and quickly falls below 0.5. This drop is consistent with what we see in Figure 3: At around the same point this drop occurs, the alignment starts to saturate. While an increase in the model's median robustness should imply an increase in the model's median alignment, the deviation from linearity weakens the connection between robustness and alignment, such that the two effects roughly cancel out.

In Figure 10, we provide examples for the different gradient


[Figure 9: the ratio M[|〈x, g†〉|] / M[|Ψ†(x)|] plotted against the median robustness M[ρ(x)], for ImageNet (left) and MNIST (right).]

Figure 9. On the ImageNet experiments, the linear term |〈x, g†〉| takes up an increasing portion of the binarized score Ψ†(x). In the case of MNIST, Ψ†(x) is completely dominated by the linear term, before its influence decreases sharply at M[ρ(x)] ≈ 2.4.

concepts we introduced in Theorems 2 and 3, both for the most robust and the least robust network from our experiment cohort.

5. Conclusion and Outlook

In this paper, we investigated the connection between a neural network's robustness to adversarial attacks and the interpretability of the resulting saliency maps. Motivated by the binary, linear case, we defined the alignment α as a measure of how much a saliency map matches its respective image. We hypothesized that the perceived increase in interpretability is due to a higher alignment and tested this hypothesis on models trained on MNIST and ImageNet. While on average, the proposed relation holds well, the connection is much less pronounced for individual points, especially on ImageNet. Using some upper bounds for the robustness of a neural network, which we derived using a decomposition theorem, we arrived at the conclusion that the strength of this connection is strongly linked with how similar to a linear model the neural network is locally. As ImageNet is a comparatively complex problem, any sufficiently accurate model is bound to be very non-linear, which explains the difference to MNIST.

While this paper shows the general link between robustness and alignment, there are still some open questions. Since we only used one specific robustification method, further experiments should determine the influence of this method. One could explore whether a different choice of norm leads to different observations. Another future direction of research could be to investigate the degree of (non-)linearity and its connection to this topic. While Theorems 2 and 3 illustrate how the pointwise linearized robustness and alignment may diverge, depending on terms like g, g†, γ and β†, a more in-depth look should focus on why and when these terms have a certain relationship to each other.

From a methodological standpoint, the discovered connection may also serve as an inspiration for new adversarial defenses, where not only the robustness but also the alignment is taken into account. One way of increasing the alignment directly would be through the penalty term

λ (‖x‖² ‖∇Ψ_i(x)‖² − 〈x, ∇Ψ_i(x)〉²),

which is bounded from below by 0 via the Cauchy-Schwarz inequality. Any robustifying effects of the increased alignment may however be confounded with the Lipschitz penalty that the first summand effectively introduces, which necessitates a careful experimental evaluation.
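As a sketch of what such a defense could look like, the penalty can be evaluated with one extra backward pass per batch. The code below is our own hedged illustration (the choice of λ, the batch-wise mean and the use of the predicted-class score are ours, not prescribed above) and would be added to a standard classification loss:

```python
# Sketch (ours) of the proposed alignment penalty lambda * (||x||^2 ||g||^2 - <x, g>^2),
# where g is the input gradient of the predicted-class score. By Cauchy-Schwarz it is
# non-negative and vanishes exactly when x and g are collinear (maximal alignment).
import tensorflow as tf

def alignment_penalty(model, x_batch, lam=1e-3):
    x_batch = tf.convert_to_tensor(x_batch)
    with tf.GradientTape() as tape:
        tape.watch(x_batch)
        logits = model(x_batch, training=True)
        top_scores = tf.reduce_max(logits, axis=-1)        # Psi_i*(x) for each example
    g = tape.gradient(tf.reduce_sum(top_scores), x_batch)  # one saliency map per example
    n = tf.shape(x_batch)[0]
    x_flat = tf.reshape(x_batch, (n, -1))
    g_flat = tf.reshape(g, (n, -1))
    inner = tf.reduce_sum(x_flat * g_flat, axis=-1)
    penalty = (tf.reduce_sum(x_flat ** 2, axis=-1) * tf.reduce_sum(g_flat ** 2, axis=-1)
               - inner ** 2)
    return lam * tf.reduce_mean(penalty)
```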

[Figure 10: image grid with columns ∇Ψ_{i*}(x), ∇Ψ_{i*}(ξ), ∇Ψ_{j*}(x) and ∇(Ψ_{i*} − Ψ_{j*})(x).]

Figure 10. Selected examples from the ImageNet validation set of the different gradients and their respective alignments with x, respectively ξ. The odd rows are generated with the most robust ImageNet classifier, whereas the even rows are generated by the least robust classifier. The gradient images are individually scaled to fit the color range [0, 255].


Acknowledgements

CE and PM acknowledge funding by the Deutsche Forschungsgemeinschaft (DFG) - Projektnummer 281474342: 'RTG π3 - Parameter Identification - Analysis, Algorithms, Applications'. The work by SL was supported by the EPSRC grant EP/L016516/1 for the University of Cambridge Centre for Doctoral Training, the Cambridge Centre for Analysis and by the Cantab Capital Institute for the Mathematics of Information. CBS acknowledges support from the Leverhulme Trust projects on Breaking the non-convexity barrier and on Unveiling the Invisible, the Philip Leverhulme Prize, the EPSRC grant Nr. EP/M00483X/1, the EPSRC Centre Nr. EP/N014588/1, the European Union Horizon 2020 research and innovation programmes under the Marie Skłodowska-Curie grant agreement No 777826 NoMADS and No 691070 CHiPS, the Cantab Capital Institute for the Mathematics of Information and the Alan Turing Institute. We gratefully acknowledge the support of NVIDIA Corporation with the donation of Quadro P6000 and Titan Xp GPUs used for this research.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mane, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viegas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

Anil, C., Lucas, J., and Grosse, R. Sorting out Lipschitz function approximation. arXiv preprint arXiv:1811.05381, 2018.

Carlini, N. and Wagner, D. A. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy, SP 2017, San Jose, CA, USA, May 22-26, 2017, pp. 39–57, 2017.

Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.

Drucker, H. and Le Cun, Y. Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 3(6):991–997, 1992.

Elsayed, G. F., Krishnan, D., Mobahi, H., Regan, K., and Bengio, S. Large margin deep networks for classification. 2018. URL https://arxiv.org/pdf/1803.05598.pdf.

Goodfellow, I., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Jakubovitz, D. and Giryes, R. Improving DNN robustness to adversarial attacks using Jacobian regularization. In The European Conference on Computer Vision (ECCV), September 2018.

Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. E., and Jackel, L. D. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pp. 396–404, 1990.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436, 2015.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. 2018.

Montavon, G., Lapuschkin, S., Binder, A., Samek, W., and Müller, K.-R. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition, 65:211–222, 2017.

Moosavi-Dezfooli, S.-M., Fawzi, A., and Frossard, P. DeepFool: A simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582, 2016.

Rauber, J., Brendel, W., and Bethge, M. Foolbox v0.8.0: A Python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131, 2017.

Schott, L., Rauber, J., Brendel, W., and Bethge, M. Robust perception through analysis by synthesis. CoRR, abs/1805.09190, 2018. URL http://arxiv.org/abs/1805.09190.


Simon-Gabriel, C.-J., Ollivier, Y., Schölkopf, B., Bottou, L., and Lopez-Paz, D. Adversarial vulnerability of neural networks increases with input dimension. arXiv preprint arXiv:1802.01421, 2018.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Smilkov, D., Thorat, N., Kim, B., Viegas, F., and Wattenberg, M. SmoothGrad: Removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.

Sokolic, J., Giryes, R., Sapiro, G., and Rodrigues, M. R. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265–4280, 2017.

Springenberg, J., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for simplicity: The all convolutional net. In ICLR (workshop track), 2015. URL http://lmb.informatik.uni-freiburg.de/Publications/2015/DB15a.

Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. Intriguing properties of neural networks. 2014.

Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SyxAb30cY7.