



Is Attention Interpretable?

Sofia Serrano∗ Noah A. Smith∗†
∗Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA
†Allen Institute for Artificial Intelligence, Seattle, WA, USA
{sofias6,nasmith}@cs.washington.edu

Abstract

Attention mechanisms have recently boosted performance on a range of NLP tasks. Because attention layers explicitly weight input components' representations, it is also often assumed that attention can be used to identify information that models found important (e.g., specific contextualized word tokens). We test whether that assumption holds by manipulating attention weights in already-trained text classification models and analyzing the resulting differences in their predictions. While we observe some ways in which higher attention weights correlate with greater impact on model predictions, we also find many ways in which this does not hold, i.e., where gradient-based rankings of attention weights better predict their effects than their magnitudes. We conclude that while attention noisily predicts input components' overall importance to a model, it is by no means a fail-safe indicator.1

1 Introduction

Interpretability is a pressing concern for many current NLP models. As they become increasingly complex and learn decision-making functions from data, ensuring our ability to understand why a particular decision occurred is critical.

Part of that development has been the incorporation of attention mechanisms (Bahdanau et al., 2015) into models for a variety of tasks. For many different problems—to name a few, machine translation (Luong et al., 2015), syntactic parsing (Vinyals et al., 2015), reading comprehension (Hermann et al., 2015), and language modeling (Liu and Lapata, 2018)—incorporating attention mechanisms into models has proven beneficial for performance.

1 Code is available at https://github.com/serrano-s/attn-tests.

While there are many variants of attention (Vaswani et al., 2017), each formulation consists of the same high-level goal: calculating nonnegative weights for each input component (e.g., word) that together sum to 1, multiplying those weights by their corresponding representations, and summing the resulting vectors into a single fixed-length representation.

Since attention calculates a distribution over inputs, prior work has used attention as a tool for interpretation of model decisions (Wang et al., 2016; Lee et al., 2017; Lin et al., 2017; Ghaeini et al., 2018). The existence of so much work on visualizing attention weights is a testament to attention's popularity in this regard; to name just a few examples of these weights being examined to understand a model, recent work has focused on goals from explaining and debugging the current system's decision (Lee et al., 2017; Ding et al., 2017) to distilling important traits of a dataset (Yang et al., 2017; Habernal et al., 2018).

Despite this, existing work on interpretability is only beginning to assess what computed attention weights actually communicate. In an independent and contemporaneous study, Jain and Wallace (2019) explore whether attention mechanisms can identify the relative importance of inputs to the full model, finding them to be highly inconsistent predictors. In this work, we apply a different analysis based on intermediate representation erasure to assess whether attention weights can instead be relied upon to explain the relative importance of the inputs to the attention layer itself. We find similar cause for concern: attention weights are only noisy predictors of even intermediate components' importance, and should not be treated as justification for a decision.

2 Testing for Informative Interpretability

We focus on five- and ten-class text classification models incorporating attention, as explaining the reasons for text classification has been a particular area of interest for recent work in interpretability (Yang et al., 2016; Ribeiro et al., 2016; Lei et al., 2016; Feng et al., 2018).



Figure 1: Our method for calculating the importance of representations corresponding to zeroed-out attention weights, in a hypothetical setting with four output classes.


In order for a model to be interpretable, it must not only suggest explanations that make sense to people, but also ensure that those explanations accurately represent the true reasons for the model's decision. Note that this type of analysis does not rely on the true labels of the data; if a model produces an incorrect output, but a faithful explanation for which factors were important in that calculation, we still consider it interpretable.

We take the implied explanation provided by visualizing attention weights to be a ranking of importance of the attention layer's input representations, which we denote I: if the attention allocated to item i ∈ I is higher than that allocated to item j ∈ I, then i is presumed "more important" than j to the model's output. In this work, we focus on whether the attention weights' suggested importance ranking of I faithfully describes why the model produced its output, echoing existing work on explanation brittleness for other model components (Ghorbani et al., 2017).

2.1 Intermediate Representation Erasure

We are interested in the impact of some contextualized inputs to an attention layer, I′ ⊂ I, on the model's output. To examine the importance of I′, we run the classification layer of the model twice (Figure 1): once without any modification, and once after renormalizing the attention distribution with I′'s attention weights zeroed out, similar to other erasure-based work (Li et al., 2016; Feng et al., 2018).

We then observe the resulting effects on the model's output. We erase at the attention layer to isolate the effects of the attention layer from the encoder preceding it. Our reasoning behind renormalizing is to keep the output document representation from artificially shrinking closer to 0 in a way never encountered during training, which could make subsequent measurements unrepresentative of the model's behavior in spaces to which it does map inputs.
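To make the erasure procedure concrete, here is a minimal Python sketch; the names `alpha` (the attention distribution), `H` (the attention layer's input representations), `classify` (the classification layer), and `erase_idx` (the indices of I′) are illustrative placeholders rather than names from the released code.

```python
import numpy as np

def erase_and_renormalize(alpha, erase_idx):
    """Zero the attention weights indexed by `erase_idx`, then renormalize the
    remaining weights so they again sum to 1."""
    alpha = np.array(alpha, dtype=float)
    alpha[list(erase_idx)] = 0.0
    total = alpha.sum()
    if total == 0.0:   # every weight erased; caller substitutes an arbitrary (zero) vector
        return alpha
    return alpha / total

def output_after_erasure(alpha, H, classify, erase_idx):
    """Recompute the model's output distribution after erasing the set I'.
    `H` holds the attention layer's input representations (one row per attended item);
    `classify` maps an attended document vector to an output distribution."""
    new_alpha = erase_and_renormalize(alpha, erase_idx)
    doc_vector = new_alpha @ H   # renormalized weighted sum of representations
    return classify(doc_vector)
```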

One point worth noting is the facet of interpretability that our tests are designed to capture. By examining only how well attention represents the importance of intermediate quantities, which may themselves already have changed uninterpretably from the model's inputs, we are testing for a relatively low level of interpretability. So far, other work looking at attention has examined whether attention suffices as a holistic explanation for a model's decision (Jain and Wallace, 2019), which is a higher bar. We instead focus on the lowest standard of interpretability that attention might be expected to meet, ignoring prior model layers.

We denote the output distributions (over labels) as p (the original) and qI′ (where we erase attention for I′). The question now becomes how to operationalize "importance" given p and qI′. There are many quantities that could arguably capture information about importance. We focus on two: the Jensen-Shannon (JS) divergence between output distributions p and qI′, and whether the argmaxes of p and qI′ differ, indicating a decision flip.
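As a small sketch of how these two quantities can be computed, assuming `p` and `q` are numpy arrays holding the original and post-erasure output distributions (the base-2 logarithm is one common convention for JS divergence; the paper does not specify the base):

```python
import numpy as np

def kl_divergence(p, q):
    # Kullback-Leibler divergence (base 2), skipping zero-probability terms of p
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js_divergence(p, q):
    # Jensen-Shannon divergence between the original and post-erasure output distributions
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def decision_flip(p, q):
    # True if erasure changes the argmax label
    return int(np.argmax(p)) != int(np.argmax(q))
```

The decision-flip indicator is what the contingency tables in Section 4.2 count.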

3 Data and Models

We investigate four model architectures on a topic classification dataset (Yahoo Answers; Zhang et al., 2015) and on three review ratings datasets: IMDB (Diao et al., 2014),2 Yelp 2017,3 and Amazon (Zhang et al., 2015). Statistics for each dataset are listed in Table 1.

Our model architectures are inspired by the hierarchical attention network (HAN; Yang et al., 2016), a text classification model with two layers of attention, first to the word tokens in each sentence and then to the resulting sentence representations. The layer that classifies the document representation is linear with a softmax at the end.

2 downloaded from github.com/nihalb/JMARS
3 from www.yelp.com/dataset_challenge


Dataset          Av. # Words (s.d.)   Av. # Sents. (s.d.)   # Train. + Dev.   # Test    # Classes
Yahoo Answers    104 (114)            6.2 (5.9)             1,400,000         50,000    10
IMDB             395 (259)            16.2 (10.7)           122,121           13,548    10
Amazon           73 (48)              4.3 (2.6)             3,000,000         650,000   5
Yelp             125 (109)            7.0 (5.6)             650,000           50,000    5

Table 1: Dataset statistics.

Figure 2: Flat attention network (FLAN) demonstrating a convolutional encoder. Each contextualized word representation is the concatenation of two sizes of convolutions: one applied over the input representation and its two neighbors to either side, and the other applied over the input representation and its single neighbor to either side. For details, see Appendix A.1.

We conduct our tests on the softmax formulation of attention,4 which is used by most models, including the HAN. Specifically, we use the additive formulation originally defined in Bahdanau et al. (2015). Given attention layer ℓ's learned parameters, element i of a sequence, and its encoded representation hi, the attention weight αi is computed using ℓ's learned context vector cℓ as follows:

u_i = \tanh(W_\ell h_i + b_\ell)

\alpha_i = \frac{\exp(u_i^\top c_\ell)}{\sum_j \exp(u_j^\top c_\ell)}
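A minimal numpy sketch of this additive attention computation follows; `H` holds one encoded representation h_i per row, and `W`, `b`, and `c` stand in for the layer's learned parameters W_ℓ, b_ℓ, and c_ℓ (variable names and shapes are our assumptions, not taken from the released code).

```python
import numpy as np

def additive_attention(H, W, b, c):
    """Additive (Bahdanau-style) attention over encoded items.
    H: (n, d) encoded representations h_i; W: (a, d); b: (a,); c: (a,) context vector."""
    U = np.tanh(H @ W.T + b)   # u_i = tanh(W h_i + b), shape (n, a)
    scores = U @ c             # u_i^T c, shape (n,)
    scores -= scores.max()     # numerical stabilization of the softmax
    alpha = np.exp(scores)
    alpha /= alpha.sum()       # nonnegative weights summing to 1
    attended = alpha @ H       # weighted sum of the representations
    return alpha, attended
```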

We evaluate on the original HAN architecture, but we also vary that architecture in two key ways:

1. Number of attention layers: besides exploring models with a final layer of attention over sentence representations (which we denote with a "HAN" prefix), we also train "flat" attention networks with only one layer of attention over all contextualized word tokens (which we denote with a "FLAN" prefix).

4 Alternatives such as sparse attention (Martins and Astudillo, 2016) and unnormalized attention (Ji and Smith, 2017) have been proposed.

In either case, though, we only run tests on models' final layer of attention.

2. Reach of encoder contextualization: the original HAN uses recurrent encoders to contextualize input tokens prior to an attention layer (specifically, bidirectional GRUs running over the full sequence). Aside from biRNNs, we also experiment with models that instead contextualize word vectors by convolutions on only a token's close neighbors, inspired by Kim (2014). See Figure 2 for a diagram of the FLAN architecture using a convolutional encoder. We denote this variant of an architecture with a "conv" suffix. Finally, we also test models that are trained with no contextualizing encoder at all; we denote these with a "noenc" suffix.

The classification accuracy of each of our trained models is listed in Table 3 in the appendix, along with training details for the different models.

4 Single Attention Weights’ Importance

As a starting point for our tests, we investigate the relative importance of attention weights when only one weight is removed. Let i∗ ∈ I be the component with the highest attention and let αi∗ be its attention. We compare i∗'s importance to some other attended item's importance in two ways.

4.1 JS Divergence of Model Output Distributions

We wish to compare how i∗'s impact on the model's output distribution compares to the impact corresponding to a random attended item r drawn uniformly from I. Our first approach to this will be to calculate two JS divergences—one being the JS divergence of the model's original output distribution from its output distribution after removing only i∗, and the other after removing only r—and compare them to each other.


We subtract the output JS divergence after removing r from the output JS divergence after removing i∗:

\Delta JS = JS(p, q_{\{i^*\}}) - JS(p, q_{\{r\}})    (1)

We plot this quantity against the difference ∆α = αi∗ − αr in Figure 3. We show results on the HANrnn, as the trends for the other models are very similar; see Figures 7–8 and the tables in Figure 9 in the Appendix for full results.

Figure 3: Difference in attention weight magnitudes versus ∆JS for HANrnns, comparable to results for the other architectures; for their plots, see Appendix A.2.

Figure 4: These are the counts of test instances for the HANrnn models for which i∗'s JS divergence was smaller, binned by ∆α. These counts comprise a small fraction of the test set sizes listed in Table 1.

Intuitively, if i∗ is truly the most important, then we would expect Eq. 1 to be positive, and that is what we find the vast majority of the time.

In addition, examining Figure 3, we see that nearly all negative ∆JS values are close to 0. By binning occurrences of negative ∆JS values by the difference between αi∗ and αr in Figure 4, we also see that in the cases where i∗ had a smaller effect, the gap between i∗'s attention and r's tends to be small. This is encouraging, indicating that in these cases, i∗ and r are nearly "tied" in attention.

However, the picture of attention's interpretability grows somewhat more murky when we begin to consider the magnitudes of positive ∆JS values in Figure 3. We notice across datasets that even for quite large differences in attention weights like 0.4, many of the positive ∆JS are still quite close to zero. Although we do finally see an upward swing in ∆JS values once ∆α gets even larger, indicating only one very high attention weight in the distribution, this still leaves many open questions about exactly how much difference in impact i∗ and r can typically be expected to have.

4.2 Decision Flips Caused by Zeroing Attention

Remove i∗: decision flip? (rows) vs. remove random r: decision flip? (columns)

Yahoo                          IMDB
           r: Yes   r: No                  r: Yes   r: No
i∗: Yes      0.5     8.7       i∗: Yes       2.2    12.2
i∗: No       1.3    89.6       i∗: No        1.4    84.2

Amazon                         Yelp
           r: Yes   r: No                  r: Yes   r: No
i∗: Yes      2.7     7.6       i∗: Yes       1.5     8.9
i∗: No       2.7    87.1       i∗: No        1.9    87.7

Table 2: Percent of test instances in each decision-flip indicator variable category for each HANrnn.

Since attention weights are often interpreted as an explanation for a model's argmax decision, our second test looks at another, more immediately visible change in model outputs: decision flips. For clarity, we limit our discussion to results for the HANrnns, which reflect the same patterns observed for the other architectures. (Results for all other models are in Appendix A.2.)

Table 2 shows, for each dataset, a contingency table for the two binary random variables (i) does zeroing αi∗ (and renormalizing) result in a decision flip? and (ii) does doing the same for a different randomly chosen weight αr result in a decision flip? To assess the comparative importance of i∗ and r, we consider when exactly one erasure changes the decision (off-diagonal cells).


For attention to be interpretable, the blue, upper-right values (i∗, not r, flips a decision) should be much larger than the orange, lower-left values (r, not i∗, flips a decision), which should be close to zero.5
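A sketch of how such a contingency table can be tallied over a test set is given below; `highest_attention_index` and `flips_decision` are hypothetical helpers (the latter zeroes one attention weight, renormalizes, and checks whether the argmax changes), and `instance.num_attended_items` is likewise an assumed attribute, none of them taken from the released code. Each instance is assumed to attend to more than one item.

```python
import random
from collections import Counter

def decision_flip_contingency(instances, highest_attention_index, flips_decision):
    """Tally the four (erase i*, erase random r) decision-flip outcomes over a test set."""
    counts = Counter()
    for instance in instances:
        i_star = highest_attention_index(instance)   # index of the highest attention weight
        others = [j for j in range(instance.num_attended_items) if j != i_star]
        r = random.choice(others)                    # a different randomly chosen weight
        counts[(flips_decision(instance, i_star),    # does erasing i* flip the decision?
                flips_decision(instance, r))] += 1   # does erasing r flip the decision?
    total = sum(counts.values())
    return {outcome: 100.0 * n / total for outcome, n in counts.items()}  # percentages, as in Table 2
```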

Although for some datasets in Table 2, the "orange" values are non-negligible, we mostly see that their fraction of total off-diagonal values mirrors the fraction of negative occurrences of Eq. 1 in Figure 4. However, it's somewhat startling that in the vast majority of cases, erasing i∗ does not change the decision ("no" row of each table). This is likely explained in part by the signal pertinent to the classification being distributed across a document (e.g., a "Sports" question in the Yahoo Answers dataset could signal "sports" in a few sentences, any one of which suffices to correctly categorize it). However, given that these results are for the HAN models, which typically compute attention over ten or fewer sentences, this is surprising.

Altogether, examining importance from a single-weight angle paints a tentatively positive picture of attention's interpretability, but also raises several questions about the many cases where the difference in impacts between i∗ and r is almost identical (i.e., ∆JS values close to 0, or the many cases where neither i∗ nor r causes a decision flip). To answer these questions, we require tests with a broader scope.

5 Importance of Sets of Attention Weights

Often, we care about determining the collective importance of a set of components I′. To address that aspect of attention's interpretability and close gaps left by single-weight tests, we introduce tests to determine how multiple attention weights perform together as importance predictors.

5.1 Multi-Weight Tests

For a hypothesized ranking of importance, such as that implied by attention weights, we would expect the items at the top of that ranking to function as a concise explanation for the model's decision. The less concise these explanations get, and the farther down the ranking that the items truly driving the model's decision fall, the less likely it becomes for that ranking to truly describe importance.

5 We see this pattern especially strongly for FLANs (see Appendix), which is unsurprising since I is all words in the input text, so most attention weights are very small.

In other words, we expect that the top items in a truly useful ranking of importance would comprise a minimal necessary set of information for making the model's decision.

The idea of a minimal set of inputs necessary to uphold a decision is not new; Li et al. (2016) use reinforcement learning to attempt to construct such a minimal set of words, Lei et al. (2016) train an encoder to constrain the input prior to classification, and much of the work that has been done on extractive summarization takes this concept as a starting point (Lin and Bilmes, 2011). However, such work has focused on approximating minimal sets, instead of evaluating the ability of other importance-determining "shortcuts" (such as attention weight orderings) to identify them. Nguyen (2018) leveraged the idea of minimal sets in a much more similar way to our work, comparing different input importance orderings.

Concretely, to assess the validity of an importance ranking method (e.g., attention), we begin erasing representations from the top of the ranking downward until the model's decision changes. Ideally, we would then enumerate all possible subsets of that instance's components, observe whether the model's decision changed in response to removing each subset, and then report whether the size of the minimal decision-flipping subset was equal to the number of items that had needed to be removed to achieve a decision flip by following the ranking. However, the exponential number of subsets for any given instance's sequence of components (word or sentence representations, in our case) makes such a strategy computationally prohibitive, and so we adopt a different approach.
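A sketch of this removal procedure, assuming the same illustrative `alpha`, `H`, and `classify` objects as in the earlier sketches and a `ranking` of indices from hypothesized most to least important:

```python
import numpy as np

def items_removed_until_flip(alpha, H, classify, ranking):
    """Zero attention weights cumulatively, following `ranking`, until the argmax
    decision changes. Returns how many items had to be removed, or None if the
    decision never flips."""
    original_label = np.argmax(classify(alpha @ H))
    remaining = np.array(alpha, dtype=float)
    for num_removed, idx in enumerate(ranking, start=1):
        remaining[idx] = 0.0
        total = remaining.sum()
        if total > 0:
            doc = (remaining / total) @ H   # renormalize and re-pool
        else:
            doc = np.zeros(H.shape[1])      # all items erased: arbitrary (zero) vector
        if np.argmax(classify(doc)) != original_label:
            return num_removed
    return None   # never flipped; such instances are excluded from analysis (Section 5.3)
```

Dividing the returned count by the sequence length gives the fraction-removed quantity plotted in Figures 5 and 6.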

Instead, in addition to our hypothesized importance ranking (attention weights), we consider alternative rankings of importance; if, using those, we repeatedly discover cases where removing a smaller subset of items would have sufficed to change the decision, this signals that our candidate ranking is a poor indicator of importance.

5.2 Alternative Importance Rankings

Exhaustively searching the space of component subsets would be far too time-consuming in practice, so we introduce three other ranking schemes.

The first is to randomly rank importance. We expect that this ranking will perform quite poorly, but it provides a point of comparison by which to validate that ranking by descending attention weights is at least somewhat informative.


The second ranking scheme, inspired by Li et al. (2015) and Feng et al. (2018), is to order the attention weights by the gradient of the decision function with respect to each calculated attention weight, in descending order. Since each of the datasets on which we evaluate has either five or ten output classes, we take the decision function given a real-valued model output vector to be

d(x) = \frac{\exp(\max_i x_i)}{\sum_i \exp(x_i)}.

Unlike the last two proposed rankings, our third ranking scheme uses attention weights, but supplements them with information about the gradient. For this ranking, we multiply each of our calculated gradients from our previous proposed ranking scheme by their corresponding attention weight magnitude. Under this ordering, attended items that have both a high attention weight and a high calculated gradient with respect to their attention weight will be ranked most important.
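The following PyTorch sketch illustrates one way to compute these two rankings, treating the attention weights as free differentiable inputs to the rest of the model; `classify` and the use of raw (rather than absolute) gradient values are our simplifying assumptions, not details taken from the paper's implementation.

```python
import torch

def gradient_based_ranking(alpha, H, classify, use_product=False):
    """Rank attended items by the gradient of the decision function d(x) with respect
    to each attention weight; optionally multiply by the weight itself to obtain the
    gradient-attention-product ranking."""
    alpha = alpha.clone().detach().requires_grad_(True)
    doc = alpha @ H                      # attended document representation
    x = classify(doc)                    # real-valued output vector
    d = torch.softmax(x, dim=0).max()    # equals exp(max_i x_i) / sum_i exp(x_i)
    d.backward()                         # populates alpha.grad with dd/dalpha_i
    scores = alpha.grad
    if use_product:
        scores = scores * alpha.detach()
    return torch.argsort(scores, descending=True)   # hypothesized most to least important
```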

We introduce these last two rankings as an attempt to discover smaller sets not produced by the attention weight ranking. Note, however, that we still do not take either as a gold-standard indicator of importance to the model, as with the gradient in Ross et al. (2017) and Melis and Jaakkola (2018), but merely as an alternative ordering method. The "gold standard" in our case would be the minimal set of attention weights to zero out for the decision to change, which none of our ordering methods will necessarily find for a particular instance.

5.3 Instances Excluded from Analysis

In cases where removing all but one input to the attention layer still does not produce a decision flip, we finish the process of removing components by removing the final representation and replacing the output of the attention layer with an arbitrary vector; we use the zero vector for our tests. Even so, since every real-valued vector output by the attention layer is mapped to an output distribution, removing this final item will still not change the classification decision for instances that the model happened to originally map to that same class. We exclude such instances for which the decision never changed from all subsequent analyses.

We also set aside any test instances with a sequence length of 1 for their final attention layer, as there is only one possible ordering for such cases.

5.4 Attention Does Not Optimally Describe Model Decisions

Examining our results in Figure 5, we immediately see that ranking importance by descending attention weights is not optimal for our models with encoders. While removing intermediate representations in decreasing order by attention weights often leads to a decision flip faster than a random ranking, it also clearly falls short of matching (or even approaching) the decision-flipping efficiency of either the gradient ordering or gradient-attention-product ordering in many cases.

In addition, although the product-based ranking often (but not always) requires slightly fewer removed items than the gradient ranking, we see that the purely gradient-based ranking ignoring attention magnitudes comes quite close to it, far outperforming purely attention-based orderings. For ten of our 16 models with encoders, removing by gradient found a smaller decision-flipping set of items than attention for over 50% of instances in that model's test set, with that percentage often being much higher. In fact, for every model with an encoder that we tested, there were at least 1.6 times as many test instances where the purely gradient-based ranking managed a decision flip faster than the attention-based ranking than vice versa.

We do not claim that ranking importance by either descending gradients or descending gradient-attention products is optimal, but in many cases they discover much smaller decision-flipping sets of items than attention weights. Therefore, ranking representations in descending order by attention weight clearly fails to uncover a minimal set of decision-flipping information much of the time, which is a warning sign that we should be skeptical of trusting groups of attention weight magnitudes as importance indicators.

5.5 Decision Flips Often Occur Late

For all ordering schemes we tried, we were struck by the large fraction of items that had to be removed to achieve a decision flip in many models. This is slightly less surprising for the HANs, as they compute attention over shorter sequences of sentences (see Table 1). For the FLAN models, though, this result is highly unexpected. The sequences across which FLANs compute attention are usually hundreds of tokens in length, meaning most attention weights will likely be minuscule.



Figure 5: The distribution of fractions of items removed before first decision flips on three model architectures under different ranking schemes. Boxplot whiskers represent the highest/lowest data point within 1.5 IQR of the higher/lower quartile, and dataset names at the bottom apply to their whole column. In several of the plots, the median or lower quartile aren't visible; in these cases, the median/lower quartile is either 1 or very close to 1.

The distributions of tokens removed by our different orderings that we see for the FLANrnns in Figure 5 are therefore remarkably high, especially given that all of our classification tasks have at least five output classes. We also note that due to the exponential nature of the softmax, softmax attention distributions typically contain only a few high-weighted items before the calculated weights become quite small, which can be misleading. In many cases, flipping the model's original decision requires digging deep into the small attention weights, with the high-weighted components not actually being the reason for the decision.

For several of our models, especially the FLANs (which typically compute attention over hundreds of tokens), this fact is concerning from an explainability perspective. Lipton (2016) describes a model as "transparent" if "a person can contemplate the entire model at once." Applying this insight to the explanations suggested by attention, if an explanation rests on simultaneously considering hundreds of attended tokens necessary for a decision—even if that set were minimal—that would still raise serious transparency concerns.

5.6 Effects of Contextualization Scope on Attention's Interpretability

One last question we consider is whether the large number of items that are removed before decision flips can be explained in part by the scope of each model's contextualization. In machine translation, prior work has observed that recurrent encoders over a full sequence can "shift" tokens' signal in ways that cause subsequent attention layers to compute unintuitive off-by-one alignments (Koehn and Knowles, 2017). We hypothesize that in our text classification setting, the bidirectional recurrent structure of the HANrnn and FLANrnn encoders might instead be redistributing operative signal from a few informative input tokens across many others' contextualized representations.

Comparing the decision flip results for the FLANconvs in Figure 5 to those for the FLANrnns supports this theory. We notice decision flips happening much faster than for either of the RNN-based model architectures, indicating that the biRNN effectively does learn to widely redistribute the classification signal.


Figure 6: The distribution of fractions of items removed before decision flips on the encoderless model architectures under different ranking schemes. The Amazon FLANnoenc results have a very long tail; using the legend's order of rankings, the percentage of test instances with a removed fraction above 0.50 for that model is 12.4%, 2.8%, 0.9%, and 0.5%, respectively.

In contrast, the convolutional encoders only allow contextualization with respect to an input token's two neighbors to either side. We see similar results when comparing the two HAN architectures, albeit much more weakly (see Figure 10 in Appendix A.2); this is likely due to the smaller number of tokens being contextualized by the HANs (sentence representations instead of words), so that contextualization with respect to a token's close neighbors encompasses a much larger fraction of the full sequence.

We see this difference even more strongly when we compare to the encoderless model architectures, as shown in Figure 6. Compared to both other model architectures, we see the fraction of necessary items to erase for flipping the decision plummet. We also see random orderings mostly do better than before, indicating more brittle decision boundaries, especially on the Amazon dataset.6 In this situation, we see attention magnitudes generally indicate importance on par with (or better than) gradients, but that the product-based ordering is still often a more efficient explanation.

6 This is likely due to the fact that with no contextualization, the final attended representations are just a linear combination of the input embeddings, so the embeddings themselves are responsible for learning to directly encode a decision. Since Amazon has the largest ratio of documents (which probably vary in their label) to unique word embeddings by a factor of more than two times any other dataset's, and the final attended representations in the FLANnoencs are unaggregated word embeddings, it stands to reason that the lack of encoders would be a much bigger obstacle in its case.


While these differences themselves are not an argument against attention's interpretability, they highlight the distinction between attention's weighting of intermediate, contextualized representations and the model's use of the original input tokens themselves. Our RNN-based models' ability to maintain their original decision well past the point at which models using only local or no context have lost the signal driving their original decisions confirms that attention weights for a contextualized representation do not necessarily map neatly back to the original tokens. This might at least partly explain the striking near-indifference of the model's decision to the contributions of particular contextualized representations in both our RNN-based models and in Jain and Wallace (2019), who also use recurrent encoders.

However, the results from almost all models continue to support that ranking importance by attention is still not optimal; our non-random alternative rankings still uncover many cases where fewer items could be removed to achieve a decision flip than the attention weights imply.

6 Limitations

There are important limitations to the work we describe here, perhaps the most important of which is our focus on text classification.


By choosing to focus on this task, we use the fact that decision flips are often not trivially achieved to ground our judgments of importance in model decision changes. However, for a task with a much larger output space (such as language modeling or machine translation) where almost anything might flip the decision, decision flips are likely too coarse a signal to identify meaningful differences. Determining an analogous informative threshold in changes to model outputs would be key to expanding this sort of analysis to other groups of models.

A related limitation is our reliance in many of these tests on a fairly strict definition of importance tied to the output's argmax; an alternative definition of importance might assert that the highest attention weights should identify the most influential representations in pushing towards any output class, not just the argmax. Two of the core challenges that would need to be solved to test for how well attention meets this relaxed criterion would be meaningfully evaluating a single attended item's "importance" to multiple output classes for comparison to other attended items and, once again, determining what would truly indicate being "most influential" in the absence of decision flips as a guide to the output space.

Also, while we explore several model architectures in this work, there exist other attention functions such as multi-headed and scaled dot-product attention (Vaswani et al., 2017), as well as cases where a single attention layer is responsible for producing more than one attended representation, such as in self-attention (Cheng et al., 2016). These variants could have different interpretability properties. Likewise, we only evaluate on final layers of attention here; in large models, lower-level layers of attention might learn to work differently.

7 Related and Future Work

We have adopted an erasure-based approach to probing the interpretability of computed attention weights, but there are many other possible approaches. For example, recent work has focused on which training instances (Koh and Liang, 2017) or which human-interpretable features were most relevant for a particular decision (Ribeiro et al., 2016; Arras et al., 2016). Others have explored alternative ways of comparing the behavior of proposed explanation methods (Adebayo et al., 2018).

Yet another line of work focuses on aligning models with human feedback for what is interpretable (Fyshe et al., 2015; Subramanian et al., 2017), which could refine our idea of what defines a high-quality explanation derived from attention.

Finally, another direction for future work would be to extend the importance-ranking comparisons that we deploy here for evaluation purposes into a method for deriving better, more informative rankings, which in turn could be useful for the development of new, more interpretable models.

8 Conclusion

It is frequently assumed that attention is a tool for interpreting a model, but we find that attention does not necessarily correspond to importance. In some ways, the two correlate: comparing the highest attention weight to a lower weight, the high attention weight's impact on the model is often larger. However, the picture becomes bleaker when we consider the many cases where the highest attention weights fail to have a large impact. Examining these cases through multi-weight tests, we see that attention weights often fail to identify the sets of representations most important to the model's final decision. Even in cases when an attention-based importance ranking flips the model's decision faster than an alternative ranking, the number of zeroed attended items is often too large to be helpful as an explanation. We also see a marked effect of the contextualization scope preceding the attention layer on the number of attended items affecting the model's decision; while attention magnitudes do seem more helpful in uncontextualized cases, their lagging performance in retrieving decision rationales elsewhere is cause for concern. What is clear is that in the settings we have examined, attention is not an optimal method of identifying which attended elements are responsible for an output. Attention may yet be interpretable in other ways, but as an importance ranking, it fails to explain model decisions.

Acknowledgments

This research was supported in part by a grant from the Allstate Corporation; findings do not necessarily represent the views of the sponsor. We thank R. Andrew Kreek, Paul Koester, Kourtney Traina, and Rebecca Jones for early conversations leading to this work. We also thank Omer Levy, Jesse Dodge, Sarthak Jain, Byron Wallace, and Dan Weld for helpful conversations, and our anonymous reviewers for their feedback.


References

Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. 2018. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems.

Leila Arras, Franziska Horn, Gregoire Montavon, Klaus-Robert Muller, and Wojciech Samek. 2016. Explaining predictions of non-linear classifiers in NLP. arXiv preprint arXiv:1606.07298.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations.

Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alexander J. Smola, Jing Jiang, and Chong Wang. 2014. Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Yanzhuo Ding, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Visualizing and understanding neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretation difficult. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Alona Fyshe, Leila Wehbe, Partha P. Talukdar, Brian Murphy, and Tom M. Mitchell. 2015. A compositional and interpretable semantic space. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Reza Ghaeini, Xiaoli Z. Fern, and Prasad Tadepalli. 2018. Interpreting recurrent and attention-based neural models: A case study on natural language inference. arXiv preprint arXiv:1808.03894.

Amirata Ghorbani, Abubakar Abid, and James Zou. 2017. Interpretation of neural networks is fragile. arXiv preprint arXiv:1710.10547.

Ivan Habernal, Henning Wachsmuth, Iryna Gurevych, and Benno Stein. 2018. Before name-calling: Dynamics and triggers of ad hominem fallacies in web argumentation. arXiv preprint arXiv:1802.06613.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).

Yangfeng Ji and Noah A. Smith. 2017. Neural discourse structure for text categorization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation.

Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730.

Jaesong Lee, Joong-Hwi Shin, and Jun-Seok Kim. 2017. Interactive visualization and manipulation of attention-based neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 121–126.

Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2015. Visualizing and understanding neural models in NLP. arXiv preprint arXiv:1506.01066.

Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220.

Hui Lin and Jeff Bilmes. 2011. A class of submodular functions for document summarization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1.

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.


Zachary C. Lipton. 2016. The mythos of model interpretability. arXiv preprint arXiv:1606.03490.

Yang Liu and Mirella Lapata. 2018. Learning structured text representations. Transactions of the Association for Computational Linguistics, 6:63–75.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Andre Martins and Ramon Astudillo. 2016. From softmax to sparsemax: A sparse model of attention and multi-label classification. In International Conference on Machine Learning.

David Alvarez Melis and Tommi Jaakkola. 2018. Towards robust interpretability with self-explaining neural networks. In Advances in Neural Information Processing Systems.

Dong Nguyen. 2018. Comparing automatic and human evaluation of local explanations for text classification. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Andrew Slavin Ross, Michael C. Hughes, and Finale Doshi-Velez. 2017. Right for the right reasons: Training differentiable models by constraining their explanations. In Proceedings of the International Joint Conference on Artificial Intelligence.

Anant Subramanian, Danish Pruthi, Harsh Jhamtani, Taylor Berg-Kirkpatrick, and Eduard Hovy. 2017. SPINE: Sparse interpretable neural embeddings. arXiv preprint arXiv:1711.08792.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems.

Yequan Wang, Minlie Huang, Li Zhao, et al. 2016. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606–615.

Fan Yang, Arjun Mukherjee, and Eduard Dragut. 2017. Satirical news detection and analysis using attention mechanism and linguistic features. arXiv preprint arXiv:1709.01189.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems.


A Appendices

A.1 Model Hyperparameters and Performance

We lowercased all tokens during preprocessing and used all hyperparameters specified in Yang et al. (2016), except for those related to the optimization algorithm or, in the case of the convolutional or no-encoder models, the encoder. For each convolutional encoder, we trained two convolutions: one sweeping over five tokens, and one sweeping over three. As the output representation of token x, we then concatenated the outputs of the five-token and three-token convolutions centered on x. Unless otherwise noted, to train each model, we used Adam (Kingma and Ba, 2014) with gradient clipping of 10.0 and a patience value of 5, so we would stop training a model if five epochs elapsed without any improvement in validation set accuracy. In addition, for each model, we specified a learning rate for training, and dropout before each encoder layer (or attention layer, for the encoderless models) and also within the classification layer. For the HAN models, these are the values we used:

• Yahoo Answers HANrnn, Yahoo Answers HANconv
  – Pre-sentence-encoder dropout: 0.4445
  – Pre-document-encoder dropout: 0.2202
  – Classification layer dropout: 0.3749
  – Learning rate: 0.0004

• IMDB HANrnn
  – Pre-sentence-encoder dropout: 0.4445
  – Pre-document-encoder dropout: 0.2202
  – Classification layer dropout: 0.2457
  – Learning rate: 0.0004

• Amazon HANrnn, Amazon HANconv
  – Pre-sentence-encoder dropout: 0.6
  – Pre-document-encoder dropout: 0.2
  – Classification layer dropout: 0.4
  – Learning rate: 0.0002

• Amazon HANnoenc
  – Pre-sentence-encoder dropout: 0.6
  – Pre-document-encoder dropout: 0.2
  – Classification layer dropout: 0.4
  – Learning rate: 0.0002
  – Patience: 10

• Yelp HANrnn, Yelp HANconv
  – Pre-sentence-encoder dropout: 0.7
  – Pre-document-encoder dropout: 0.1
  – Classification layer dropout: 0.7
  – Learning rate: 0.0001

• Yelp HANnoenc
  – Pre-sentence-encoder dropout: 0.7
  – Pre-document-encoder dropout: 0.1
  – Classification layer dropout: 0.7
  – Learning rate: 0.0001
  – Patience: 10

• Yahoo Answers HANnoenc
  – Pre-sentence-encoder dropout: 0.4445
  – Pre-document-encoder dropout: 0.2202
  – Classification layer dropout: 0.3749
  – Learning rate: 0.0004
  – Patience: 10

• IMDB HANconv
  – Pre-sentence-encoder dropout: 0.4445
  – Pre-document-encoder dropout: 0.2202
  – Classification layer dropout: 0.2457
  – Learning rate: 0.0004

• IMDB HANnoenc
  – Pre-sentence-encoder dropout: 0.4445
  – Pre-document-encoder dropout: 0.2202
  – Classification layer dropout: 0.2457
  – Learning rate: 0.0004
  – Patience: 10

For the FLAN models, these are the values we used:

• Yahoo Answers FLANrnn, Yahoo Answers FLANconv
  – Pre-document-encoder dropout: 0.4445
  – Classification layer dropout: 0.4457
  – Learning rate: 0.0004

• IMDB FLANrnn, IMDB FLANconv
  – Pre-document-encoder dropout: 0.4445
  – Classification layer dropout: 0.3457
  – Learning rate: 0.0004


Dataset          HANrnn   HANconv   HANnoenc   FLANrnn   FLANconv   FLANnoenc
Yahoo Answers    74.6     72.8      73.1       75.5      73.1       72.3
IMDB             50.3     48.9      46.1       49.1      48.2       45.4
Amazon           56.9     55.3      51.2       56.6      54.4       50.2
Yelp             63.0     61.0      58.6       62.3      60.7       58.2

Table 3: Classification accuracy of the different trained models on their respective test sets

• Amazon FLANrnn, Amazon FLANconv
  – Pre-document-encoder dropout: 0.6
  – Classification layer dropout: 0.4
  – Learning rate: 0.0002

• Amazon FLANnoenc
  – Pre-document-encoder dropout: 0.6
  – Classification layer dropout: 0.4
  – Learning rate: 0.0002
  – Patience: 10

• Yelp FLANrnn, Yelp FLANconv
  – Pre-document-encoder dropout: 0.7
  – Classification layer dropout: 0.7
  – Learning rate: 0.0001

• Yelp FLANnoenc
  – Pre-document-encoder dropout: 0.7
  – Classification layer dropout: 0.7
  – Learning rate: 0.0001
  – Patience: 10

• Yahoo Answers FLANnoenc
  – Pre-document-encoder dropout: 0.4445
  – Classification layer dropout: 0.4457
  – Learning rate: 0.0004
  – Patience: 10

• IMDB FLANnoenc
  – Pre-document-encoder dropout: 0.4445
  – Classification layer dropout: 0.3457
  – Learning rate: 0.0004
  – Patience: 10

Trained model classification accuracies are reported in Table 3. We note that our IMDB data and Yelp data are different sets of reviews from those used by Yang et al. (2016), so our reported performances are not directly comparable to theirs.

We were unable to reach a comparable performance for the Amazon dataset (and Yelp dataset, although different) to that in Yang et al. (2016). We suspect that this is due to not pretraining the word2vec embeddings used by the model for long enough, combined with memory limitations on our hardware that necessitated decreasing our batch size in many cases. However, as noted in section 3, the analysis that we perform does not depend on model accuracy. It's also worth noting that for the datasets for which we are able to get results that either pass or come close to the accuracies listed in the original HAN paper, the patterns we see in the results for the tests that we run are the same as the patterns that we see for the others.

A.2 Full Sets of Plots

Here we include the full sets of result plots for all models for all tests we describe in the paper, in order of appearance.

In Figure 7, we see that the majority of ∆JS values continue to fall above 0, and that most are still close to 0. One point not stated in the paper, though, is that the upswing in ∆JS values as the difference between i∗'s weight and a randomly chosen weight increases tends to occur slightly earlier for models with less contextualization, implying that the improving efficiency of the attention-based ranking at flipping the decision as contextualization scope shrinks is also reflected in single-weight test results.

Looking at where negative ∆JS values tend to occur in Figure 8, we once again see that they tend to cluster around cases where the difference between the highest and randomly chosen attention weights is close to 0. There are some exceptions, however; perhaps the most obvious are the fat tails of these counts for the Yahoo Answers HAN models.


Considering that the highest-attention-weight rankings of importance for all Yahoo Answers HAN models in Figure 10 struggle to flip the decision quickly, it may be that attention is less helpful than usual in identifying importance in its case, which could explain this discrepancy.

In Figure 9, we list contingency tables for all i∗-versus-random single-weight decision-flip tests. We continue to see higher values overall in our blue cells than orange, as described in section 4.2. The most general change we notice across all the tables is that in the encoderless case, there are more test instances (often many more) where at least one of i∗ or our random attended item flipped the decision than for any other architecture, except in the case of the Yahoo Answers FLAN. Thinking about why this might be, we recall that in the encoderless case, word embeddings are much more directly responsible for encoding a decision. Yahoo Answers is our only topic classification dataset, where keywords like "computer" or "basketball" might be much clearer indicators of a topic than, say, "like" or "love" would be indicators of a rating of 8 versus 9. This likely leads to much less certain decisions being encoded in the word embeddings of the non-Yahoo Answers datasets. For all other models, and in the case where potentially contradictory Yahoo Answers word embeddings are blended together before the final layer of attention (its HANnoenc), it is likely that decisions are simply more brittle overall.

Finally, in Figure 10, we include the full set of fraction-removed distributions for the first decision flips reached under the different rankings we explored.

A.3 Additional Tests

Besides the tests we describe in the main paper, some of the other tests that we ran provide additional insights into our results. We briefly describe those here.

In Figure 11, we provide the distributions of the original attention probability distributions that were zeroed at the point when different ranking schemes achieved their first decision flips. (Equivalently, these are the distributions of the sums of the zeroed attention weights described in Figure 10, only without repeated normalization.) We include these results to give a sense of which attention magnitudes the different rankings typically place towards the top. We notice that this probability mass required to change a decision is often quite high, which is unsurprising for the attention-based ranking, given that it frequently requires removing many items to flip decisions and attention distributions tend to have just a few high weights.

Besides that, the main takeaway that we see here is that for most models, the distribution of attention probability masses zeroed by our gradient-based ranking or our product-based ranking is often shifted down by around 0.25 or more compared to the corresponding attention probability mass distribution for the attention-based ranking, which is a fairly large difference. This would seem to imply that these alternative rankings (which usually flip decisions faster) tend to differ in relatively substantial ways from the rankings suggested by the pure attention weights, not just in the long tail of their orderings, which is another warning sign against attention's interpretability.
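For completeness, the quantity plotted in Figure 11 only changes what the erasure loop sketched above records at the first flip; a hedged variant (same assumptions as before) accumulates the original attention mass instead of counting the removed items.

import numpy as np

def mass_zeroed_at_first_flip(attn, items, ranking, decision_layer):
    # Like fraction_removed_at_first_flip, but returns the sum of the *original*
    # attention weights zeroed by the time the decision first flips; the repeated
    # renormalization affects the model's inputs but not this recorded sum.
    original = int(np.argmax(decision_layer(attn @ items)))
    remaining = attn.copy()
    mass_zeroed = 0.0
    for idx in ranking:
        mass_zeroed += float(attn[idx])
        remaining[idx] = 0.0
        if remaining.sum() == 0.0:
            break
        renormalized = remaining / remaining.sum()
        if int(np.argmax(decision_layer(renormalized @ items))) != original:
            return mass_zeroed
    return None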

The final set of tests, shown in Figures 12 and 13, consists of rerunning our single-weight decision-flip tests on the single "most important" attention weight in each attention distribution as suggested by our alternative rankings (the gradient-based and product-based rankings) instead of by attention magnitude. These results serve two functions: first, they provide still more information about when the top weight suggested by an alternative, faster-decision-flipping ranking differs from the top attention weight. Intuitively, if we observe large differences between the sum of the "yes" row for one contingency table and the "yes" rows for the other rankings' tables on that same model, this is likely due to differences in the frequencies with which the highest-ranked items achieve a decision flip, indicating differences in highest-ranked items ("likely" because of the noise added by the random sampling).
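For reference, a hypothetical sketch of how i∗g and i∗p might be selected is given below. The precise ranking definitions live in section 5.2 of the main paper; here we assume the gradient-based ranking sorts attended items by the gradient of the predicted class's score with respect to each attention weight, and the product-based ranking by that gradient multiplied by the attention weight itself. Whether signed gradients or their magnitudes are used is one of the details this sketch glosses over, and decision_layer is again a stand-in for the model's (differentiable) output layer.

import torch

def alternative_top_items(attn, items, decision_layer):
    # attn: 1-D tensor of attention weights; items: (n, d) attended representations.
    attn = attn.clone().detach().requires_grad_(True)
    scores = decision_layer(attn @ items)                  # class scores for this instance
    predicted = int(scores.argmax())
    (grads,) = torch.autograd.grad(scores[predicted], attn)
    gradient_ranking = torch.argsort(grads, descending=True)
    product_ranking = torch.argsort(grads * attn.detach(), descending=True)
    i_star_g = int(gradient_ranking[0])   # top item under the gradient-based ranking
    i_star_p = int(product_ranking[0])    # top item under the product-based ranking
    return i_star_g, i_star_p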

The second piece of information that these tests provide is a lower bound (via the sum of the "yes" rows) on the number of cases where a ranking flips a decision as quickly as possible (i.e., with its first item). For context, the sum of the "yes" row is higher than the corresponding sum in Figure 9 for all contingency tables using our product-based ordering. For the gradient-based ordering, however, this sum is actually lower than for the attention-based ranking's tables in 14 out of our 24 models. This tells us that our gradient-based method often finds fewer single-item ways of flipping decisions than the attention-based ranking, so in order to achieve the more efficient overall distribution of flips that we see for many models in Figure 10, it must usually flip decisions faster than attention in cases where both its ranking and the attention-based ranking require multiple removed weights.
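As a concrete reading of these lower bounds, using only numbers already reported in the panels below: for the Yahoo Answers HANrnn, the "yes" row sums to 0.5 + 8.7 = 9.2% of instances under the attention-based ranking (Figure 9a), to 1.6 + 7.9 = 9.5% under the gradient-based ranking (Figure 12a), and to 1.3 + 9.5 = 10.8% under the product-based ranking (Figure 13a), so for this particular model both alternative rankings flip at least as many decisions with their single top-ranked item as attention does with its highest weight.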


Figure 7: Differences in attention weight magnitude plotted against ∆JS for all datasets and architectures.


Figure 8: Counts of negative ∆JS values grouped by the difference in their corresponding attention weights for all datasets and architectures.


(a) HANrnns
                              Yahoo          IMDB           Amazon         Yelp
Remove random: flip?          Yes   No       Yes   No       Yes   No       Yes   No
Remove i∗: flip?      Yes     0.5   8.7      2.2   12.2     2.7   7.6      1.5   8.9
                      No      1.3   89.6     1.4   84.2     2.7   87.1     1.9   87.7

(b) FLANrnns
                              Yahoo          IMDB           Amazon         Yelp
Remove random: flip?          Yes   No       Yes   No       Yes   No       Yes   No
Remove i∗: flip?      Yes     0.3   9.0      0.1   14.5     0.6   7.3      0.3   8.0
                      No      0.3   90.3     0.1   85.3     0.5   91.6     0.3   91.4

(c) HANconvs
                              Yahoo          IMDB           Amazon         Yelp
Remove random: flip?          Yes   No       Yes   No       Yes   No       Yes   No
Remove i∗: flip?      Yes     1.2   3.3      3.4   14.7     4.4   13.1     3.3   11.1
                      No      2.1   93.4     2.6   79.3     5.4   77.0     4.0   81.6

(d) FLANconvs
                              Yahoo          IMDB           Amazon         Yelp
Remove random: flip?          Yes   No       Yes   No       Yes   No       Yes   No
Remove i∗: flip?      Yes     0.4   6.8      0.2   17.3     1.3   6.8      0.8   12.1
                      No      0.8   91.9     0.3   82.2     1.6   90.2     0.7   86.3

(e) HANnoencs
                              Yahoo          IMDB           Amazon         Yelp
Remove random: flip?          Yes   No       Yes   No       Yes   No       Yes   No
Remove i∗: flip?      Yes     2.5   18.2     6.7   34.7     13.8  25.8     8.4   18.0
                      No      3.7   75.7     3.2   55.4     6.0   54.3     5.2   68.4

(f) FLANnoencs
                              Yahoo          IMDB           Amazon         Yelp
Remove random: flip?          Yes   No       Yes   No       Yes   No       Yes   No
Remove i∗: flip?      Yes     0.7   5.8      0.3   27.6     2.5   24.8     1.4   14.1
                      No      1.1   92.4     0.3   71.8     1.6   71.0     1.3   83.2

Figure 9: Using the definition of i∗ given in section 4 (the highest-attention-weight attended item) and comparing to a different randomly selected attended item, these were the percentages of test instances that fell into each decision-flip indicator variable category for each of the four test sets on all models. Since we require our random item not to be i∗, we exclude any instances with a final sequence length of 1 (one sentence for the HANs, one word for the FLANs) from analysis.


Figure 10: Distribution of the fraction of attention weights that had to be removed by different ranking schemes to change each model architecture's decisions for each of the four datasets. The different rankings (aside from "Attention", which corresponds to the attention weight magnitudes in descending order) are described in section 5.2.


Figure 11: Distribution of probability masses that had to be removed by different ranking schemes to change each model architecture's decisions for each of the four datasets. While we do not discuss these in the paper due to space constraints, we notice that in most cases, a high fraction of the original attention distribution's probability mass must be zeroed before the (renormalized) modified attended representation results in a changed decision using the Attention ranking.


(a) HANrnns
                              Yahoo          IMDB           Amazon         Yelp
Remove random: flip?          Yes   No       Yes   No       Yes   No       Yes   No
Remove i∗g: flip?     Yes     1.6   7.9      2.8   13.7     4.9   10.2     3.0   10.0
                      No      0.2   90.3     0.6   82.9     0.4   84.5     0.4   86.6

(b) FLANrnns
                              Yahoo          IMDB           Amazon         Yelp
Remove random: flip?          Yes   No       Yes   No       Yes   No       Yes   No
Remove i∗g: flip?     Yes     0.5   7.2      0.1   10.9     0.9   7.1      0.4   7.4
                      No      0.2   92.2     0.1   88.9     0.2   91.8     0.1   92.1

(c) HANconvs
                              Yahoo          IMDB           Amazon         Yelp
Remove random: flip?          Yes   No       Yes   No       Yes   No       Yes   No
Remove i∗g: flip?     Yes     2.8   3.0      5.0   17.8     9.1   16.7     6.2   13.2
                      No      0.3   93.8     1.2   75.9     0.8   73.5     1.0   79.6

(d) FLANconvs
                              Yahoo          IMDB           Amazon         Yelp
Remove random: flip?          Yes   No       Yes   No       Yes   No       Yes   No
Remove i∗g: flip?     Yes     0.9   12.7     0.4   16.6     2.0   15.4     1.2   11.3
                      No      0.3   86.1     0.2   82.8     1.0   81.6     0.4   87.2

(e) HANnoencs
                              Yahoo          IMDB           Amazon         Yelp
Remove random: flip?          Yes   No       Yes   No       Yes   No       Yes   No
Remove i∗g: flip?     Yes     5.3   15.7     7.4   30.5     18.1  32.0     11.8  26.2
                      No      0.8   78.2     2.9   59.3     1.7   48.2     1.6   60.4

(f) FLANnoencs
                              Yahoo          IMDB           Amazon         Yelp
Remove random: flip?          Yes   No       Yes   No       Yes   No       Yes   No
Remove i∗g: flip?     Yes     0.7   4.1      0.3   11.1     2.9   17.0     1.2   6.2
                      No      1.1   94.1     0.4   88.2     1.2   78.8     1.4   91.2

Figure 12: Let i∗g be the highest-ranked attended item using our purely gradient-based ranking of importance described in section 5.2. We rerun our single-weight decision-flip tests using this new i∗g, comparing to a different randomly selected attended item. These were the percentages of test instances that fell into each decision-flip indicator variable category for each of the four test sets on all models. Since we require our random item not to be i∗g, we exclude any instances with a final sequence length of 1 (one sentence for the HANs, one word for the FLANs) from analysis.


(a) HANrnns
                              Yahoo          IMDB           Amazon         Yelp
Remove random: flip?          Yes   No       Yes   No       Yes   No       Yes   No
Remove i∗p: flip?     Yes     1.3   9.5      2.9   15.5     4.5   10.5     2.7   11.3
                      No      0.5   88.7     0.6   81.0     0.8   84.2     0.8   85.3

(b) FLANrnns
                              Yahoo          IMDB           Amazon         Yelp
Remove random: flip?          Yes   No       Yes   No       Yes   No       Yes   No
Remove i∗p: flip?     Yes     0.6   13.2     0.2   18.3     0.9   10.5     0.4   10.6
                      No      0.0   86.2     0.1   81.5     0.1   88.5     0.1   88.9

(c) HANconvs
                              Yahoo          IMDB           Amazon         Yelp
Remove random: flip?          Yes   No       Yes   No       Yes   No       Yes   No
Remove i∗p: flip?     Yes     2.8   5.1      5.0   20.5     8.8   18.2     6.3   16.0
                      No      0.4   91.7     0.9   73.6     1.0   72.0     0.8   76.9

(d) FLANconvs
                              Yahoo          IMDB           Amazon         Yelp
Remove random: flip?          Yes   No       Yes   No       Yes   No       Yes   No
Remove i∗p: flip?     Yes     1.1   19.3     0.5   26.7     2.7   32.7     1.3   20.4
                      No      0.0   79.6     0.0   72.8     0.2   64.3     0.2   78.1

(e) HANnoencs
                              Yahoo          IMDB           Amazon         Yelp
Remove random: flip?          Yes   No       Yes   No       Yes   No       Yes   No
Remove i∗p: flip?     Yes     5.6   23.1     9.2   42.9     18.4  33.9     11.9  27.8
                      No      0.6   70.7     0.8   47.1     1.5   46.2     1.6   58.7

(f) FLANnoencs
                              Yahoo          IMDB           Amazon         Yelp
Remove random: flip?          Yes   No       Yes   No       Yes   No       Yes   No
Remove i∗p: flip?     Yes     1.8   19.4     0.6   36.2     3.8   35.9     2.3   26.3
                      No      0.0   78.8     0.1   63.1     0.3   60.0     0.3   71.1

Figure 13: Let i∗p be the highest-ranked attended item using our attention-gradient product ranking of importance described in section 5.2. Once again, we rerun our single-weight decision-flip tests using this new i∗p, comparing to a different randomly selected attended item. These were the percentages of test instances that fell into each decision-flip indicator variable category for each of the four test sets on all models. Since we require our random item not to be i∗p, we exclude any instances with a final sequence length of 1 (one sentence for the HANs, one word for the FLANs) from analysis.