[Figure: histogram over adversarial examples, showing the number of words changed per example]
Crafting Adversarial Attacks on Recurrent Neural Networks (RNNs)
Mark Anderson, Andrew Bartolo, Pulkit Tandon
{mark01,bartolo,tpulkit}@stanford.edu
Summary
• RNNs are used in a variety of applications to recognize and predict sequential data. However, they are vulnerable to adversaries; e.g., a cleverly placed word may change the predicted sentiment of a movie review from positive to negative.
• We built Naïve Bayes, SVM, and LSTM models to predict movie-review sentiment, and built two black-box adversaries. We show that NB and SVM are sensitive to these attacks, while LSTMs are relatively robust.
• Finally, we implemented a recent Jacobian-based technique for generating adversaries for the LSTM, and found that LSTM accuracy falls below 40% after replacing an average of 8.7 words per review. We also found examples where the misclassification was brought on by a seemingly random word, indicating that the LSTM might not be truly learning sentiment.

Data & Features
We train on a pre-labeled set of 12,500 positive and 12,500 negative movie reviews collected from IMDb [1]. Reviews average 233 words. For compatibility with the NumPy and TensorFlow input models, SVM and LSTM reviews are capped at 250 words. We strip all punctuation from the reviews but leave stopwords.
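The preprocessing described above can be sketched as follows; the exact tokenizer the project used is not shown on the poster, so this minimal version (lowercasing, punctuation stripping, 250-word cap) is an illustrative assumption:

```python
import re

MAX_WORDS = 250  # cap used for the SVM and LSTM input models

def preprocess(review):
    """Lowercase, strip punctuation, keep stopwords, cap at MAX_WORDS tokens."""
    text = re.sub(r"[^a-z0-9\s]", "", review.lower())  # drop punctuation only
    return text.split()[:MAX_WORDS]

words = preprocess("The movie is terrific!")
```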
Models
[Figure: training accuracy vs. # iterations, 64- and 128-hidden-unit LSTM]
The Word2Vec + LSTM architecture [3]:
• Single-layer RNN with LSTM cells
• Linear SVM
• Naïve Bayes with Laplace smoothing
[Figure: PCA run over the dataset]

Features:
1. Bag-of-words: a one-hot vector the size of the dictionary (400k words). Used for the Naïve Bayes and SVM models.
2. Word vectors [2]: a pre-determined embedding in 50-dimensional space. Used for the LSTM model.
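A minimal sketch of the bag-of-words featurization for the NB and SVM models; the toy five-word dictionary stands in for the real 400k-word one:

```python
import numpy as np

# Toy dictionary standing in for the 400k-word vocabulary (assumption).
dictionary = {"the": 0, "movie": 1, "is": 2, "terrific": 3, "terrible": 4}

def bag_of_words(tokens):
    """Binary bag-of-words vector over the dictionary, one entry per word."""
    v = np.zeros(len(dictionary))
    for t in tokens:
        if t in dictionary:
            v[dictionary[t]] = 1.0
    return v

v = bag_of_words(["the", "movie", "is", "terrific"])
```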
Jacobian Saliency Map Adversary [3]

Input: f (prediction model), x (example sentence), D (dictionary)

Algorithm:
1. y := f(x)
2. x* := x
3. J_f(x)[y] := ∂f_y / ∂x
4. while f(x*) == y:
5.     select a word i in sequence x*
6.     w := argmin_{z ∈ D} ||sign(x*[i] − z) − sign(J_f(x)[i, y])||
7.     x*[i] := w
8. end
9. return x*

[Figure: worked example on x = "The movie is terrific", perturbing x[2] = "movie" in the direction −sign(J_f(x)[i, y])]

                 y = Pos   y = Neg
True Positive       54        65
True Negative      110        62
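The word-swap step of this algorithm can be sketched in Python. The three-word dictionary of 2-D embeddings, the linear scorer `f`, and the hand-supplied Jacobian row are illustrative assumptions, not the project's actual LSTM and gradients:

```python
import numpy as np

# Toy stand-ins (assumptions): 2-D word embeddings and a linear scorer.
D = {"good": np.array([1.0, 0.5]),
     "bad":  np.array([-1.0, -0.5]),
     "ok":   np.array([0.1, 0.0])}
w_clf = np.array([1.0, 1.0])  # hypothetical linear classifier weights

def f(x):
    """Predict 1 (positive) if the summed embedding score is positive."""
    return int(sum(e @ w_clf for e in x) > 0)

def jsma_step(x, i, jac_row):
    """Step 6 of the loop: pick the dictionary word whose swap direction
    best matches sign(J_f(x)[i, y])."""
    return min(D, key=lambda wd: np.linalg.norm(
        np.sign(x[i] - D[wd]) - np.sign(jac_row)))

x = [D["good"], D["good"]]   # "good good" -> classified positive
y = f(x)
jac_row = w_clf              # gradient of the positive score w.r.t. x[i]
swap = jsma_step(x, 0, jac_row)   # word to substitute at position 0
```

Replacing `x[0]` with the selected word flips the toy classifier's output, mirroring the while-loop's exit condition `f(x*) != y`.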
References
[1] A. Maas, R. Daly, P. Pham, D. Huang, A. Ng, and C. Potts, "Learning Word Vectors for Sentiment Analysis," in Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 142-150.
[2] A. Deshpande, "Sentiment Analysis with LSTMs," Oct. 3, 2017. [Online]. Available: https://github.com/adeshpande3/LSTM-Sentiment-Analysis.
[3] N. Papernot, P. McDaniel, A. Swami, and R. Harang, "Crafting Adversarial Input Sequences for Recurrent Neural Networks," Apr. 28, 2016.
[Figure: histogram of adversarial samples; average # words changed: 8.7]

Model Accuracy vs. Adversary

Adversary                  Naïve Bayes     SVM      LSTM
Training                      89.24%     98.19%    94.36%
Testing (no adversary)        80.98%     86.51%    81.77%
Testing, tack-on              69.64%     81.65%    79.08%
Testing, 1-strongest          64.27%     80.49%    77.76%
Testing, 3-strongest          31.66%     54.34%    68.17%
Testing, 5-strongest          14.05%     32.95%    59.36%
Testing, JSMA                    -          -      39.86%
Analysis
• SVM and NB perform similarly to the LSTM on the test set without an adversary. This implies the data is well-separated, as seen independently in the PCA plot.
• The LSTM is the most robust to our black-box adversaries.
• The black-box adversaries were words strongly associated with sentiment.
• Model accuracies fell monotonically with increasing adversary strength.
• Jacobian-based methods do not always change the most positive/negative words. Seemingly random word injection changes the prediction, leading us to question whether LSTMs are actually learning the sentiment; e.g.: "This excellent movie made me cry!" → "this excellent tsunga telsim grrr cry"
Future Work
• Implement a deeper LSTM with mean-pooling layers
• Optimize memory allocation in the TensorFlow code for the JSMA method
• Adversarially train the LSTM network on JSMA-generated adversaries
• Use the Stanford NLP Parser to automate grammar checking
Intuitive Black-Box Adversaries
• Based on Naïve Bayes' "strongest" words: the words most polarizing toward positive or negative classification
• Adversarial words:
  • Positive sway: "edie," "antwone," "din," "gunga," "yokai"
  • Negative sway: "boll," "410," "uwe," "tashan," "hobgoblins"
• Tack-on: replace the first word with a random adversarial word
• N-strongest-word-swap: replace the review's N strongest word(s) with random adversarial word(s); we experimented with N ≤ 5
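The two black-box attacks can be sketched as follows; the sample review and the assumption that the strongest-word indices are precomputed from the Naïve Bayes weights are illustrative:

```python
import random

# Polarizing words taken from the Naive Bayes analysis above.
POSITIVE_SWAY = ["edie", "antwone", "din", "gunga", "yokai"]
NEGATIVE_SWAY = ["boll", "410", "uwe", "tashan", "hobgoblins"]

def tack_on(tokens, adversarial_words):
    """Tack-on attack: overwrite the first word with a random adversarial word."""
    out = list(tokens)
    out[0] = random.choice(adversarial_words)
    return out

def n_strongest_swap(tokens, strongest_indices, adversarial_words, n=5):
    """N-strongest-word-swap: replace the review's N strongest words
    (indices assumed precomputed from the NB weights) with adversarial words."""
    out = list(tokens)
    for i in strongest_indices[:n]:
        out[i] = random.choice(adversarial_words)
    return out

review = ["this", "movie", "was", "terrific"]
attacked = n_strongest_swap(review, strongest_indices=[3],
                            adversarial_words=NEGATIVE_SWAY, n=1)
```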
We performed a hyperparameter search and settled on an LSTM with a softmax output layer and 64 hidden units. For the linear SVM, we swept the learning rate and tried different features and kernels. The Naïve Bayes model is multinomial and uses log-probabilities.
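A toy sketch of a multinomial Naïve Bayes model with Laplace smoothing and log-probabilities, as described above; the tiny corpus and three-word vocabulary are illustrative assumptions:

```python
import numpy as np

def train_multinomial_nb(X, y, alpha=1.0):
    """Multinomial Naive Bayes with Laplace (add-one) smoothing, in log space.
    X: (n_docs, vocab) word-count matrix; y: 0/1 class labels."""
    log_prior, log_lik = [], []
    for c in (0, 1):
        Xc = X[y == c]
        log_prior.append(np.log(len(Xc) / len(X)))
        counts = Xc.sum(axis=0) + alpha          # Laplace smoothing
        log_lik.append(np.log(counts / counts.sum()))
    return np.array(log_prior), np.array(log_lik)

def predict(x, log_prior, log_lik):
    """argmax over classes of log P(c) + sum_j x_j * log P(word_j | c)."""
    return int(np.argmax(log_prior + log_lik @ x))

# Tiny corpus: two negative docs (class 0), two positive docs (class 1).
X = np.array([[2, 0, 1], [1, 0, 2], [0, 2, 1], [0, 3, 0]])
y = np.array([0, 0, 1, 1])
lp, ll = train_multinomial_nb(X, y)
pred = predict(np.array([0, 2, 0]), lp, ll)  # doc dominated by word 1
```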
LSTM accuracy after JSMA: 39.9%