

Not Just a Black Box: Interpretable Deep Learning for Genomics

Avanti Shrikumar (1), Peyton Greenside (2), Anshul Kundaje (1,3)

(1) Stanford Computer Science, (2) Stanford Dept. of Biomedical Informatics, (3) Stanford Genetics

Our contributions

•  Novel algorithm (DeepLIFT) for explaining predictions of a given deep learning model for particular input examples

•  Novel algorithm (MoDISco) for extracting recurring patterns (motif discovery) using a deep learning model

Method: DeepLIFT (Deep Learning Important Features)

1.  Alipanahi, B., Delong, A., Weirauch, M., & Frey, B. (2015). Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol.
2.  Zhou, J., & Troyanskaya, O. (2015). Predicting effects of noncoding variants with deep learning–based sequence model. Nature Methods.
3.  Kelley, D., Snoek, J., & Rinn, J. (2015). Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks. doi:10.1101/028399
4.  Heinz, S., et al. (2016). Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities.
5.  Lim, L. S., et al. (2016). The pluripotency regulator Zic3 is a direct activator of the Nanog promoter in ESCs.
6.  Gagliardi, A., et al. (2016). A direct physical interaction between Nanog and Sox2 regulates embryonic stem cell self-renewal. PubMed, NCBI. Retrieved 30 January 2016, from https://www.ncbi.nlm.nih.gov/pubmed/23892456
7.  Kheradpour, P., & Kellis, M. (2014). Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments. Nucleic Acids Research.

Results of logistic regression model trained to predict Nanog binding using the top 3 motif hits, per motif, per region

Visualizing individual pattern-detectors: DeepBind (Alipanahi et al.)

Superior motif discovery for Nanog
Positive set: 5,473 reproducible Nanog peaks in H1-ESC from ENCODE
Negative set: 258,987 H1-ESC DNase-seq peaks
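The evaluation named here (a logistic regression on the top 3 motif hits per motif per region, scored by auROC) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the zero-padding of regions with fewer than 3 hits, and the toy scores are hypothetical and are not taken from the poster. The feature vectors would be passed to any logistic regression implementation, which is omitted here.

```python
def top_hits(hit_scores, k=3):
    """Return the k largest hit scores for one motif in one region.

    Assumption: regions with fewer than k hits are padded with 0.0.
    """
    best = sorted(hit_scores, reverse=True)[:k]
    return best + [0.0] * (k - len(best))


def region_features(hits_by_motif, motif_names, k=3):
    """Concatenate the top-k hit scores of every motif into one feature vector."""
    features = []
    for name in motif_names:
        features.extend(top_hits(hits_by_motif.get(name, []), k))
    return features


def auroc(labels, scores):
    """auROC as the probability that a random positive outranks a random
    negative, counting ties as half a win."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy usage with made-up hit scores and model outputs.
example_features = region_features(
    {"Nanog": [0.2, 0.9, 0.5, 0.7], "Sox2": [0.3]}, ["Nanog", "Sox2"]
)
example_auroc = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(example_features, example_auroc)
```

The pairwise auROC is quadratic in the number of regions, so with a negative set of the size quoted here a rank-based formulation would be the practical choice.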

Method: MoDISco (Motif Discovery from Importance Scores)

Compute behaviour under “reference”:

i1 = 0, i2 = 0
h1 = max(0, i1 + 2*i2 + 1) = 1
h2 = max(0, i1 + 2*i2 - 1) = 0
y = h1 + h2 = 1

Use difference from reference to find importance scores:

i1 = -1, i2 = -1
h1 = max(0, i1 + 2*i2 + 1) = 0
h2 = max(0, i1 + 2*i2 - 1) = 0
y = h1 + h2 = 0

Gradients assign an importance of 0 to both inputs in the latter case, as the gradients of h1 and h2 are 0. Using difference-from-reference, we see h1 is 1 below its reference value; DeepLIFT assigns an importance of -1/3 to i1 and -2/3 to i2.
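The toy network above can be checked numerically. The sketch below is an illustration for this two-unit example only, not the DeepLIFT implementation: it assumes each unit's output change is split across inputs in proportion to their share of the change in that unit's pre-activation, and all function names are hypothetical.

```python
def relu(x):
    return max(0.0, x)


# Each hidden unit is (weights on (i1, i2), bias), as in the toy network.
UNITS = [((1.0, 2.0), 1.0), ((1.0, 2.0), -1.0)]


def pre_activation(unit, inputs):
    weights, bias = unit
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def output(inputs):
    """y = h1 + h2."""
    return sum(relu(pre_activation(u, inputs)) for u in UNITS)


def gradient_scores(inputs):
    """Gradient of y with respect to (i1, i2)."""
    grads = [0.0, 0.0]
    for unit in UNITS:
        if pre_activation(unit, inputs) > 0:  # ReLU passes gradient only when active
            for k, w in enumerate(unit[0]):
                grads[k] += w
    return grads


def reference_scores(inputs, reference):
    """Difference-from-reference scores for (i1, i2)."""
    scores = [0.0, 0.0]
    for unit in UNITS:
        pre_in = pre_activation(unit, inputs)
        pre_ref = pre_activation(unit, reference)
        if pre_in == pre_ref:
            continue
        # Ratio of the unit's output change to its pre-activation change.
        multiplier = (relu(pre_in) - relu(pre_ref)) / (pre_in - pre_ref)
        for k, w in enumerate(unit[0]):
            scores[k] += w * (inputs[k] - reference[k]) * multiplier
    return scores


reference = (0.0, 0.0)
inputs = (-1.0, -1.0)
grad = gradient_scores(inputs)
scores = reference_scores(inputs, reference)
print("gradients:", grad)
print("difference-from-reference scores:", scores)
```

The two scores sum to the change in y between the reference and the input, which the gradient-based scores do not.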

Results (DeepLIFT)

Reveal context-specific use of regulatory sequence

[Figure. Panel labels: Erythroid; B-cells. Motif annotations: Gata; Gata (Rev. Comp.); SPI1. Peak annotations: Gata1 ChIP-seq peak; SPI1 ChIP-seq peak; No SPI1 peak; No Gata1 ChIP-seq peak.]

Model architecture overview

Input: DNA sequence represented as ones and zeros

Sequence: C G A T A A C C G A T A T

     A C G T
C    0 1 0 0
G    0 0 1 0
A    1 0 0 0
T    0 0 0 1
A    1 0 0 0
A    1 0 0 0
C    0 1 0 0
C    0 1 0 0
G    0 0 1 0
A    1 0 0 0
T    0 0 0 1
A    1 0 0 0
T    0 0 0 1

Learned pattern detectors

Later layers build on patterns of previous layer

“Fully connected” layers incorporate all info together

Output: Accessible (+1) vs not accessible (0)
Outputs shown: Accessible in Erythroid; Accessible in B-cells
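The input encoding described in this overview (one row per base, columns in A, C, G, T order) can be reproduced with a short sketch. The function name and the choice to reject any character outside A, C, G, T are assumptions made for illustration.

```python
BASES = "ACGT"  # column order used in the matrix


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element rows of ones and zeros."""
    rows = []
    for base in sequence.upper():
        if base not in BASES:
            # Assumption: ambiguous bases such as N are rejected, not zero-filled.
            raise ValueError("unexpected base: " + base)
        rows.append([1 if base == b else 0 for b in BASES])
    return rows


sequence = "CGATAACCGATAT"
encoded = one_hot(sequence)
for base, row in zip(sequence, encoded):
    print(base, "".join(str(v) for v in row))
```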

Computer vision

[Figure: logistic regression auROC (axis from 0.0 to 1.0) for five motif sets: All 5 ENCODE Nanog motifs; Canonical HOMER matches to 4 MoDISco motifs; All 32 de-novo HOMER motifs; Top 4 de-novo HOMER motifs; All 4 MoDISco motifs.]

(a) Obtain per-base importance scores using DeepLIFT

(b) Segment to “seqlets” of high importance

(c) “Autocomplete” seqlets using DeepLIFT information

(d) Compute distances between pairs of seqlets via cross-correlation

(e) Cluster seqlets using pairwise distances

(f) Aggregate clusters
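Steps (b) and (d) above can be sketched as follows. This is a simplified illustration, not the MoDISco implementation: it assumes a single importance value per base (rather than a per-nucleotide track), a fixed seqlet width, a greedy non-overlapping window selection, and a distance defined as one minus the best normalized cross-correlation. All names and thresholds are hypothetical.

```python
import math


def find_seqlets(scores, width, threshold):
    """Step (b): greedily pick non-overlapping windows whose summed absolute
    importance is at least `threshold`, highest-scoring windows first."""
    windows = []
    for start in range(len(scores) - width + 1):
        total = sum(abs(s) for s in scores[start:start + width])
        if total >= threshold:
            windows.append((total, start))
    chosen = []
    for _, start in sorted(windows, reverse=True):
        if all(abs(start - other) >= width for other in chosen):
            chosen.append(start)
    return [(start, scores[start:start + width]) for start in sorted(chosen)]


def max_cross_correlation(a, b):
    """Largest normalized dot product of `a` and `b` over all relative shifts."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0
    best = -1.0
    for shift in range(-(len(b) - 1), len(a)):
        dot = sum(
            a[i] * b[i - shift]
            for i in range(len(a))
            if 0 <= i - shift < len(b)
        )
        best = max(best, dot / norm)
    return best


def seqlet_distance(a, b):
    """Step (d): distance between two seqlets via cross-correlation."""
    return 1.0 - max_cross_correlation(a, b)


# Toy per-base importance track containing the same pattern twice.
track = [0.0, 0.1, 0.9, 1.0, 0.8, 0.0, 0.0, 0.1, 0.0, 0.9, 1.0, 0.8, 0.1, 0.0]
seqlets = find_seqlets(track, width=3, threshold=2.5)
distance = seqlet_distance(seqlets[0][1], seqlets[1][1])
print(seqlets, distance)
```

The resulting pairwise distances could then be handed to any clustering routine for step (e).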

Motivation

•  Regulatory sequence involves complex hierarchical patterns that are difficult for existing computational methods to model

•  Deep Learning techniques show great promise in this area [1-3] but are considered uninterpretable “Black Boxes”, limiting their usefulness for making biological discoveries

Example problem

•  Goal: learn key regulatory sequences governing hematopoiesis
•  Approach:

(1) Experimentally identify biochemically active regions in different cell lines of the hematopoietic lineage
(2) Train deep learning model to predict activity from sequence
(3) Interpret the model to learn key regulatory sequences

Peyton Greenside

Results (MoDISco)

4 motif clusters identified by MoDISco:

[Figure: 4/5 ENCODE Nanog motifs, each beside the corresponding MoDISco motif. Labels: Zic3; Sox2; Oct4-Sox2-Nanog; Nanog.]

Co-binding between Zic3 and Nanog?

Protein-protein interaction:

[Figure panels: Fusion motif from subclustering; Individual examples; Zic3 and Nanog separation; Oct-Sox-Nanog and Nanog separation; Shuffled Zic3 and Nanog separation. Legend: ++ and -- orientation; +- and -+ orientation.]

[Figure. Columns: original; scores: 8; scores: 3; masked, 8->3; scores: 6; masked, 8->6. Rows (methods): |grad| (Simonyan); Guided Backprop; gradient*input; integrated grads-10; DeepLIFT-RevealCancel.]

Proof-of-concept: morphing an “8” to a 3 or a 6

A deep learning model is trained to recognize handwritten digits from the MNIST database. Pixels are ranked by the difference in importance between the original class (e.g. 8) and the target class (e.g. 3 or 6), as computed by each method. Up to 20% of the pixels that are more important to the original class than to the target class are erased.
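The masking procedure in this caption can be sketched as below. This is a minimal illustration, assuming a flattened image, one importance score per pixel for each class, and erasure by setting a pixel to 0.0; the function name, the erased value, and the toy scores are hypothetical.

```python
def mask_pixels(image, scores_original, scores_target,
                max_fraction=0.2, erased_value=0.0):
    """Erase up to `max_fraction` of the pixels, choosing those whose importance
    for the original class most exceeds their importance for the target class."""
    candidates = []
    for idx, (s_orig, s_target) in enumerate(zip(scores_original, scores_target)):
        diff = s_orig - s_target
        if diff > 0:  # only pixels more important to the original class
            candidates.append((diff, idx))
    candidates.sort(reverse=True)
    budget = int(max_fraction * len(image))
    masked = list(image)
    for _, idx in candidates[:budget]:
        masked[idx] = erased_value
    return masked


# Toy 10-pixel "image" with made-up importance scores.
image = [1.0] * 10
scores_original = [5.0, 1.0, 4.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
scores_target = [0.0, 2.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
masked = mask_pixels(image, scores_original, scores_target)
print(masked)
```

Running the same procedure with scores from different attribution methods and comparing the classifier's output on the masked images gives the comparison described in the caption.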

By considering different orders for positive and negative terms, can also improve assignment of importance scores:

“AND”/min operation:

Network: h1 = i1 - i2 (weights 1 and -1), h2 = max(0, h1), y = i1 - h2 (weights 1 and -1)

y = i1 - max(0, i1 - i2) = min(i1, i2)

i1    i2         y
i1    i2 < i1    i1 - (i1 - i2) = i2
i1    i2 > i1    i1 - 0 = i1

y = min(i1, i2) → gradient 0 for either i1 or i2

Consider i1 = 10, i2 = 6:

y = i1 - max(0, i1 - i2) = 10 - max(0, 4) = 6

Breakdowns of max(0, i1 - i2) = 4:
Standard breakdown: 4 = (10 from i1) + (-6 from i2)
Other possible breakdown: 4 = (4 from i1) + (0 from i2)
Average: 4 = (7 from i1) + (-3 from i2)

Resulting breakdowns of y:
Standard breakdown: y = (10 from i1) - [(10 from i1) + (-6 from i2)] = 6 from i2
Average over both orders: y = (10 from i1) - [(7 from i1) + (-3 from i2)] = (3 from i1) + (3 from i2)
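The order-averaging arithmetic for the min example (i1 = 10, i2 = 6, y = i1 - max(0, i1 - i2)) can be reproduced with the sketch below. It assumes a reference of 0 for both inputs and simply averages the two possible orders in which the positive term (+i1) and the negative term (-i2) are added; the function names are hypothetical and this is not the full DeepLIFT-RevealCancel rule.

```python
def relu(x):
    return max(0.0, x)


def ordered_contributions(f, terms, order):
    """Add `terms` to a running total (starting from the reference, 0) in the
    given order; each term's contribution is the change in f when it is added."""
    total = 0.0
    contributions = [0.0] * len(terms)
    for k in order:
        before = f(total)
        total += terms[k]
        contributions[k] = f(total) - before
    return contributions


i1, i2 = 10.0, 6.0
terms = [i1, -i2]  # positive and negative terms feeding h1 = i1 - i2

standard = ordered_contributions(relu, terms, [0, 1])  # i1 first, then i2
other = ordered_contributions(relu, terms, [1, 0])     # i2 first, then i1
average = [(a + b) / 2 for a, b in zip(standard, other)]

# y = i1 - max(0, i1 - i2): i1 also contributes directly with weight 1.
y_from_i1 = i1 - average[0]
y_from_i2 = -average[1]
print(standard, other, average, y_from_i1, y_from_i2)
```

Under the averaged breakdown both inputs receive a nonzero share of y = min(i1, i2), whereas the standard breakdown attributes everything to i2.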