PACIFIC SYMPOSIUM ON BIOCOMPUTING 2019 ABSTRACT BOOK

Poster Presenters: Poster space is assigned by abstract page number. Please find the page that your abstract is on and put your poster on the poster board with the corresponding number (e.g., if your abstract is on page 50, put your poster on board #50).

Proceedings papers with oral presentations #2-29 are not assigned poster space.

Abstracts are organized first by session, then the last name of the first author. Presenting authors’ names are in bold text.

TABLE OF CONTENTS

PROCEEDINGS PAPERS WITH ORAL PRESENTATION

PATTERN RECOGNITION IN BIOMEDICAL DATA: CHALLENGES IN PUTTING BIG DATA TO WORK ..... 1

THE EFFECTIVENESS OF MULTITASK LEARNING FOR PHENOTYPING WITH ELECTRONIC HEALTH RECORDS DATA ..... 2
Daisy Yi Ding, Chloe Simpson, Stephen Pfohl, Dave C. Kale, Kenneth Jung, Nigam H. Shah

ODAL: A ONE-SHOT DISTRIBUTED ALGORITHM TO PERFORM LOGISTIC REGRESSIONS ON ELECTRONIC HEALTH RECORDS DATA FROM MULTIPLE CLINICAL SITES ..... 3
Rui Duan, Mary Regina Boland, Jason H. Moore, Yong Chen

PVC DETECTION USING A CONVOLUTIONAL AUTOENCODER AND RANDOM FOREST CLASSIFIER ..... 4
Max Gordon, Cranos Williams

PLATYPUS: A MULTIPLE-VIEW LEARNING PREDICTIVE FRAMEWORK FOR CANCER DRUG SENSITIVITY PREDICTION ..... 5
Kiley Graim, Verena Friedl, Kathleen E. Houlahan, Joshua M. Stuart

DEEPDOM: PREDICTING PROTEIN DOMAIN BOUNDARY FROM SEQUENCE ALONE USING STACKED BIDIRECTIONAL LSTM ..... 6
Yuexu Jiang, Duolin Wang, Dong Xu

IMPLEMENTING AND EVALUATING A GAUSSIAN MIXTURE FRAMEWORK FOR IDENTIFYING GENE FUNCTION FROM TNSEQ DATA ..... 7
Kevin Li, Rachel Chen, William Lindsey, Aaron Best, Matthew DeJongh, Christopher Henry, Nathan Tintle

RES2S2AM: DEEP RESIDUAL NETWORK-BASED MODEL FOR IDENTIFYING FUNCTIONAL NONCODING SNPS IN TRAIT-ASSOCIATED REGIONS ..... 8
Zheng Liu, Yao Yao, Qi Wei, Benjamin Weeder, Stephen A. Ramsey

BI-DIRECTIONAL RECURRENT NEURAL NETWORK MODELS FOR GEOGRAPHIC LOCATION EXTRACTION IN BIOMEDICAL LITERATURE ..... 9
Arjun Magge, Davy Weissenbacher, Abeed Sarker, Matthew Scotch, Graciela Gonzalez-Hernandez

COMPUTATIONAL KIR COPY NUMBER DISCOVERY REVEALS INTERACTION BETWEEN INHIBITORY RECEPTOR BURDEN AND SURVIVAL ..... 10
Rachel M. Pyke, Raphael Genolet, Alexandre Harari, George Coukos, David Gfeller, Hannah Carter

SEMANTIC WORKFLOWS FOR BENCHMARK CHALLENGES: ENHANCING COMPARABILITY, REUSABILITY AND REPRODUCIBILITY ..... 11
Arunima Srivastava, Ravali Adusumilli, Hunter Boyce, Daniel Garijo, Varun Ratnakar, Rajiv Mayani, Thomas Yu, Raghu Machiraju, Yolanda Gil, Parag Mallick

REMOVING CONFOUNDING FACTORS ASSOCIATED WEIGHTS IN DEEP NEURAL NETWORKS IMPROVES THE PREDICTION ACCURACY FOR HEALTHCARE APPLICATIONS ..... 12
Haohan Wang, Zhenglin Wu, Eric P. Xing

PRECISION MEDICINE: IMPROVING HEALTH THROUGH HIGH-RESOLUTION ANALYSIS OF PERSONAL DATA ..... 13

AN OPTIMAL POLICY FOR PATIENT LABORATORY TESTS IN INTENSIVE CARE UNITS ..... 14
Li-Fang Cheng, Niranjani Prasad, Barbara E. Engelhardt

CROWDVARIANT: A CROWDSOURCING APPROACH TO CLASSIFY COPY NUMBER VARIANTS ..... 15
Peyton Greenside, Justin Zook, Marc Salit, Ryan Poplin, Madeleine Cule, Mark DePristo

A REPOSITORY OF MICROBIAL MARKER GENES RELATED TO HUMAN HEALTH AND DISEASES FOR HOST PHENOTYPE PREDICTION USING MICROBIOME DATA ..... 16
Wontack Han, Yuzhen Ye

AICM: A GENUINE FRAMEWORK FOR CORRECTING INCONSISTENCY BETWEEN LARGE PHARMACOGENOMICS DATASETS ..... 17
Zhiyue Tom Hu, Yuting Ye, Patrick A. Newbury, Haiyan Huang, Bin Chen

INTEGRATING RNA EXPRESSION AND VISUAL FEATURES FOR IMMUNE INFILTRATE PREDICTION ..... 18
Derek Reiman, Lingdao Sha, Irvin Ho, Timothy Tan, Denise Lau, Aly A. Khan

OUTGROUP MACHINE LEARNING APPROACH IDENTIFIES SINGLE NUCLEOTIDE VARIANTS IN NONCODING DNA ASSOCIATED WITH AUTISM SPECTRUM DISORDER ..... 19
Maya Varma, Kelley Marie Paskov, Jae-Yoon Jung, Brianna Sierra Chrisman, Nate Tyler Stockham, Peter Yigitcan Washington, Dennis Paul Wall

PRECISION DRUG REPURPOSING VIA CONVERGENT EQTL-BASED MOLECULES AND PATHWAY TARGETING INDEPENDENT DISEASE-ASSOCIATED POLYMORPHISMS ..... 20
Francesca Vitali, Joanne Berghout, Jungwei Fan, Jianrong Li, Qike Li, Haiquan Li, Yves A. Lussier

DETECTING POTENTIAL PLEIOTROPY ACROSS CARDIOVASCULAR AND NEUROLOGICAL DISEASES USING UNIVARIATE, BIVARIATE, AND MULTIVARIATE METHODS ON 43,870 INDIVIDUALS FROM THE EMERGE NETWORK ..... 21
Xinyuan Zhang, Yogasudha Veturi, Shefali S. Verma, William Bone, Anurag Verma, Anastasia M. Lucas, Scott Hebbring, Joshua C. Denny, Ian Stanaway, Gail P. Jarvik, David Crosslin, Eric B. Larson, Laura Rasmussen-Torvik, Sarah A. Pendergrass, Jordan W. Smoller, Hakon Hakonarson, Patrick Sleiman, Chunhua Weng, David Fasel, Wei-Qi Wei, Iftikhar Kullo, Daniel Schaid, Wendy K. Chung, Marylyn D. Ritchie

SINGLE CELL ANALYSIS – WHAT IS THE FUTURE? ..... 22

LISA: ACCURATE RECONSTRUCTION OF CELL TRAJECTORY AND PSEUDO-TIME FOR MASSIVE SINGLE CELL RNA-SEQ DATA ..... 23
Yang Chen, Yuping Zhang, Zhengqing Ouyang

PARAMETER TUNING IS A KEY PART OF DIMENSIONALITY REDUCTION VIA DEEP VARIATIONAL AUTOENCODERS FOR SINGLE CELL RNA TRANSCRIPTOMICS ..... 24
Qiwen Hu, Casey S. Greene

TOPOLOGICAL METHODS FOR VISUALIZATION AND ANALYSIS OF HIGH DIMENSIONAL SINGLE-CELL RNA SEQUENCING DATA ..... 25
Tongxin Wang, Travis Johnson, Jie Zhang, Kun Huang

WHEN BIOLOGY GETS PERSONAL: HIDDEN CHALLENGES OF PRIVACY AND ETHICS IN BIOLOGICAL BIG DATA ..... 26

LEVERAGING SUMMARY STATISTICS TO MAKE INFERENCES ABOUT COMPLEX PHENOTYPES IN LARGE BIOBANKS ..... 27
Angela Gasdaska, Derek Friend, Rachel Chen, Jason Westra, Matthew Zawistowski, William Lindsey, Nathan Tintle

EVALUATION OF PATIENT RE-IDENTIFICATION USING LABORATORY TEST ORDERS AND MITIGATION VIA LATENT SPACE VARIABLES ..... 28
Kipp W. Johnson, Jessica K. De Freitas, Benjamin S. Glicksberg, Jason R. Bobe, Joel T. Dudley

PROTECTING GENOMIC DATA PRIVACY WITH PROBABILISTIC MODELING ..... 29
Sean Simmons, Bonnie Berger, Cenk Sahinalp

PROCEEDINGS PAPERS WITH POSTER PRESENTATIONS

PATTERN RECOGNITION IN BIOMEDICAL DATA: CHALLENGES IN PUTTING BIG DATA TO WORK ..... 30

SNPS2CHIP: LATENT FACTORS OF CHIP-SEQ TO INFER FUNCTIONS OF NON-CODING SNPS ..... 31
Shankara Anand, Laurynas Kalesinskas, Craig Smail, Yosuke Tanigawa

DNA STEGANALYSIS USING DEEP RECURRENT NEURAL NETWORKS ..... 32
Ho Bae, Byunghan Lee, Sunyoung Kwon, Sungroh Yoon

LEARNING CONTEXTUAL HIERARCHICAL STRUCTURE OF MEDICAL CONCEPTS WITH POINCAIRÉ EMBEDDINGS TO CLARIFY PHENOTYPES ..... 33
Brett K. Beaulieu-Jones, Isaac S. Kohane, Andrew L. Beam

EXPLORING MICRORNA REGULATION OF CANCER WITH CONTEXT-AWARE DEEP CANCER CLASSIFIER ..... 34
Blake Pyman, Alireza Sedghi, Shekoofeh Azizi, Kathrin Tyryshkin, Neil Renwick, Parvin Mousavi

ESTIMATING CLASSIFICATION ACCURACY IN POSITIVE-UNLABELED LEARNING: CHARACTERIZATION AND CORRECTION STRATEGIES ..... 35
Rashika Ramola, Shantanu Jain, Predrag Radivojac

EXTRACTING ALLELIC READ COUNTS FROM 250,000 HUMAN SEQUENCING RUNS IN SEQUENCE READ ARCHIVE ..... 36
Brian Tsui, Michelle Dow, Dylan Skola, Hannah Carter

AUTOMATIC HUMAN-LIKE MINING AND CONSTRUCTING RELIABLE GENETIC ASSOCIATION DATABASE WITH DEEP REINFORCEMENT LEARNING ..... 37
Haohan Wang, Xiang Liu, Yifeng Tao, Wenting Ye, Qiao Jin, William W. Cohen, Eric P. Xing

PRECISION MEDICINE: IMPROVING HEALTH THROUGH HIGH-RESOLUTION ANALYSIS OF PERSONAL DATA ..... 38

INFLUENCE OF TISSUE CONTEXT ON GENE PRIORITIZATION FOR PREDICTED TRANSCRIPTOME-WIDE ASSOCIATION STUDIES ..... 39
Binglan Li, Yogasudha Veturi, Yuki Bradford, Shefali S. Verma, Anurag Verma, Anastasia M. Lucas, David W. Haas, Marylyn D. Ritchie

SINGLE CELL ANALYSIS – WHAT IS THE FUTURE? ..... 40

SHALLOW SPARSELY-CONNECTED AUTOENCODERS FOR GENE SET PROJECTION ..... 41
Maxwell P. Gold, Alexander LeNail, Ernest Fraenkel

WHEN BIOLOGY GETS PERSONAL: HIDDEN CHALLENGES OF PRIVACY AND ETHICS IN BIOLOGICAL BIG DATA ..... 42

IMPLEMENTING A UNIVERSAL INFORMED CONSENT PROCESS FOR THE ALL OF US RESEARCH PROGRAM ..... 43
Megan Doerr, Shira Grayson, Sarah Moore, Christine Suver, John Wilbanks, Jennifer Wagner

POSTER PRESENTATIONS

GENERAL ..... 44

A CONVOLUTIONAL NEURAL NET PREDICTS BINDING PROPERTIES OF AN ANTIBODY LIBRARY ..... 45
Rishi Bedi, Rachel Hovde, Jacob Glanville

CNVAR: A SOFTWARE TOOL FOR GENOTYPING CYP2D6 USING SHORT READ NEXT GENERATION SEQUENCING TECHNOLOGY ..... 46
John Logan Black III MD, Hugues Sicotte PhD, Sandra E. Peterson, Kimberley J. Harris, Liewei Wang MD PhD, Steven Scherer PhD, Eric Boerwinkle PhD, Richard A. Gibbs PhD, Suzette J. Bielinski PhD, Richard Weinshilboum MD

NETWORK ANALYSIS OF DISTINCT COHORTS ALLOWS FOR THE COMPARISON OF KEY BIOLOGICAL FUNCTIONS RELATED TO TB PATHOGENESIS ..... 47
Carly Bobak, Meghan E. Muse, Alexander J. Titus, Brock C. Christensen, A. James O'Malley, Jane E. Hill

VARIATION IN OPIOID PRESCRIBING PATTERNS IN SURGICAL POPULATIONS ..... 48
Soline M. Boussard, Marylyn D. Ritchie, Michelle Whirl-Carrillo, Tina Hernandez-Boussard, Teri E. Klein

REGIONAL HETEROGENEITY IN GENE EXPRESSION, REGULATION AND COHERENCE IN HIPPOCAMPUS AND DORSOLATERAL PREFRONTAL CORTEX ACROSS DEVELOPMENT AND SCHIZOPHRENIA ..... 49
Leonardo Collado-Torres, Emily E. Burke, Amy Peterson, Joo Heon Shin, Richard E. Straub, Anandita Rajpurohit, Stephen A. Semick, William S. Ulrich, BrainSeq Consortium, Cristian Valencia, Ran Tao, Amy Deep-Soboslay, Thomas M. Hyde, Joel E. Kleinman, Daniel R. Weinberger, Andrew E. Jaffe

FULL-LENGTH SEQUENCE ASSEMBLY AND CHARACTERIZATION OF HIGHLY PURIFIED CIRCRNA ISOFORMS ..... 50
Supriyo De, Amaresh C. Panda, Myriam Gorospe

A COMPREHENSIVE REVIEW AND ASSESSMENT OF EXISTING PATHWAY ANALYSIS APPROACHES ..... 51
Tuan-Minh Nguyen, Adib Shafi, Tin Nguyen, Sorin Draghici

A NEW PHYLOGENETIC SAMPLING METHOD USING GENERALIZED-ENSEMBLE ALGORITHM ..... 52
Tetsu Furukawa, Hiroyuki Toh

CONVERGENT MECHANISMS PERTURBED BY SCATTERED SNPS SUSCEPTIBLE TO ALZHEIMER'S DISEASE ..... 53
Jiali Han, Edwin Baldwin, Jin Zhou, Fei Yin, Haiquan Li

IDENTIFICATION AND EVALUATION OF CO-EXPRESSION GENE NETWORKS FOR PACLITAXEL-INDUCED PERIPHERAL NEUROPATHY IN BREAST CANCER SURVIVORS ..... 54
Kord M. Kober, Jon D. Levine, Judy Mastick, Bruce Cooper, Steven Paul, Christine Miaskowski

VARIFI - WEB-BASED AUTOMATIC VARIANT IDENTIFICATION, FILTERING AND ANNOTATION OF AMPLICON SEQUENCING DATA ..... 55
Milica Krunic, Peter Venhuizen, Leonhard Müllauer, Bettina Kaserer, Arndt von Haeseler

STATISTICAL INFERENCE RELIEF (STIR) FEATURE SELECTION ..... 56
Trang T. Le, Ryan J. Urbanowicz, Jason H. Moore, Brett A. McKinney

DEEP LEARNING-BASED LONGITUDINAL HETEROGENEOUS DATA INTEGRATION FRAMEWORK FOR AD-RELEVANT FEATURE EXTRACTION ..... 57
Garam Lee, Kwangsik Nho, Byungkon Kang, Kyung-Ah Sohn, Dokyoon Kim

MICROBIOME ANALYSIS OF UNEXPLAINED CASES OF PNEUMONIA IN SOUTH KOREA ..... 58
Sooyeon Lim, Jae Kyung Lee, Ji Yun Noh, Woo Joo Kim

POTRA: PATHWAY ANALYSIS OF CANCER GENOMICS DATA IN THE CLOUD ..... 59
Margaret Linan, Junwen Wang, Valentin Dinu

EVALUATING CELL LINES AS MODELS FOR METASTATIC CANCER THROUGH INTEGRATIVE ANALYSIS OF OPEN GENOMIC DATA ..... 60
Ke Liu, Patrick A. Newbury, Benjamin S. Glicksberg, William Zeng, Eran R. Andrechek, Bin Chen

PATHWAY ANALYSIS OF EHR AND NON-EHR-BASED GWAS CONNECTS LIPID METABOLISM TO THE IMMUNE RESPONSE ..... 61
Jason E. Miller, Thomas J. Hoffmann, Elizabeth Theusch, Carlos Iribarren, Marisa W. Medina, Neil Risch, Ronald M. Krauss, Marylyn D. Ritchie

META-ANALYSIS OF HETEROGENEITY AND BATCH EFFECTS IN THE A549 CELL LINE ..... 62
Abigail Moore, John Castorino

HYPERPARAMETER TUNING FOR CHIP-SEQ PEAK CALLING SOFTWARE TOOLS USING PARALLELIZED BAYESIAN OPTIMIZATION ..... 63
Dongpin Oh, Jinhee Lee, Seonghyeon Kim, Dohyeon Lee, Dongwon Choo, Giltae Song

CROSS-STUDY META-ANALYSIS IDENTIFIES ALTERED BACTERIAL STRAINS SEPARATING RESPONDER AND NON-RESPONDER POPULATIONS ACROSS MULTIPLE CHECKPOINT-INHIBITOR THERAPY DATASETS ..... 64
Jayamary Divya Ravichandar, Erica Rutherford, Yonggan Wu, Thomas Weinmaier, Cheryl-Emiliane Chow, Shoko Iwai, Helena Kiefel, Kareem Graham, Karim Dabbagh, Todd DeSantis

A HYPOTHESIS OF THE STABILIZING ROLE OF ALU EXPANSION VIA HOMOLOGY DIRECTED REPAIR OF SPONTANEOUS DNA DOUBLE STRANDED BREAKS ..... 65
Tanmoy Roychowdhury, Alexej Abyzov

STATISTICAL LEARNING WITH HIGH-DIMENSIONAL MASS CYTOMETRY DATA ..... 66
Pratyaydipta Rudra, Elena Hsieh, Debashis Ghosh

HARDWARE ACCELERATION OF APPROXIMATE STRING MATCHING FOR BOTH SHORT AND LONG READ MAPPING ..... 67
Damla Senol Cali, Lavanya Subramanian, Zülal Bingöl, Jeremie S. Kim, Rachata Ausavarungnirun, Anant V. Nori, Gurpreet S. Kalsi, Sreenivas Subramoney, Saugata Ghose, Can Alkan, Onur Mutlu

TRANSITION OF REGULATORY FORCE TOWARD THE GENE EXPRESSIONS DURING OSTEOBLAST CELL DIFFERENTIATION ..... 68
Yoichi Takenaka

METHYLATION PROFILES OF MELANOMA TO PREDICT TILS ..... 69
Yihsuan Tsai, Nana Nikolaishvili Feinberg, Kathleen Conway, Sharon N. Edmiston, Nancy E. Thomas, Joel S. Parker

HIGH-THROUGHPUT GENE TO KNOWLEDGE MAPPING THROUGH MASSIVE INTEGRATION OF PUBLIC SEQUENCING DATA ..... 70
Brian Tsui, Hannah Carter

MANTA-RAE, PREDICTING THE IMPACT OF GENOME VARIANTS ON THE TRANSCRIPTION FACTOR BINDING POTENTIAL OF REGULATORY ELEMENTS ..... 71
Robin van der Lee, Phillip A. Richmond, Oriol Fornes, Wyeth W. Wasserman

USING QUANTITATIVE PHOSPHOPROTEOMICS TO UNDERSTAND FUNCTIONAL SELECTIVITY OF RECEPTOR TYROSINE KINASES ..... 72
J. Watson, C. Francavilla, J. M. Schwartz

ANERIS APPLIED: SPARK-ENABLED ANALYTICS FOR FULL-SCALE AND REPRODUCIBLE ANNOTATION-BASED GENOMIC STUDIES ..... 73
Nicholas Wheeler, Jeremy Fondran, Penny Benchek, Jonathan Haines, William S. Bush

PUTTING RELICANTHUS IN ITS PLACE: IMPACT OF MIXTURE MODEL CHOICE ON PHYLOGENETIC RECONSTRUCTION ..... 74
Madelyne Xiao, Mercer R. Brugler, Estefania Rodriguez

RATIONAL DESIGN OF NOVEL SKP2 INHIBITORS USING DEEP NEURAL NETWORKS ..... 75
Shuxing Zhang, Beibei Huang, Lon W. Fong

PATTERN RECOGNITION IN BIOMEDICAL DATA: CHALLENGES IN PUTTING BIG DATA TO WORK ..... 76

ODAL: A ONE-SHOT DISTRIBUTED ALGORITHM TO PERFORM LOGISTIC REGRESSIONS ON ELECTRONIC HEALTH RECORDS DATA FROM MULTIPLE CLINICAL SITES ..... 77
Rui Duan, Mary Regina Boland, Jason H. Moore, Yong Chen

PLATYPUS: A MULTIPLE-VIEW LEARNING PREDICTIVE FRAMEWORK FOR CANCER DRUG SENSITIVITY PREDICTION ..... 78
Kiley Graim, Verena Friedl, Kathleen E. Houlahan, Joshua M. Stuart

A SOFTWARE PIPELINE FOR DETERMINING FINE-SCALE TEMPORAL GENOME VARIATION PATTERNS IN EVOLVING POPULATIONS USING A NON-PARAMETRIC STATISTICAL TEST ..... 79
Minjung Kwak, Seokwoo Kang, Dongwon Choo, Dohyeon Lee, Jinhee Lee, Seonghyeon Kim, Giltae Song

A DEEP LEARNING APPROACH TO IDENTIFYING THE CELLULAR COMPOSITION OF SOLID TISSUE WITH DNA METHYLATION DATA ..... 80
Meghan E. Muse, Curtis L. Petersen, Carmen J. Marsit, Diane Gilbert-Diamond, Brock C. Christensen

DIRECTLY MEASURING THE RATE AND DYNAMICS OF HUMAN MUTATION BY SEQUENCING LARGE, MULTI-GENERATIONAL PEDIGREES ..... 81
Thomas A. Sasani, Brent S. Pedersen, Mark Leppert, Ray White, Lisa Baird, Aaron R. Quinlan, Lynn B. Jorde

AVAILABLE PROTEIN 3D STRUCTURES DO NOT REFLECT HUMAN GENETIC AND FUNCTIONAL DIVERSITY ..... 82
Gregory Sliwoski, Neel Patel, R. Michael Sivley, Charles R. Sanders, Jens Meiler, William S. Bush, John A. Capra

SEMANTIC WORKFLOWS FOR BENCHMARK CHALLENGES: ENHANCING COMPARABILITY, REUSABILITY AND REPRODUCIBILITY ..... 83
Arunima Srivastava, Ravali Adusumilli, Hunter Boyce, Daniel Garijo, Varun Ratnakar, Rajiv Mayani, Thomas Yu, Raghu Machiraju, Yolanda Gil, Parag Mallick

PRECISION MEDICINE: IMPROVING HEALTH THROUGH HIGH-RESOLUTION ANALYSIS OF PERSONAL DATA ..... 84

CLASS PRIOR ESTIMATION AND QUANTIFICATION OF THE LOSS AND GAIN OF RESIDUE FUNCTION UPON MUTATION ..... 85
Shantanu Jain, Jose Lugo-Martinez, Martha White, Michael W. Trosset, Predrag Radivojac

PREDICTION OF TIME TO INSULIN USING CLINICAL AND GENETIC BIOMARKERS IN TYPE 2 DIABETES PATIENTS ..... 86
Rikke Linnemann Nielsen, Louise Donnelly, Agnes Martine Nielsen, Konstantinos Tsirigos, Kaixin Zhou, Bjarne Ersboell, Line Clemmensen, Ewan Pearson, Ramneek Gupta

PATHOGENICITY AND FUNCTIONAL IMPACT OF INSERTION/DELETION AND STOP GAIN VARIATION IN THE HUMAN GENOME ..... 87
Kymberleigh A. Pagel, Danny Antaki, Matthew Mort, David N. Cooper, Jonathan Sebat, Lilia M. Iakoucheva, Sean D. Mooney, Predrag Radivojac

DETECTING POTENTIAL PLEIOTROPY ACROSS CARDIOVASCULAR AND NEUROLOGICAL DISEASES USING UNIVARIATE, BIVARIATE, AND MULTIVARIATE METHODS ON 43,870 INDIVIDUALS FROM THE EMERGE NETWORK ..... 88
Xinyuan Zhang, Yogasudha Veturi, Shefali S. Verma, William Bone, Anurag Verma, Anastasia M. Lucas, Scott Hebbring, Joshua C. Denny, Ian Stanaway, Gail P. Jarvik, David Crosslin, Eric B. Larson, Laura Rasmussen-Torvik, Sarah A. Pendergrass, Jordan W. Smoller, Hakon Hakonarson, Patrick Sleiman, Chunhua Weng, David Fasel, Wei-Qi Wei, Iftikhar Kullo, Daniel Schaid, Wendy K. Chung, Marylyn D. Ritchie

PHARMGKB: THE API AND INFOBUTTONS ..... 89
Michelle Whirl-Carrillo, Ryan M. Whaley, Mark Woon, Russ B. Altman, Teri E. Klein

SINGLE CELL ANALYSIS – WHAT IS IN THE FUTURE? ..... 90

INTRATUMOR HETEROGENEITY (ITH) METRIC OF CIRCULATING TUMOR CELL (CTC)-DERIVED XENOGRAFT MODELS IN SMALL CELL LUNG CANCER ..... 91
Yuanxin Xi, C. Allison Stewart, Carl M. Gay, Hai Tran, Bonnie Glisson, John V. Heymach, Paul Robson, Lauren A. Byers, Jing Wang

WHEN BIOLOGY GETS PERSONAL: HIDDEN CHALLENGES OF PRIVACY AND ETHICS IN BIOLOGICAL BIG DATA ..... 92

QUANTIFYING THE IDENTIFIABILITY OF INDIVIDUALS USING A SPARSE SET OF SNPS ..... 93
Prashant S. Emani, Gamze Gursoy, Mark B. Gerstein

TRANSCRIPTOMIC SUMMARY SPLICING DATA MAY LEAK PERSONAL PRIVATE INFORMATION BY COMPUTATIONAL LINKAGE TO THE GENOMIC VARIANTS ..... 94
Zhiqiang Hu, Mark B. Gerstein, Steven E. Brenner

WORKSHOPS

MERGING HETEROGENEOUS DATA TO ENABLE KNOWLEDGE DISCOVERY ..... 95

TO SEARCH A HETNET... HOW ARE TWO NODES CONNECTED? ..... 96
Daniel Himmelstein, Michael Zietz, Kyle Kloster, Michael Nagle, Blair Sullivan, Casey S. Greene

TEXT MINING AND MACHINE LEARNING FOR PRECISION MEDICINE ..... 97

LITVAR: MINING GENOMIC VARIANTS FROM BIOMEDICAL LITERATURE FOR DATABASE CURATION AND PRECISION MEDICINE ..... 98
Alexis Allot, Yifan Peng, Chih-Hsuan Wei, Kyubum Lee, Lon Phan, Zhiyong Lu

AUTHOR INDEX ..... 99

1

PATTERN RECOGNITION IN BIOMEDICAL DATA: CHALLENGES IN PUTTING BIG DATA TO WORK

PROCEEDINGS PAPERS WITH ORAL PRESENTATIONS

2

THE EFFECTIVENESS OF MULTITASK LEARNING FOR PHENOTYPING WITH ELECTRONIC HEALTH RECORDS DATA

Daisy Yi Ding¹, Chloe Simpson¹, Stephen Pfohl¹, Dave C. Kale², Kenneth Jung¹, Nigam H. Shah¹

¹Stanford University, ²University of Southern California

Ding, Daisy Yi

Electronic phenotyping is the task of ascertaining whether an individual has a medical condition of interest by analyzing their medical record and is foundational in clinical informatics. Increasingly, electronic phenotyping is performed via supervised learning. We investigate the effectiveness of multitask learning for phenotyping using electronic health records (EHR) data. Multitask learning aims to improve model performance on a target task by jointly learning additional auxiliary tasks and has been used in disparate areas of machine learning. However, its utility when applied to EHR data has not been established, and prior work suggests that its benefits are inconsistent. We present experiments that elucidate when multitask learning with neural nets improves performance for phenotyping using EHR data relative to neural nets trained for a single phenotype and to well-tuned baselines. We find that multitask neural nets consistently outperform single-task neural nets for rare phenotypes but underperform for relatively more common phenotypes. The effect size increases as more auxiliary tasks are added. Moreover, multitask learning reduces the sensitivity of neural nets to hyperparameter settings for rare phenotypes. Last, we quantify phenotype complexity and find that neural nets trained with or without multitask learning do not improve on simple baselines unless the phenotypes are sufficiently complex.

3

ODAL: A ONE-SHOT DISTRIBUTED ALGORITHM TO PERFORM LOGISTIC REGRESSIONS ON ELECTRONIC HEALTH RECORDS DATA FROM MULTIPLE CLINICAL SITES

Rui Duan, Mary Regina Boland, Jason H. Moore, Yong Chen

Department of Biostatistics, Epidemiology & Informatics, University of Pennsylvania

Chen, Yong

Electronic Health Records (EHR) contain extensive information on various health outcomes and risk factors, and therefore have been broadly used in healthcare research. Integrating EHR data from multiple clinical sites can accelerate knowledge discovery and risk prediction by providing a larger sample size in a more general population, which potentially reduces clinical bias and improves estimation and prediction accuracy. To overcome the barrier of patient-level data sharing, distributed algorithms are developed to conduct statistical analyses across multiple sites through sharing only aggregated information. Current distributed algorithms often require iterative information evaluation and transfer across sites, which can potentially lead to a high communication cost in practical settings. In this study, we propose a privacy-preserving and communication-efficient distributed algorithm for logistic regression without requiring iterative communications across sites. Our simulation study showed that our algorithm reached accuracy comparable to that of the oracle estimator, in which data are pooled together. We applied our algorithm to EHR data from the University of Pennsylvania health system to evaluate the risks of fetal loss due to various medication exposures.

4

PVC DETECTION USING A CONVOLUTIONAL AUTOENCODER AND RANDOM FOREST CLASSIFIER

Max Gordon, Cranos Williams

North Carolina State University

Gordon, Max

The accurate detection of premature ventricular contractions (PVCs) is an important task in cardiac care for some patients. In some cases, the usefulness to physicians of detecting PVCs stems from their long-term correlations with dangerous heart conditions. In other cases their potential as a precursor to serious cardiac events may make their detection a useful early warning mechanism. In many of these applications, the long-term nature of the monitoring required and the infrequency of PVCs make manual observation for PVCs impractical. Existing methods of automated PVC detection suffer from drawbacks such as the need to use difficult-to-extract morphological features, domain-specific features, or large numbers of estimated parameters. In particular, systems using large numbers of trained parameters have the potential to require large amounts of training data and computation and may have issues generalizing due to their potential to overfit. To address some of these drawbacks, we developed a novel PVC detection algorithm based around a convolutional autoencoder and validated our method using the MIT-BIH arrhythmia database.

5

PLATYPUS: A MULTIPLE-VIEW LEARNING PREDICTIVE FRAMEWORK FOR CANCER DRUG SENSITIVITY PREDICTION

Kiley Graim, Verena Friedl, Kathleen E. Houlahan, Joshua M. Stuart

Dept. of Biomolecular Engineering University of California Santa Cruz, Flatiron Institute and Princeton University, Ontario Institute of Cancer Research and University of Toronto

Graim, Kiley

Cancer is a complex collection of diseases that are to some degree unique to each patient. Precision oncology aims to identify the best drug treatment regime using molecular data on tumor samples. While omics-level data is becoming more widely available for tumor specimens, the datasets upon which computational learning methods can be trained vary in coverage from sample to sample and from data type to data type. Methods that can ‘connect the dots’ to leverage more of the information provided by these studies could offer major advantages for maximizing predictive potential. We introduce a multi-view machine-learning strategy called PLATYPUS that builds ‘views’ from multiple data sources that are all used as features for predicting patient outcomes. We show that a learning strategy that finds agreement across the views on unlabeled data increases the performance of the learning methods over any single view. We illustrate the power of the approach by deriving signatures for drug sensitivity in a large cancer cell line database. Code and additional information are available from the PLATYPUS website https://sysbiowiki.soe.ucsc.edu/platypus.

6

DEEPDOM: PREDICTING PROTEIN DOMAIN BOUNDARY FROM SEQUENCE ALONE USING STACKED BIDIRECTIONAL LSTM

Yuexu Jiang, Duolin Wang, Dong Xu

Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, Missouri 65211, USA
Email: xudong@missouri.edu

Jiang, Yuexu

Protein domain boundary prediction is usually an early step in understanding protein function and structure. Most current computational domain boundary prediction methods suffer from low accuracy and limitations in handling multi-domain types, and some cannot be applied to certain targets, such as proteins with discontinuous domains. We developed an ab-initio protein domain predictor using a stacked bidirectional LSTM model in deep learning. Our model is trained on a large number of protein sequences without using feature engineering such as sequence profiles. Hence, prediction with our method is much faster than with others, and the trained model can be applied to any type of target protein without constraint. We evaluated DeepDom by 10-fold cross validation and also by applying it to targets in different categories from CASP 8 and CASP 9. The comparison with other methods has shown that DeepDom outperforms most of the current ab-initio methods and even achieves better results than the top-level template-based method in certain cases. The code of DeepDom and the test data we used in CASP 8 and 9 can be accessed through GitHub at https://github.com/yuexujiang/DeepDom.

7

IMPLEMENTING AND EVALUATING A GAUSSIAN MIXTURE FRAMEWORK FOR IDENTIFYING GENE FUNCTION FROM TNSEQ DATA

Kevin Li¹, Rachel Chen², William Lindsey³, Aaron Best⁴, Matthew DeJongh⁴, Christopher Henry⁵, Nathan Tintle³

¹Columbia University, ²North Carolina State University, ³Dordt College, ⁴Hope College, ⁵Argonne Laboratory

Li, Kevin

The rapid acceleration of microbial genome sequencing increases opportunities to understand bacterial gene function. Unfortunately, only a small proportion of genes have been studied. Recently, TnSeq has been proposed as a cost-effective, highly reliable approach to predict gene functions as a response to changes in a cell’s fitness before and after genomic changes. However, major questions remain about how best to determine whether an observed quantitative change in fitness represents a meaningful change. To address this limitation, we develop a Gaussian mixture model framework for classifying gene function from TnSeq experiments. In order to implement the mixture model, we present the Expectation-Maximization algorithm and a hierarchical Bayesian model sampled using Stan’s Hamiltonian Monte-Carlo sampler. We compare these implementations against the frequentist method used in current TnSeq literature. From simulations and real data produced by E. coli TnSeq experiments, we show that the Bayesian implementation of the Gaussian mixture framework provides the most consistent classification results.
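As an illustration of the Expectation-Maximization idea named in this abstract, the following is a minimal sketch of EM for a one-dimensional mixture of two Gaussians. It is not the authors' implementation: the function names, the initialisation at the data extremes, and the toy `fitness_changes` values are all assumptions made for the example.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def responsibilities(x, weights, means, sds):
    """Posterior probability that x belongs to each of the two components."""
    p = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in (0, 1)]
    total = sum(p)
    if total == 0.0:  # both densities underflowed; fall back to an even split
        return [0.5, 0.5]
    return [pk / total for pk in p]


def fit_two_component_gmm(values, n_iter=200):
    """Run EM for a fixed number of iterations; returns (weights, means, sds)."""
    lo, hi = min(values), max(values)
    means = [lo, hi]                  # hypothetical initialisation at the extremes
    spread = (hi - lo) / 4.0 or 1.0   # avoid a zero starting spread
    sds = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: soft assignment of every observation to both components.
        resp = [responsibilities(x, weights, means, sds) for x in values]
        # M-step: re-estimate each component from its weighted observations.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)
    return weights, means, sds


# Toy data (invented): a small group of genes with a strong fitness drop
# near -3 and a larger group with no change near 0.
fitness_changes = [-3 + 0.1 * i for i in range(-5, 6)]
fitness_changes += [0.1 * i for i in range(-5, 6)] * 3
weights, means, sds = fit_two_component_gmm(fitness_changes)
```

A gene could then be assigned to the component with the larger value returned by `responsibilities`. The hierarchical Bayesian variant mentioned in the abstract is not sketched here.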

8

RES2S2AM: DEEP RESIDUAL NETWORK-BASED MODEL FOR IDENTIFYING FUNCTIONAL NONCODING SNPS IN TRAIT-ASSOCIATED REGIONS

Zheng Liu, Yao Yao, Qi Wei, Benjamin Weeder, Stephen A. Ramsey

Oregon State University

Liu, Zheng

Noncoding single nucleotide polymorphisms (SNPs) and their target genes are important components of the heritability of diseases and other polygenic traits. Identifying these SNPs and target genes could potentially reveal new molecular mechanisms and advance precision medicine. For polygenic traits, genome-wide association studies (GWAS) are preferred tools for identifying trait-associated regions. However, identifying causal noncoding SNPs within such regions is a difficult problem in computational biology. The DNA sequence context of a noncoding SNP is well-established as an important source of information that is beneficial for discriminating functional from nonfunctional noncoding SNPs. We describe the use of a deep residual network (ResNet)-based model (entitled Res2s2aM) that fuses flanking DNA sequence information with additional SNP annotation information to discriminate functional from nonfunctional noncoding SNPs. On a ground-truth set of disease-associated SNPs compiled from the Genome-wide Repository of Associations between SNPs and Phenotypes (GRASP) database, Res2s2aM improves the prediction accuracy of functional SNPs significantly in comparison to models based only on sequence information as well as a leading tool for post-GWAS noncoding SNP prioritization (RegulomeDB).

9

BI-DIRECTIONAL RECURRENT NEURAL NETWORK MODELS FOR GEOGRAPHIC LOCATION EXTRACTION IN BIOMEDICAL LITERATURE

Arjun Magge¹, Davy Weissenbacher², Abeed Sarker², Matthew Scotch¹, Graciela Gonzalez-Hernandez²

¹Arizona State University, ²University of Pennsylvania

Magge, Arjun

Phylogeography research involving virus spread and tree reconstruction relies on accurate geographic locations of infected hosts. The insufficient level of geographic information in nucleotide sequence repositories such as GenBank motivates the use of natural language processing methods for extracting geographic location names (toponyms) from the scientific article associated with the sequence, and disambiguating the locations to their coordinates. In this paper, we present an extensive study of multiple recurrent neural network architectures for the task of extracting geographic locations and their effective contribution to the disambiguation task using population heuristics. The methods presented in this paper achieve a strict detection F-1 score of 0.94, disambiguation accuracy of 91% and an overall resolution F-1 score of 0.88, all significantly higher than previously developed methods, improving our capability to find the location of infected hosts and enrich metadata information.

10

COMPUTATIONAL KIR COPY NUMBER DISCOVERY REVEALS INTERACTION BETWEEN INHIBITORY RECEPTOR BURDEN AND SURVIVAL

Rachel M. Pyke¹, Raphael Genolet², Alexandre Harari², George Coukos², David Gfeller², Hannah Carter¹

¹University of California-San Diego, ²Ludwig Institute for Cancer Research-University of Lausanne

Pyke, Rachel M.

Natural killer (NK) cells have increasingly become a target of interest for immunotherapies¹. NK cells express killer immunoglobulin-like receptors (KIRs), which play a vital role in immune response to tumors by detecting cellular abnormalities. The genomic region encoding the 16 KIR genes displays high polymorphic variability in human populations, making it difficult to resolve individual genotypes based on next generation sequencing data. As a result, the impact of polymorphic KIR variation on cancer phenotypes has been understudied. Currently, labor-intensive, experimental techniques are used to determine an individual’s KIR gene copy number profile. Here, we develop an algorithm to determine the germline copy number of KIR genes from whole exome sequencing data and apply it to a cohort of nearly 5000 cancer patients. We use a k-mer based approach to capture sequences unique to specific genes, count their occurrences in the set of reads derived from an individual and compare the individual’s k-mer distribution to that of the population. Copy number results demonstrate high concordance with population copy number expectations. Our method reveals that the burden of inhibitory KIR genes is associated with survival in two tumor types, highlighting the potential importance of KIR variation in understanding tumor development and response to immunotherapy.
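The k-mer step described in this abstract (capture sequences unique to specific genes, then count their occurrences in an individual's reads) can be sketched as follows. This is only an illustration under simplifying assumptions: the gene names, sequences and reads are invented, reverse complements and sequencing errors are ignored, and the comparison against a population distribution is left out.

```python
from collections import Counter


def kmers(seq, k):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def gene_specific_kmers(genes, k):
    """Map each gene name to the k-mers found in that gene and in no other gene."""
    per_gene = {name: kmers(seq, k) for name, seq in genes.items()}
    owners = Counter()
    for gene_kmers in per_gene.values():
        owners.update(gene_kmers)  # number of genes containing each k-mer
    return {name: {km for km in gene_kmers if owners[km] == 1}
            for name, gene_kmers in per_gene.items()}


def count_unique_kmer_hits(reads, unique, k):
    """Count, per gene, how many k-mers in the reads match that gene's unique k-mers."""
    lookup = {km: name for name, gene_kmers in unique.items() for km in gene_kmers}
    hits = Counter({name: 0 for name in unique})
    for read in reads:
        for i in range(len(read) - k + 1):
            gene = lookup.get(read[i:i + k])
            if gene is not None:
                hits[gene] += 1
    return hits


# Hypothetical toy input: two short "genes" that share the k-mer ACGT.
genes = {"geneA": "ACGTACGGA", "geneB": "ACGTTTTCA"}
unique = gene_specific_kmers(genes, k=4)
hits = count_unique_kmer_hits(["GTACGG", "TTTTCA", "ACGT"], unique, k=4)
```

In a real setting the per-gene counts would need normalising (for example by sequencing depth) before being compared across individuals; how the authors do this is not stated in the abstract.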

11

SEMANTIC WORKFLOWS FOR BENCHMARK CHALLENGES: ENHANCING COMPARABILITY, REUSABILITY AND REPRODUCIBILITY

Arunima Srivastava¹, Ravali Adusumilli², Hunter Boyce², Daniel Garijo³, Varun Ratnakar³, Rajiv Mayani³, Thomas Yu⁴, Raghu Machiraju¹, Yolanda Gil³, Parag Mallick²

¹The Ohio State University, ²Stanford University, ³University of Southern California, ⁴Sage Bionetworks

Srivastava, Arunima

Benchmark challenges, such as the Critical Assessment of Structure Prediction (CASP) and Dialogue for Reverse Engineering Assessments and Methods (DREAM), have been instrumental in driving the development of bioinformatics methods. Typically, challenges are posted, and then competitors perform a prediction based upon blinded test data. Challengers then submit their answers to a central server where they are scored. Recent efforts to automate these challenges have been enabled by systems in which challengers submit Docker containers, a unit of software that packages up code and all of its dependencies, to be run on the cloud. Despite their incredible value for providing an unbiased test-bed for the bioinformatics community, there remain opportunities to further enhance the potential impact of benchmark challenges. Specifically, current approaches only evaluate end-to-end performance; it is nearly impossible to directly compare methodologies or parameters. Furthermore, the scientific community cannot easily reuse challengers’ approaches, due to lack of specifics, ambiguity in tools and parameters as well as problems in sharing and maintenance. Lastly, the intuition behind why particular steps are used is not captured, as the proposed workflows are not explicitly defined, making it cumbersome to understand the flow and utilization of data. Here we introduce an approach to overcome these limitations based upon the WINGS semantic workflow system. Specifically, WINGS enables researchers to submit complete semantic workflows as challenge submissions. By submitting entries as workflows, it then becomes possible to compare not just the results and performance of a challenger, but also the methodology employed. This is particularly important when dozens of challenge entries may use nearly identical tools, but with only subtle changes in parameters (and radical differences in results). WINGS uses a component driven workflow design and offers intelligent parameter and data selection by reasoning about data characteristics. This proves to be especially critical in bioinformatics workflows where using default or incorrect parameter values is prone to drastically altering results. Different challenge entries may be readily compared through the use of abstract workflows, which also facilitate reuse. WINGS is housed on a cloud based setup, which stores data, dependencies and workflows for easy sharing and utility. It also has the ability to scale workflow executions using distributed computing through the Pegasus workflow execution system. We demonstrate the application of this architecture to the DREAM proteogenomic challenge.

12

REMOVING CONFOUNDING FACTORS ASSOCIATED WEIGHTS IN DEEP NEURAL NETWORKS IMPROVES THE PREDICTION ACCURACY FOR HEALTHCARE APPLICATIONS

Haohan Wang¹, Zhenglin Wu², Eric P. Xing³

¹Carnegie Mellon University, ²University of Illinois Urbana-Champaign, ³Carnegie Mellon University

Wang, Haohan

The proliferation of healthcare data has created opportunities to apply data-driven approaches, such as machine learning methods, to assist diagnosis. Recently, many deep learning methods have shown impressive success in predicting disease status from raw input data. However, the "black-box" nature of deep learning and the high-reliability requirement of biomedical applications have created new challenges regarding the existence of confounding factors. In this paper, with a brief argument that inappropriate handling of confounding factors will lead to models' sub-optimal performance in real-world applications, we present an efficient method that can remove the influences of confounding factors such as age or gender to improve the across-cohort prediction accuracy of neural networks. One distinct advantage of our method is that it only requires minimal changes to the baseline model's architecture so that it can be plugged into most existing neural networks. We conduct experiments across CT-scan, MRA, and EEG brainwave data with convolutional neural networks and LSTM to verify the efficiency of our method.

13

PRECISION MEDICINE: IMPROVING HEALTH THROUGH HIGH-RESOLUTION ANALYSIS OF PERSONAL DATA

PROCEEDINGS PAPERS WITH ORAL PRESENTATIONS

14

AN OPTIMAL POLICY FOR PATIENT LABORATORY TESTS IN INTENSIVE CARE UNITS

Li-Fang Cheng, Niranjani Prasad, Barbara E. Engelhardt

Princeton University

Prasad, Niranjani

Laboratory testing is an integral tool in the management of patient care in hospitals, particularly in intensive care units (ICUs). There exists an inherent trade-off in the selection and timing of lab tests between considerations of the expected utility in clinical decision-making of a given test at a specific time, and the associated cost or risk it poses to the patient. In this work, we introduce a framework that learns policies for ordering lab tests that optimize for this trade-off. Our approach uses batch off-policy reinforcement learning with a composite reward function based on clinical imperatives, applied to data that include examples of clinicians ordering labs for patients. To this end, we develop and extend principles of Pareto optimality to improve the selection of actions based on multiple reward function components while respecting typical procedural considerations and prioritization of clinical goals in the ICU. Our experiments show that we can estimate a policy that reduces the frequency of lab tests and optimizes timing to minimize information redundancy. We also find that the estimated policies typically suggest ordering lab tests well ahead of critical onsets (such as mechanical ventilation or dialysis) that depend on the lab results. We evaluate our approach by quantifying how these policies may initiate earlier onset of treatment.
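The notion of Pareto optimality over multiple reward components can be illustrated with a small sketch. It shows only the basic dominance filter, not the extended principles the authors develop, and the action names and reward values below are invented for the example (first component: assumed information gain; second: assumed negative cost to the patient).

```python
def dominates(a, b):
    """True if reward vector a is at least as good as b in every component
    and strictly better in at least one (higher is better)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_optimal_actions(rewards):
    """Return the actions whose reward vectors are not dominated by any other action.

    rewards: dict mapping an action name to a tuple of reward components.
    """
    return [action for action, r in rewards.items()
            if not any(dominates(other, r)
                       for name, other in rewards.items() if name != action)]


# Hypothetical reward vectors: (information gain, negative cost).
rewards = {
    "order_now": (0.9, -0.5),
    "wait": (0.2, 0.0),
    "order_redundant": (0.1, -0.5),
}
front = pareto_optimal_actions(rewards)
```

Here `order_redundant` is dropped because `order_now` is at least as good on both components, while `order_now` and `wait` both remain since each is better on one component. Choosing among the remaining actions needs a further rule, such as the prioritization of clinical goals that the abstract mentions.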

15

CROWDVARIANT: A CROWDSOURCING APPROACH TO CLASSIFY COPY NUMBER VARIANTS

Peyton Greenside¹, Justin Zook², Marc Salit³, Ryan Poplin⁴, Madeleine Cule⁵, Mark DePristo⁴

¹Stanford University, ²National Institute of Standards and Technology (NIST), ³National Institute of Standards and Technology (NIST) / Joint Initiative for Metrology in Biology (JIMB), ⁴Google Inc. / Verily Life Sciences, ⁵Calico / Verily Life Sciences

Greenside, Peyton

Copy number variants (CNVs) are an important type of genetic variation that play a causal role in many diseases. The ability to identify high quality CNVs is of substantial clinical relevance. However, CNVs are notoriously difficult to identify accurately from array-based methods and next-generation sequencing (NGS) data, particularly for small (<10 kbp) CNVs. Manual curation by experts widely remains the gold standard but cannot scale with the pace of sequencing, particularly in fast-growing clinical applications. We present the first proof-of-principle study demonstrating high throughput manual curation of putative CNVs by non-experts. We developed a crowdsourcing framework, called CrowdVariant, that leverages Google's high-throughput crowdsourcing platform to create a high confidence set of deletions for NA24385 (NIST HG002/RM 8391), an Ashkenazim reference sample developed in partnership with the Genome In A Bottle (GIAB) Consortium. We show that non-experts tend to agree both with each other and with experts on putative CNVs. We show that crowdsourced non-expert classifications can be used to accurately assign copy number status to putative CNV calls and identify 1,781 high confidence deletions in a reference sample. Multiple lines of evidence suggest these calls are a substantial improvement over existing CNV callsets and can also be useful in benchmarking and improving CNV calling algorithms. Our crowdsourcing methodology takes the first step toward showing the clinical potential for manual curation of CNVs at scale and can further guide other crowdsourcing genomics applications.

16

A REPOSITORY OF MICROBIAL MARKER GENES RELATED TO HUMAN HEALTH AND DISEASES FOR HOST PHENOTYPE PREDICTION USING MICROBIOME DATA

Wontack Han, Yuzhen Ye

Indiana University

Han, Wontack

Microbiome research is going through an evolutionary transition from focusing on the characterization of reference microbiomes associated with different environments/hosts to translational applications, including using the microbiome for disease diagnosis, improving the efficacy of cancer treatments, and prevention of diseases (e.g., using probiotics). Microbial markers have been identified from microbiome data derived from cohorts of patients with different diseases, treatment responsiveness, etc., and predictors based on these markers have often been built for predicting host phenotype given a microbiome dataset (e.g., to predict if a person has type 2 diabetes given his or her microbiome data). Unfortunately, these microbial markers and predictors are often not published and so are not reusable by others. In this paper, we report the curation of a repository of microbial marker genes and predictors built from these markers for microbiome-based prediction of host phenotype, and a computational pipeline called Mi2P (from Microbiome to Phenotype) for using the repository. As an initial effort, we focus on microbial marker genes related to two diseases, type 2 diabetes and liver cirrhosis, and immunotherapy efficacy for two types of cancer, non-small-cell lung cancer (NSCLC) and renal cell carcinoma (RCC). We characterized the marker genes from metagenomic data using our recently developed subtractive assembly approach. We showed that predictors built from these microbial marker genes can provide fast and reasonably accurate prediction of host phenotype given microbiome data. As understanding and making use of microbiome data (our second genome) is becoming vital as we move forward in this age of precision health and precision medicine, we believe that such a repository will be useful for enabling translational applications of microbiome data.

17

AICM: A GENUINE FRAMEWORK FOR CORRECTING INCONSISTENCY BETWEEN LARGE PHARMACOGENOMICS DATASETS

Zhiyue Tom Hu¹, Yuting Ye¹, Patrick A. Newbury², Haiyan Huang²,³,⁴, Bin Chen⁵

¹University of California Berkeley, Department of Biostatistics; ²University of California Berkeley, Department of Pediatrics and Human Development; ³Michigan State University, Department of Statistics; ⁴University of California Berkeley, Department of Pharmacology and Toxicology; ⁵Michigan State University

Hu, Zhiyue

The inconsistency of open pharmacogenomics datasets produced by different studies limits the usage of such datasets in many tasks, such as biomarker discovery. Investigation of multiple pharmacogenomics datasets confirmed that the pairwise sensitivity data correlation between drugs, or rows, across different studies (drug-wise) is relatively low, while the pairwise sensitivity data correlation between cell-lines, or columns, across different studies (cell-wise) is considerably strong. This common interesting observation across multiple pharmacogenomics datasets suggests the existence of subtle consistency among the different studies (i.e., strong cell-wise correlation). However, significant noise is also present (i.e., weak drug-wise correlation) and has prevented researchers from comfortably using the data directly. Motivated by this observation, we propose a novel framework for addressing the inconsistency between large-scale pharmacogenomics datasets. Our method can significantly boost the drug-wise correlation and can be easily applied to re-summarized and normalized datasets proposed by others. We also investigate our algorithm based on many different criteria to demonstrate that the corrected datasets are not only consistent, but also biologically meaningful. Eventually, we propose to extend our main algorithm into a framework, so that in the future when more datasets become publicly available, our framework can hopefully offer a "ground-truth" guidance for references.

18

INTEGRATING RNA EXPRESSION AND VISUAL FEATURES FOR IMMUNE INFILTRATE PREDICTION

Derek Reiman¹, Lingdao Sha¹, Irvin Ho¹, Timothy Tan², Denise Lau¹, Aly A. Khan³

¹Tempus Labs, ²Northwestern University, ³Toyota Technological Institute at Chicago

Khan, Aly

Patient responses to cancer immunotherapy are shaped by their unique genomic landscape and tumor microenvironment. Clinical advances in immunotherapy are changing the treatment landscape by enhancing a patient's immune response to eliminate cancer cells. While this provides potentially beneficial treatment options for many patients, only a minority of these patients respond to immunotherapy. In this work, we examined RNA-seq data and digital pathology images from individual patient tumors to more accurately characterize the tumor-immune microenvironment. Several studies implicate an inflamed microenvironment and increased percentage of tumor infiltrating immune cells with better response to specific immunotherapies in certain cancer types. We developed NEXT (Neural-based models for integrating gene EXpression and visual Texture features) to more accurately model immune infiltration in solid tumors. To demonstrate the utility of the NEXT framework, we predicted immune infiltrates across four different cancer types and evaluated our predictions against expert pathology review. Our analyses demonstrate that integration of imaging features improves prediction of the immune infiltrate. Of note, this effect was preferentially observed for B cells and CD8 T cells. In sum, our work effectively integrates both RNA-seq and imaging data in a clinical setting and provides a more reliable and accurate prediction of the immune composition in individual patient tumors.

19

OUTGROUP MACHINE LEARNING APPROACH IDENTIFIES SINGLE NUCLEOTIDE VARIANTS IN NONCODING DNA ASSOCIATED WITH AUTISM SPECTRUM DISORDER

Maya Varma, Kelley Marie Paskov, Jae-Yoon Jung, Brianna Sierra Chrisman, Nate Tyler Stockham, Peter Yigitcan Washington, Dennis Paul Wall

Stanford University

Varma, Maya

Autism spectrum disorder (ASD) is a heritable neurodevelopmental disorder affecting 1 in 59 children. While noncoding genetic variation has been shown to play a major role in many complex disorders, the contribution of these regions to ASD susceptibility remains unclear. Genetic analyses of ASD typically use unaffected family members as controls; however, we hypothesize that this method does not effectively elevate variant signal in the noncoding region due to family members having subclinical phenotypes arising from common genetic mechanisms. In this study, we use a separate, unrelated outgroup of individuals with progressive supranuclear palsy (PSP), a neurodegenerative condition with no known etiological overlap with ASD, as a control population. We use whole genome sequencing data from a large cohort of 2182 children with ASD and 379 controls with PSP, sequenced at the same facility with the same machines and variant calling pipeline, in order to investigate the role of noncoding variation in the ASD phenotype. We analyze seven major types of noncoding variants: microRNAs, human accelerated regions, hypersensitive sites, transcription factor binding sites, DNA repeat sequences, simple repeat sequences, and CpG islands. After identifying and removing batch effects between the two groups, we trained an l1-regularized logistic regression classifier to predict ASD status from each set of variants. The classifier trained on simple repeat sequences performed well on a held-out test set (AUC-ROC = 0.960); this classifier was also able to differentiate ASD cases from controls when applied to a completely independent dataset (AUC-ROC = 0.960). This suggests that variation in simple repeat regions is predictive of the ASD phenotype and may contribute to ASD risk. Our results show the importance of the noncoding region and the utility of independent control groups in effectively linking genetic variation to disease phenotype for complex disorders.

20

PRECISION DRUG REPURPOSING VIA CONVERGENT EQTL-BASED MOLECULES AND PATHWAY TARGETING INDEPENDENT DISEASE-ASSOCIATED POLYMORPHISMS

Francesca Vitali¹,², Joanne Berghout¹,²,³, Jungwei Fan¹,², Jianrong Li¹, Qike Li¹, Haiquan Li¹,²,⁴, Yves A. Lussier¹,²,³,⁵

¹Center for Biomedical Informatics and Biostatistics (CB2) of The University of Arizona, ²Department of Medicine COM-T of The University of Arizona, ³The Center for Applied Genetics and Genomics in Medicine of The University of Arizona, ⁴Department of Biosystems Engineering of The University of Arizona, ⁵UA Cancer Center, UA Health Science (UAHS) of The University of Arizona

Vitali, Francesca

Repurposing existing drugs for new therapeutic indications can improve success rates and streamline development. Use of large-scale biomedical data repositories, including eQTL regulatory relationships and genome-wide disease risk associations, offers opportunities to propose novel indications for drugs targeting common or convergent molecular candidates associated with two or more diseases. This proposed novel computational approach scales across 262 complex diseases, building a multi-partite hierarchical network integrating (i) GWAS-derived SNP-to-disease associations, (ii) eQTL-derived SNP-to-eGene associations incorporating both cis- and trans-relationships from 19 tissues, (iii) protein target-to-drug, and (iv) drug-to-disease indications with (v) Gene Ontology-based information theoretic semantic (ITS) similarity calculated between protein target functions. Our hypothesis is that if two diseases are associated with a common or functionally similar eGene, and a drug targeting that eGene/protein in one disease exists, the second disease becomes a potential repurposing indication. To explore this, all possible pairs of independently segregating GWAS-derived SNPs were generated, and a statistical network of similarity within each SNP-SNP pair was calculated according to scale-free overrepresentation of convergent biological processes activity in regulated eGenes (ITS_eGENE-eGENE) and scale-free overrepresentation of common eGene targets between the two SNPs (ITS_SNP-SNP). Significance of ITS_SNP-SNP was conservatively estimated using empirical scale-free permutation resampling keeping the node-degree constant for each molecule in each permutation. We identified 26 new drug repurposing indication candidates spanning 89 GWAS diseases, including a potential repurposing of the calcium-channel blocker Verapamil from coronary disease to gout. Predictions from our approach are compared to known drug indications using DrugBank as a gold standard (odds ratio = 13.1, p-value = 2.49 × 10⁻⁸). Because of specific disease-SNP associations to candidate drug targets, the proposed method provides evidence for future precision drug repositioning to a patient’s specific polymorphisms.

21

DETECTING POTENTIAL PLEIOTROPY ACROSS CARDIOVASCULAR AND NEUROLOGICAL DISEASES USING UNIVARIATE, BIVARIATE, AND MULTIVARIATE METHODS ON 43,870 INDIVIDUALS FROM THE EMERGE NETWORK

Xinyuan Zhang¹, Yogasudha Veturi¹, Shefali S. Verma¹, William Bone¹, Anurag Verma¹, Anastasia M. Lucas¹, Scott Hebbring², Joshua C. Denny³, Ian Stanaway⁴, Gail P. Jarvik⁴, David Crosslin⁴, Eric B. Larson⁵, Laura Rasmussen-Torvik⁶, Sarah A. Pendergrass⁷, Jordan W. Smoller⁸, Hakon Hakonarson⁹, Patrick Sleiman⁹, Chunhua Weng¹⁰, David Fasel¹⁰, Wei-Qi Wei³, Iftikhar Kullo¹¹, Daniel Schaid¹¹, Wendy K. Chung¹⁰, Marylyn D. Ritchie¹

¹University of Pennsylvania, ²Marshfield Clinic, ³Vanderbilt University, ⁴University of Washington, ⁵Kaiser Permanente Washington Health Research Institute, ⁶Northwestern University, ⁷Geisinger Health System, ⁸Massachusetts General Hospital, ⁹Children's Hospital of Philadelphia, ¹⁰Columbia University, ¹¹Mayo Clinic

Zhang, Xinyuan

The link between cardiovascular diseases and neurological disorders has been widely observed in the aging population. Disease prevention and treatment rely on understanding the potential genetic nexus of multiple diseases in these categories. In this study, we were interested in detecting pleiotropy, or the phenomenon in which a genetic variant influences more than one phenotype. Marker-phenotype association approaches can be grouped into univariate, bivariate, and multivariate categories based on the number of phenotypes considered at one time. Here we applied one statistical method per category followed by an eQTL colocalization analysis to identify potential pleiotropic variants that contribute to the link between cardiovascular and neurological diseases. We performed our analyses on ~530,000 common SNPs coupled with 65 electronic health record (EHR)-based phenotypes in 43,870 unrelated European adults from the Electronic Medical Records and Genomics (eMERGE) network. There were 31 variants identified by all three methods that showed significant associations across late-onset cardiac and neurologic diseases. We further investigated functional implications of gene expression on the detected “lead SNPs” via colocalization analysis, providing a deeper understanding of the discovered associations. In summary, we present the framework and landscape for detecting potential pleiotropy using univariate, bivariate, multivariate, and colocalization methods. Further exploration of these potentially pleiotropic genetic variants will work toward understanding disease-causing mechanisms across cardiovascular and neurological diseases and may assist in considering disease prevention as well as drug repositioning in future research.

22

SINGLE CELL ANALYSIS – WHAT IS THE FUTURE?

PROCEEDINGS PAPERS WITH ORAL PRESENTATIONS

23

LISA: ACCURATE RECONSTRUCTION OF CELL TRAJECTORY AND PSEUDO-TIME FOR MASSIVE SINGLE CELL RNA-SEQ DATA

Yang Chen¹, Yuping Zhang², Zhengqing Ouyang¹

¹The Jackson Laboratory for Genomic Medicine, ²University of Connecticut

Ouyang, Zhengqing

Cell trajectory reconstruction based on single cell RNA sequencing is important for obtaining the landscape of different cell types and discovering cell fate transitions. Despite intense effort, analyzing massive single cell RNA-seq datasets is still challenging. We propose a new method named Landmark Isomap for Single-cell Analysis (LISA). LISA is an unsupervised approach to build cell trajectory and compute pseudo-time in the isometric embedding based on geodesic distances. The advantages of LISA include: (1) It utilizes a k-nearest-neighbor graph and hierarchical clustering to identify cell clusters, peaks and valleys in a low-dimension representation of the data; (2) Based on Landmark Isomap, it constructs the main geometric structure of cell lineages; (3) It projects cells to the edges of the main cell trajectory to generate the global pseudo-time. Assessments on simulated and real datasets demonstrate the advantages of LISA on cell trajectory and pseudo-time reconstruction compared to Monocle2 and TSCAN. LISA is accurate, fast, and requires less memory, allowing its application to massive single cell datasets generated from current experimental platforms.
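Two ingredients named in this abstract, a k-nearest-neighbor graph and geodesic distances, can be sketched in a few lines. This is not LISA itself: it omits the clustering, the Landmark Isomap embedding and the projection onto trajectory edges, and the toy coordinates and the choice of cell 0 as the root are assumptions made for illustration.

```python
import heapq
import math


def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph: node -> {neighbour: Euclidean distance}."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted((math.dist(p, q), j)
                         for j, q in enumerate(points) if j != i)
        for d, j in nearest[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def geodesic_distances(graph, source):
    """Shortest-path length from source to every reachable node (Dijkstra)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Hypothetical cells lying along an L-shaped path in a 2-D embedding.
cells = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
graph = knn_graph(cells, k=1)
pseudo_time = geodesic_distances(graph, source=0)
```

Along the bend, the geodesic distance from the first to the last cell is 4, whereas the straight-line distance is about 2.83, which is why graph-based distances are commonly preferred when cells lie on a curved trajectory.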

24

PARAMETER TUNING IS A KEY PART OF DIMENSIONALITY REDUCTION VIA DEEP VARIATIONAL AUTOENCODERS FOR SINGLE CELL RNA TRANSCRIPTOMICS

Qiwen Hu, Casey S. Greene

University of Pennsylvania

Hu, Qiwen

Single-cell RNA sequencing (scRNA-seq) is a powerful tool to profile the transcriptomes of a large number of individual cells at a high resolution. These data usually contain measurements of gene expression for many genes in thousands or tens of thousands of cells, though some datasets now reach the million-cell mark. Projecting high-dimensional scRNA-seq data into a low dimensional space aids downstream analysis and data visualization. Many recent preprints accomplish this using variational autoencoders (VAE), generative models that learn the underlying structure of data by compressing it into a constrained, low dimensional space. The low dimensional spaces generated by VAEs have revealed complex patterns and novel biological signals from large-scale gene expression data and drug response predictions. Here, we evaluate a simple VAE approach for gene expression data, Tybalt, by training and measuring its performance on sets of simulated scRNA-seq data. We find a number of counter-intuitive performance features: for example, deeper neural networks can struggle when datasets contain more observations under some parameter configurations. We show that these methods are highly sensitive to parameter tuning: when tuned, the Tybalt model, which was not optimized for scRNA-seq data, outperforms other popular dimension reduction approaches: PCA, ZIFA, UMAP and t-SNE. On the other hand, without tuning, performance can also be remarkably poor on the same data. Our results should discourage authors and reviewers from relying on self-reported performance comparisons to evaluate the relative value of contributions in this area at this time. Instead, we recommend that attempts to compare or benchmark autoencoder methods for scRNA-seq data be performed by disinterested third parties or by methods developers only on unseen benchmark data that are provided to all participants simultaneously, because the potential for performance differences due to unequal parameter tuning is so high.

25

TOPOLOGICAL METHODS FOR VISUALIZATION AND ANALYSIS OF HIGH DIMENSIONAL SINGLE-CELL RNA SEQUENCING DATA

Tongxin Wang1, Travis Johnson2, Jie Zhang3, Kun Huang4,5

1Department of Computer Science, Indiana University Bloomington; 2Department of Biomedical Informatics, Ohio State University; 3Department of Medical and Molecular Genetics, Indiana University School of Medicine; 4Department of Medicine, Indiana University School of Medicine; 5Regenstrief Institute

Wang, Tongxin

Single-cell RNA sequencing (scRNA-seq) techniques have been very powerful in analyzing heterogeneous cell populations and identifying cell types. Visualizing scRNA-seq data can help researchers effectively extract meaningful biological information and make new discoveries. While commonly used scRNA-seq visualization methods, such as t-SNE, are useful in detecting cell clusters, they often tear apart the intrinsic continuous structure in gene expression profiles. Topological Data Analysis (TDA) approaches like Mapper capture the shape of data by representing data as topological networks. TDA approaches are robust to noise and different platforms, while preserving the locality and data continuity. Moreover, instead of analyzing the whole dataset, Mapper allows researchers to explore biological meanings of specific pathways and genes by using different filter functions. In this paper, we applied Mapper to visualize scRNA-seq data. Our method can not only capture the clustering structure of cells, but also preserve the continuous gene expression topologies of cells. We demonstrated that by combining with gene co-expression network analysis, our method can reveal differential expression patterns of gene co-expression modules along the Mapper visualization.

26

WHEN BIOLOGY GETS PERSONAL: HIDDEN CHALLENGES OF PRIVACY AND ETHICS IN BIOLOGICAL BIG DATA

PROCEEDINGS PAPERS WITH ORAL PRESENTATIONS

27

LEVERAGING SUMMARY STATISTICS TO MAKE INFERENCES ABOUT COMPLEX PHENOTYPES IN LARGE BIOBANKS

Angela Gasdaska1, Derek Friend2, Rachel Chen3, Jason Westra4, Matthew Zawistowski5, William Lindsey4, Nathan Tintle4

1Emory University, 2University of Nevada Reno, 3North Carolina State University, 4Dordt College, 5University of Michigan Ann Arbor

Tintle, Nathan

As genetic sequencing becomes less expensive and datasets linking genetic data and medical records (e.g., Biobanks) become larger and more common, issues of data privacy and computational challenges become more necessary to address in order to realize the benefits of these datasets. One possibility for alleviating these issues is through the use of already-computed summary statistics (e.g., slopes and standard errors from a regression model of a phenotype on a genotype). If groups share summary statistics from their analyses of biobanks, many of the privacy issues and computational challenges concerning the access of these data could be bypassed. In this paper we explore the possibility of using summary statistics from simple linear models of phenotype on genotype in order to make inferences about more complex phenotypes (those that are derived from two or more simple phenotypes). We provide exact formulas for the slope, intercept, and standard error of the slope for linear regressions when combining phenotypes. Derived equations are validated via simulation and tested on a real dataset exploring the genetics of fatty acids.
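One simple case of combining phenotypes can be checked directly: for a derived phenotype that is a linear combination a*y1 + b*y2, ordinary least squares is linear in the response, so the slope and intercept follow from the per-phenotype summary statistics alone. The sketch below illustrates only that case, with made-up genotype and phenotype values; it is not the paper's derivation, and it does not attempt the standard error, which needs information beyond the two slopes.

```python
def ols(x, y):
    """Simple linear regression of y on x; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical toy data: genotype coded 0/1/2 and two simple phenotypes.
genotype = [0, 1, 2, 0, 1, 2, 1, 0]
pheno1 = [1.0, 1.4, 2.1, 0.8, 1.6, 1.9, 1.2, 1.1]
pheno2 = [3.2, 2.9, 2.4, 3.5, 3.0, 2.2, 2.8, 3.3]
a, b = 1.0, 2.0  # weights defining the derived phenotype a*y1 + b*y2

# Summary statistics that each group could share without individual-level data.
int1, slope1 = ols(genotype, pheno1)
int2, slope2 = ols(genotype, pheno2)
from_summary = (a * int1 + b * int2, a * slope1 + b * slope2)

# Direct fit on the derived phenotype, for comparison only.
derived = [a * y1 + b * y2 for y1, y2 in zip(pheno1, pheno2)]
direct = ols(genotype, derived)
```

Here `from_summary` and `direct` agree up to floating-point error, which is the kind of check a simulation-based validation would perform at scale.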

28

EVALUATION OF PATIENT RE-IDENTIFICATION USING LABORATORY TEST ORDERS AND MITIGATION VIA LATENT SPACE VARIABLES

Kipp W. Johnson1, Jessica K. De Freitas1, Benjamin S. Glicksberg1, Jason R. Bobe1, Joel T. Dudley2

1Institute for Next Generation Healthcare - Department of Genetics and Genomics Sciences - Icahn School of Medicine at Mount Sinai, 2Bakar Computational Health Sciences Institute, The University of California San Francisco

De Freitas, Jessica

A variety of clinical data abstracted and anonymized from electronic health records (EHR) are often used for research purposes. One consistent concern with this type of research is the risk for re-identification of patients from their anonymized data. Here, we use the EHR of 731,850 patients to demonstrate that the average patient is unique from all others 98.4% of the time simply by examining what laboratory tests have been ordered for them. By the time a patient has visited the hospital on two separate days, they are unique in 74.2% of cases. We further present a computational study to identify how accurately the records from a single day of care can be used to re-identify patients from a set of 99 other patients. We show that, given a single visit's laboratory orders for a patient, we can re-identify the patient at least 25% of the time. Furthermore, we can place this patient among the top 10 most similar patients 47% of the time. Finally, we present a proof-of-concept technique using a variational autoencoder to encode laboratory results into a lower-dimensional latent space. We demonstrate that releasing latent-space encoded laboratory orders significantly improves privacy compared to releasing raw laboratory orders (<5% re-identification), while preserving information contained within the laboratory orders (AUC of >0.9 for recreating encoded values). Our findings potentially have consequences for the public release of anonymized laboratory tests to the biomedical research community. We wish to note that our findings do not imply that laboratory tests alone are personally identifiable, but would require a threat actor having an external source of laboratory values which are linked to personal identifiers to begin with.
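A uniqueness measurement of this kind can be sketched as follows. The abstract does not spell out its exact definition, so this sketch assumes that a patient counts as unique when no other patient has the identical set of ordered tests; the function name and the four toy patients are hypothetical.

```python
from collections import Counter

def fraction_unique(orders_by_patient):
    """Fraction of patients whose set of ordered tests is shared with no one else."""
    # Order and repetition of tests are ignored: each patient becomes a frozenset.
    signatures = {pid: frozenset(tests) for pid, tests in orders_by_patient.items()}
    counts = Counter(signatures.values())
    unique = sum(1 for sig in signatures.values() if counts[sig] == 1)
    return unique / len(signatures)

# Hypothetical toy records: p1 and p2 share the same set of orders.
toy = {
    "p1": ["CBC", "BMP"],
    "p2": ["BMP", "CBC"],
    "p3": ["CBC", "HbA1c"],
    "p4": ["TSH"],
}
share = fraction_unique(toy)  # 2 of 4 patients are unique -> 0.5
```

Hashing each patient's orders as a frozenset keeps the computation linear in the number of patients, which matters at the scale of hundreds of thousands of records.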

29

PROTECTING GENOMIC DATA PRIVACY WITH PROBABILISTIC MODELING

Sean Simmons1, Bonnie Berger2, Cenk Sahinalp3

1Broad Institute, 2MIT, 3Indiana University

Simmons, Sean

The proliferation of sequencing technologies in biomedical research has raised many new privacy concerns. These include concerns over the publication of aggregate data at a genomic scale (e.g., minor allele frequencies, regression coefficients). Methods such as differential privacy can overcome these concerns by providing strong privacy guarantees, but come at the cost of greatly perturbing the results of the analysis of interest. Here we investigate an alternative approach for achieving privacy-preserving aggregate genomic data sharing without the high cost to accuracy of differentially private methods. In particular, we demonstrate how other ideas from the statistical disclosure control literature (in particular, the idea of disclosure risk) can be applied to aggregate data to help ensure privacy. This is achieved by combining minimal amounts of perturbation with Bayesian statistics and Markov Chain Monte Carlo techniques. We test our technique on a GWAS dataset to demonstrate its utility in practice. An implementation is available at https://github.com/seanken/PrivMCMC.

30

PATTERN RECOGNITION IN BIOMEDICAL DATA: CHALLENGES IN PUTTING BIG DATA TO WORK

PROCEEDINGS PAPERS WITH POSTER PRESENTATIONS

31

SNPS2CHIP: LATENT FACTORS OF CHIP-SEQ TO INFER FUNCTIONS OF NON-CODING SNPS

Shankara Anand, Laurynas Kalesinskas, Craig Smail, Yosuke Tanigawa

Stanford University

Tanigawa, Yosuke

Genetic variations of the human genome are linked to many disease phenotypes. While whole-genome sequencing and genome-wide association studies (GWAS) have uncovered a number of genotype-phenotype associations, their functional interpretation remains challenging given most single nucleotide polymorphisms (SNPs) fall into the non-coding region of the genome. Advances in chromatin immunoprecipitation sequencing (ChIP-seq) have made large-scale repositories of epigenetic data available, allowing investigation of coordinated mechanisms of epigenetic markers and transcriptional regulation and their influence on biological function. To address this, we propose SNPs2ChIP, a method to infer biological functions of non-coding variants through unsupervised statistical learning methods applied to publicly-available epigenetic datasets. We systematically characterized latent factors by applying singular value decomposition to ChIP-seq tracks of lymphoblastoid cell lines, and annotated the biological function of each latent factor using the genomic region enrichment analysis tool. Using these annotated latent factors as reference, we developed SNPs2ChIP, a pipeline that takes genomic region(s) as an input, identifies the relevant latent factors with quantitative scores, and returns them along with their inferred functions. As a case study, we focused on systemic lupus erythematosus and demonstrated our method's ability to infer relevant biological function. We systematically applied SNPs2ChIP on publicly available datasets, including known GWAS associations from the GWAS catalogue and ChIP-seq peaks from a previously published study. Our approach to leverage latent patterns across genome-wide epigenetic datasets to infer the biological function will advance understanding of the genetics of human diseases by accelerating the interpretation of non-coding genomes.

32

DNA STEGANALYSIS USING DEEP RECURRENT NEURAL NETWORKS

Ho Bae1, Byunghan Lee2,3, Sunyoung Kwon2,4, Sungroh Yoon1,2,5

1Interdisciplinary Program in Bioinformatics, Seoul National University; 2Electrical and Computer Engineering, Seoul National University; 3Electronic and IT Media Engineering, Seoul National University of Science and Technology; 4Clova AI Research, NAVER Corp; 5ASRI and INMC, Seoul National University

Bae, Ho

Recent advances in next-generation sequencing technologies have facilitated the use of deoxyribonucleic acid (DNA) as a novel covert channel in steganography. Various methods exist in other domains to detect hidden messages in conventional covert channels. However, they have not been applied to DNA steganography. The current most common detection approaches, namely frequency analysis-based methods, often overlook important signals when directly applied to DNA steganography because those methods depend on the distribution of the number of sequence characters. To address this limitation, we propose a general sequence learning-based DNA steganalysis framework. The proposed approach learns the intrinsic distribution of coding and non-coding sequences and detects hidden messages by exploiting distribution variations after hiding these messages. Using deep recurrent neural networks (RNNs), our framework identifies the distribution variations by using the classification score to predict whether a sequence is a coding or non-coding sequence. We compare our proposed method to various existing methods and biological sequence analysis methods implemented on top of our framework. According to our experimental results, our approach delivers a robust detection performance compared to other tools.

33

LEARNING CONTEXTUAL HIERARCHICAL STRUCTURE OF MEDICAL CONCEPTS WITH POINCARÉ EMBEDDINGS TO CLARIFY PHENOTYPES

Brett K. Beaulieu-Jones, Isaac S. Kohane, Andrew L. Beam

Harvard Medical School

Beaulieu-Jones, Brett

Biomedical association studies are increasingly done using clinical concepts, and in particular diagnostic codes from clinical data repositories, as phenotypes. Clinical concepts can be represented in a meaningful vector space using word embedding models. These embeddings allow for comparison between clinical concepts or for straightforward input to machine learning models. Using traditional approaches, good representations require high dimensionality, making downstream tasks such as visualization more difficult. We applied Poincaré embeddings in a 2-dimensional hyperbolic space to a large-scale administrative claims database and show performance comparable to 100-dimensional embeddings in a Euclidean space. We then examine disease relationships under different disease contexts to better understand potential phenotypes.
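Why two hyperbolic dimensions can stand in for many Euclidean ones is easier to see from the distance function of the Poincaré ball model, d(u, v) = arcosh(1 + 2||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2))). The short sketch below evaluates this standard formula on hand-picked points; the function name and the example coordinates are illustrative assumptions and say nothing about how the embeddings in this work were trained.

```python
import math

def poincare_distance(u, v):
    """Distance between two points strictly inside the unit ball (Poincare model)."""
    def sq_norm(x):
        return sum(c * c for c in x)
    diff = [a - b for a, b in zip(u, v)]
    return math.acosh(1 + 2 * sq_norm(diff) / ((1 - sq_norm(u)) * (1 - sq_norm(v))))

# Two pairs of points with the same Euclidean midpoint (the origin).
near_center = poincare_distance((0.1, 0.0), (-0.1, 0.0))
near_edge = poincare_distance((0.9, 0.0), (-0.9, 0.0))
```

Distances grow rapidly toward the boundary of the disk, so a hierarchy can place general concepts near the origin and a large number of specific concepts near the edge while keeping them far apart.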

34

EXPLORING MICRORNA REGULATION OF CANCER WITH CONTEXT-AWARE DEEP CANCER CLASSIFIER

Blake Pyman, Alireza Sedghi, Shekoofeh Azizi, Kathrin Tyryshkin, Neil Renwick, Parvin Mousavi

Queen's University

Pyman, Blake

Background: MicroRNAs (miRNAs) are small, non-coding RNAs that regulate gene expression through post-transcriptional silencing. Differential expression observed in miRNAs, combined with advancements in deep learning (DL), has the potential to improve cancer classification by modelling non-linear miRNA-phenotype associations. We propose a novel miRNA-based deep cancer classifier (DCC) incorporating genomic and hierarchical tissue annotation, capable of accurately predicting the presence of cancer in a wide range of human tissues.

Methods: miRNA expression profiles were analyzed for 1746 neoplastic and 3871 normal samples, across 26 types of cancer involving six organ sub-structures and 68 cell types. miRNAs were ranked and filtered using a specificity score representing their information content in relation to neoplasticity, incorporating 3 levels of hierarchical biological annotation. A DL architecture composed of stacked autoencoders (AE) and a multi-layer perceptron (MLP) was trained to predict neoplasticity using 497 abundant and informative miRNAs. Additional DCCs were trained using expression of miRNA cistrons and sequence families, and combined as a diagnostic ensemble. Important miRNAs were identified using backpropagation, and analyzed in Cytoscape using iCTNet and BiNGO.

Results: Nested four-fold cross-validation was used to assess the performance of the DL model. The model achieved an accuracy, AUC/ROC, sensitivity, and specificity of 94.73%, 98.6%, 95.1%, and 94.3%, respectively.

Conclusion: Deep autoencoder networks are a powerful tool for modelling complex miRNA-phenotype associations in cancer. The proposed DCC improves classification accuracy by learning from the biological context of both samples and miRNAs, using anatomical and genomic annotation. Analyzing the deep structure of DCCs with backpropagation can also facilitate biological discovery, by performing gene ontology searches on the most highly significant features.

35

ESTIMATING CLASSIFICATION ACCURACY IN POSITIVE-UNLABELED LEARNING: CHARACTERIZATION AND CORRECTION STRATEGIES

Rashika Ramola, Shantanu Jain, Predrag Radivojac

Northeastern University

Ramola, Rashika

Accurately estimating the performance of machine learning classifiers is of fundamental importance in biomedical research, with potential societal consequences upon the deployment of best-performing tools in everyday life. Although classification has been extensively studied over the past decades, there remain understudied problems when the training data violate the main statistical assumptions relied upon for accurate learning and model characterization. This particularly holds true in the open world setting where observations of a phenomenon generally guarantee its presence but the absence of such evidence cannot be interpreted as the evidence of its absence. Learning from such data is often referred to as positive-unlabeled learning, a form of semi-supervised learning where all labeled data belong to one (say, positive) class. To improve the best practices in the field, we here study the quality of estimated performance in positive-unlabeled learning in the biomedical domain. We provide evidence that such estimates can be wildly inaccurate, depending on the fraction of positive examples in the unlabeled data and the fraction of negative examples mislabeled as positives in the labeled data. We then present correction methods for four such measures and demonstrate that the knowledge or accurate estimates of class priors in the unlabeled data and noise in the labeled data are sufficient for the recovery of true classification performance. We provide theoretical support as well as empirical evidence for the efficacy of the new performance estimation methods.
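The dependence on the class prior and on label noise can be made concrete with a small sketch. It assumes a simple mixture model in which the labeled set contains a fraction rho of mislabeled negatives and the unlabeled set contains a fraction alpha of positives. The symbols alpha and rho, the function name, and the restriction to two measures (true and false positive rate) are assumptions of this illustration and are not taken from the paper, whose corrections cover four measures.

```python
def correct_rates(tpr_obs, fpr_obs, alpha, rho):
    """Recover true TPR and FPR from rates measured on positive-unlabeled data.

    Assumed mixture model (illustrative, not the paper's notation):
      labeled set   = (1 - rho) true positives + rho true negatives
      unlabeled set = alpha true positives + (1 - alpha) true negatives
    which gives the linear system
      tpr_obs = (1 - rho) * tpr + rho * fpr
      fpr_obs = alpha * tpr + (1 - alpha) * fpr
    """
    det = 1.0 - alpha - rho  # determinant of the 2x2 system
    if abs(det) < 1e-12:
        raise ValueError("alpha + rho = 1: true rates are not identifiable")
    tpr = ((1.0 - alpha) * tpr_obs - rho * fpr_obs) / det
    fpr = ((1.0 - rho) * fpr_obs - alpha * tpr_obs) / det
    return tpr, fpr

# Simulate what a naive evaluation would report, then undo the distortion.
true_tpr, true_fpr = 0.90, 0.10
alpha, rho = 0.30, 0.05
tpr_obs = (1 - rho) * true_tpr + rho * true_fpr
fpr_obs = alpha * true_tpr + (1 - alpha) * true_fpr
recovered = correct_rates(tpr_obs, fpr_obs, alpha, rho)
```

In this toy setting the naive false positive rate is more than three times the true one, because every positive hidden in the unlabeled data that the classifier correctly flags is counted as an error.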

36

EXTRACTING ALLELIC READ COUNTS FROM 250,000 HUMAN SEQUENCING RUNS IN SEQUENCE READ ARCHIVE

Brian Tsui, Michelle Dow, Dylan Skola, Hannah Carter

Department of Medicine, University of California San Diego, 9500 Gilman Drive, San Diego, California 92093, USA

Tsui, Brian Y

The Sequence Read Archive (SRA) contains over one million publicly available sequencing runs from various studies using a variety of sequencing library strategies. These data inherently contain information about underlying genomic sequence variants, which we exploit to extract allelic read counts on an unprecedented scale. We reprocessed over 250,000 human sequencing runs (>1000 TB of raw sequence data) into a single unified dataset of allelic read counts for nearly 300,000 variants of biomedical relevance curated by NCBI dbSNP, where germline variants were detected in a median of 912 sequencing runs, and somatic variants were detected in a median of 4,876 sequencing runs, suggesting that this dataset facilitates identification of sequencing runs that harbor variants of interest. Allelic read counts obtained using a targeted alignment were very similar to read counts obtained from whole-genome alignment. Analyzing allelic read count data for matched DNA and RNA samples from tumors, we find that RNA-seq can also recover variants identified by Whole Exome Sequencing (WXS), suggesting that reprocessed allelic read counts can support variant detection across different library strategies in SRA. This study provides a rich database of known human variants across SRA samples that can support future meta-analyses of human sequence variation.

37

AUTOMATIC HUMAN-LIKE MINING AND CONSTRUCTING RELIABLE GENETIC ASSOCIATION DATABASE WITH DEEP REINFORCEMENT LEARNING

Haohan Wang1, Xiang Liu2, Yifeng Tao1, Wenting Ye1, Qiao Jin3, William W. Cohen4, Eric P. Xing5

1Carnegie Mellon University, 2Chinese University of Hong Kong, 3Tsinghua University, 4Google AI, 5Petuum Inc

Wang, Haohan

The increasing amount of scientific literature in biological and biomedical science research has created a challenge in the continuous and reliable curation of the latest knowledge discovered, and automatic biomedical text-mining has been one of the answers to this challenge. In this paper, we aim to further improve the reliability of biomedical text-mining by training the system to directly simulate human behaviors such as querying PubMed, selecting articles from queried results, and reading selected articles for knowledge. We combine the efficiency of biomedical text-mining, the flexibility of deep reinforcement learning, and the massive amount of knowledge collected in UMLS into an integrative artificially intelligent reader that can automatically identify the authentic articles and effectively acquire the knowledge conveyed in the articles. We construct a system whose current primary task is to build the genetic association database between genes and human complex traits. Our contributions in this paper are three-fold: 1) We propose to improve the reliability of text-mining by building a system that can directly simulate the behavior of a researcher, and we develop corresponding methods, such as Bi-directional LSTM for text mining and Deep Q-Network for organizing behaviors. 2) We demonstrate the effectiveness of our system with an example in constructing a genetic association database. 3) We release our implementation as a generic framework for researchers in the community to conveniently construct other databases.

38

PRECISION MEDICINE: IMPROVING HEALTH THROUGH HIGH-RESOLUTION ANALYSIS OF PERSONAL DATA

PROCEEDINGS PAPER WITH POSTER PRESENTATION

39

INFLUENCE OF TISSUE CONTEXT ON GENE PRIORITIZATION FOR PREDICTED TRANSCRIPTOME-WIDE ASSOCIATION STUDIES

Binglan Li1, Yogasudha Veturi1, Yuki Bradford1, Shefali S. Verma1, Anurag Verma1, Anastasia M. Lucas1, David W. Haas2, Marylyn D. Ritchie1

1University of Pennsylvania, 2Vanderbilt University

Ritchie, Marylyn

Transcriptome-wide association studies (TWAS) have recently gained great attention due to their ability to prioritize complex trait-associated genes and promote potential therapeutics development for complex human diseases. TWAS integrates genotypic data with expression quantitative trait loci (eQTLs) to predict genetically regulated gene expression components and associates predictions with a trait of interest. As such, TWAS can prioritize genes whose differential expressions contribute to the trait of interest and provide mechanistic explanation of complex trait(s). Tissue-specific eQTL information grants TWAS the ability to perform association analysis on tissues whose gene expression profiles are otherwise hard to obtain, such as liver and heart. However, as eQTLs are tissue context-dependent, whether and how the tissue-specificity of eQTLs influences TWAS gene prioritization has not been fully investigated. In this study, we addressed this question by adopting two distinct TWAS methods, PrediXcan and UTMOST, which assume single tissue and integrative tissue effects of eQTLs, respectively. Thirty-eight baseline laboratory traits in 4,360 antiretroviral treatment-naïve individuals from the AIDS Clinical Trials Group (ACTG) studies comprised the input dataset for TWAS. We performed TWAS in a tissue-specific manner and obtained a total of 430 significant gene-trait associations (q-value < 0.05) across multiple tissues. Single tissue-based analysis by PrediXcan contributed 116 of the 430 associations including 64 unique gene-trait pairs in 28 tissues. Integrative tissue-based analysis by UTMOST found the other 314 significant associations that include 50 unique gene-trait pairs across all 44 tissues. Both analyses were able to replicate some associations identified in past variant-based genome-wide association studies (GWAS), such as high-density lipoprotein (HDL) and CETP (PrediXcan, q-value = 3.2e-16). Both analyses also identified novel associations. Moreover, single tissue-based and integrative tissue-based analysis shared 11 of 103 unique gene-trait pairs, for example, PSRC1-low-density lipoprotein (PrediXcan's lowest q-value = 8.5e-06; UTMOST's lowest q-value = 1.8e-05). This study suggests that single tissue-based analysis may have performed better at discovering gene-trait associations when combining results from all tissues. Integrative tissue-based analysis was better at prioritizing genes in multiple tissues and in trait-related tissue. Additional exploration is needed to confirm this conclusion. Finally, although single tissue-based and integrative tissue-based analysis shared significant novel discoveries, tissue context-dependency of eQTLs impacted TWAS gene prioritization. This study provides preliminary data to support continued work on tissue context-dependency of eQTL studies and TWAS.

40

SINGLE CELL ANALYSIS – WHAT IS THE FUTURE?

PROCEEDINGS PAPER WITH POSTER PRESENTATION

41

SHALLOW SPARSELY-CONNECTED AUTOENCODERS FOR GENE SET PROJECTION

Maxwell P. Gold, Alexander LeNail, Ernest Fraenkel

Massachusetts Institute of Technology

Gold, Maxwell

When analyzing biological data, it can be helpful to consider gene sets, or predefined groups of biologically related genes. Methods exist for identifying gene sets that are differential between conditions, but large public datasets from consortium projects and single-cell RNA-Sequencing have opened the door for gene set analysis using more sophisticated machine learning techniques, such as autoencoders and variational autoencoders. We present shallow sparsely-connected autoencoders (SSCAs) and variational autoencoders (SSCVAs) as tools for projecting gene-level data onto gene sets. We tested these approaches on single-cell RNA-Sequencing data from blood cells and on RNA-Sequencing data from breast cancer patients. Both SSCA and SSCVA can recover known biological features from these datasets and the SSCVA method often outperforms SSCA (and six existing gene set scoring algorithms) on classification and prediction tasks.

42

WHEN BIOLOGY GETS PERSONAL: HIDDEN CHALLENGES OF PRIVACY AND ETHICS IN BIOLOGICAL BIG DATA

PROCEEDINGS PAPER WITH POSTER PRESENTATION

43

IMPLEMENTING A UNIVERSAL INFORMED CONSENT PROCESS FOR THE ALL OF US RESEARCH PROGRAM

Megan Doerr1, Shira Grayson1, Sarah Moore1, Christine Suver1, John Wilbanks1, Jennifer Wagner2

1Sage Bionetworks, 2Center for Translational Bioethics & Health Care Policy, Geisinger

Doerr, Megan

The United States' All of Us Research Program is a longitudinal research initiative with ambitious national recruitment goals, including of populations traditionally underrepresented in biomedical research, many of whom have high geographic mobility. The program has a distributed infrastructure, with key programmatic resources spread across the US. Given its planned duration and geographic reach both in terms of recruitment and programmatic resources, a diversity of state and territory laws might apply to the program over time as well as to the determination of participants' rights. Here we present a listing and discussion of state and territory guidance and regulation of specific relevance to the program, and our approach to their incorporation within the program's informed consent processes.

44

GENERAL

POSTER PRESENTATIONS

45

A CONVOLUTIONAL NEURAL NET PREDICTS BINDING PROPERTIES OF AN ANTIBODY LIBRARY

Rishi Bedi, Rachel Hovde, Jacob Glanville

Distributed Bio

Hovde, Rachel

Research by Glanville et al. described a method that enabled TCRs of the adaptive immune system to be clustered into specificity groups and allowed de novo design of TCRs with a particular specificity. In this study, we apply deep learning methods to perform characterization and engineering of antibodies. To generate enough data to address this question with machine learning methods, we created a computationally-optimized antibody library capable of generating thousands of high affinity hits against any antigen. By robotically panning 11 antigens in replicate against the library, we generated, sequenced, and validated a dataset of over 55,000 unique high affinity binders. To characterize the functional properties of this library, we train a convolutional neural network to predict the binding specificity of each clone. Our model outperforms alternative approaches and successfully predicts binding specificity in held-out, increasingly dissimilar test sets. Using the trained model to perform optimization on the input sequence, we generate characteristic class examples, as well as "fooling sequences" that represent the boundaries between pairs of binding specificities. We use the real-valued output of the convolutional and linear layers of the network as an embedding and demonstrate physically-meaningful clustering. These techniques let us assess the contribution of particular motifs to the lock-and-key interaction with the target antigen, and enable virtual "epitope binning" to distinguish antibodies in our library that bind similar epitopes. This enables future work in virtual mutagenesis, where we leverage these insights to generate antibodies that exhibit desirable binding properties.

46

CNVAR: A SOFTWARE TOOL FOR GENOTYPING CYP2D6 USING SHORT READ NEXT GENERATION SEQUENCING TECHNOLOGY

John Logan Black III MD1, Hugues Sicotte PhD1, Sandra E. Peterson1, Kimberley J. Harris1, Liewei Wang MD PhD1, Steven Scherer PhD2, Eric Boerwinkle PhD2, Richard A. Gibbs PhD2, Suzette J. Bielinski PhD1, Richard Weinshilboum MD1

1Mayo Clinic, 2Baylor College of Medicine

Black, John

Introduction: CYP2D6 is an important pharmacogene involved in the metabolism of many medications. CYP2D6 is known to have numerous copy number variations (CNV) including gene duplications/multiplications, gene deletion, and hybrid genes involving the pseudogene, CYP2D7. Software that enables the genotyping of CYP2D6 from short read next generation sequencing (NGS) is urgently needed to cost-effectively and accurately determine clinical CYP2D6 phenotypes.

Methods: Modelling of expected ratios for specific gene regions with and without CNV was done based upon the known configurations of the CYP2D locus. This data was used to generate the CNVAR software, which analyzes VCF and BAM files to determine variant allelic ratios and read depth for all exons and the promoters of the CYP2D6 and CYP2D7 genes after NGS. The software uses statistical methods to detect the CNVs and employs multiple quality metrics to determine the best fit for possible genotype solutions. It also detects named haplotypes plus any novel variants. CNVAR was previously validated against 500 samples with known genotypes determined by targeted genotyping and Sanger sequencing. Samples sequenced as part of the Mayo Clinic Center for Individualized Medicine's RIGHT 10K Study for Pharmacogenomics are now being analyzed. Sequencing was done at Baylor College of Medicine's Human Genome Sequencing Center using the reagent called PGx-seq, and analysis of the CYP2D6 sequence results is being performed in the Personalized Genomics Laboratory at Mayo Clinic.

Results: 6921 samples have been analyzed using the CNVAR software to derive CYP2D6 diplotypes. 968 (14%) samples had quality flags indicating unexpected allele frequencies or CNV ratios, detection of a novel variant, or several diplotype solutions that fit the findings equally well. 102 (1.5%) samples were determined to have novel variants or novel hybrid genes. All of the remaining samples, except 55 (0.79%), could be resolved by visual inspection of CNVAR outputs. These 55 remaining samples were referred for additional Sanger sequencing to determine the actual diplotype and quantitative rtPCR to determine actual copy number.

Conclusions: CNVAR is a software tool which can detect CYP2D6 diplotypes, CNVs and hybrid genes from NGS short read technology. The software identifies samples that cannot be genotyped with certainty so that additional evaluation can be performed to derive the actual genotype. Novel variants and hybrid alleles were also identified so that variant curation and classification could be done.

This work was supported by Mayo Clinic Center for Individualized Medicine and the Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, National Institutes of Health grants U19 GM61388 (The Pharmacogenomics Research Network), R01 GM28157, U01 HG005137, R01 GM125633, R01 AG034676 (The Rochester Epidemiology Project), and U01 HG06379 and U01 HG06379 Supplement (The Electronic Medical Record and Genomics (eMERGE) Network).

47

NETWORK ANALYSIS OF DISTINCT COHORTS ALLOWS FOR THE COMPARISON OF KEY BIOLOGICAL FUNCTIONS RELATED TO TB PATHOGENESIS

Carly Bobak, Meghan E. Muse, Alexander J. Titus, Brock C. Christensen, A. James O'Malley, Jane E. Hill

Dartmouth College

Bobak, Carly

Challenges with reproducibility of microarray datasets can limit the ability to analyze and interpret integrated gene expression datasets. One approach to tackle reproducibility across microarray datasets builds a multi-cohort framework using publicly available data to better mirror diverse populations seen in clinics. An alternative way of increasing the reproducibility of results is emphasizing underlying pathway or network level analyses. While differential expression of genes may vary between datasets and data analysis techniques, the biological processes underlying gene expression are more robust. The results from these analyses can drive hypotheses regarding the biological mechanisms behind diseases. We propose using a multi-cohort design and a pathway-level gene expression analysis to identify key biological processes in active Tuberculosis (TB) disease. A multi-cohort approach is particularly important when analyzing TB because phenotypic presentation of the disease differs among patients, especially those who are co-infected with human immunodeficiency virus (HIV), or among children. As such, these subgroups are often excluded from studies examining human gene expression array data. However, in 2016, 10% of incident TB cases were people living with HIV, and 10% were children, and despite the difficulty of studying these populations alongside adults, they make up a substantial proportion of both the currently TB-infected and the overall TB-susceptible population. To visualize differences across cohorts containing these TB subgroups, we use an approach called an Enrichment Map, which allows us to represent each distinct dataset in one network. We selected three representative publicly available datasets (n=1148) and used Differential Expression and Gene Set Enrichment Analysis. Gene sets which were significantly enriched became the nodes of the network, with edges representative of the overlap between these gene sets. The results of these combined analyses were used as an input to Enrichment Map, to cluster and annotate important biological functions. The Enrichment Map network identified many processes expected based on current TB knowledge, such as interferon-gamma activity (6 gene sets). Some other processes, which represent potentially novel insights into the disease, were also identified. We examine one cluster of nodes related to DNA methylation (6 gene sets) in depth. The DNA methylation gene set within this cluster was strongly enriched in the dataset with no HIV+ patients (FDR=0.004) and appears to be enriched, although not significantly so, in the two datasets including HIV+ patients (FDR=0.518, 0.879). Further unsupervised analysis of DNA methylation genes within these sets reveals clear clustering of active TB patients from those with latent TB infection, irrespective of HIV status. Thus, we theorize that while conventional methods would not implicate DNA methylation as playing a role in active TB infection, by comparing enrichments across datasets at the network level we can observe patterns in gene expression with a finer degree of granularity.
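The network construction described above, with enriched gene sets as nodes and gene overlap as edges, can be sketched in a few lines. This is only an illustration under assumptions: the overlap coefficient and the 0.5 cutoff are choices made here rather than settings reported in the abstract, and the gene set names and members are placeholders, not real annotations.

```python
from itertools import combinations

def overlap_coefficient(a, b):
    """Shared genes divided by the size of the smaller gene set."""
    return len(a & b) / min(len(a), len(b))

def enrichment_edges(gene_sets, threshold=0.5):
    """Return (set_1, set_2, score) for every pair whose overlap meets the threshold."""
    edges = []
    for n1, n2 in combinations(sorted(gene_sets), 2):
        score = overlap_coefficient(gene_sets[n1], gene_sets[n2])
        if score >= threshold:
            edges.append((n1, n2, score))
    return edges

# Placeholder gene sets standing in for significantly enriched sets.
enriched = {
    "set_A": {"g1", "g2", "g3", "g4"},
    "set_B": {"g3", "g4", "g5"},
    "set_C": {"g6", "g7"},
}
edges = enrichment_edges(enriched)
```

Here only set_A and set_B are linked (they share 2 of the 3 genes in the smaller set), while set_C stays an isolated node; clusters of linked nodes are what would then be annotated as a shared biological function.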

48

VARIATION IN OPIOID PRESCRIBING PATTERNS IN SURGICAL POPULATIONS

Soline M. Boussard1, Marylyn D. Ritchie2, Michelle Whirl-Carrillo3, Tina Hernandez-Boussard3, Teri E. Klein3

1Castilleja School, 2University of Pennsylvania, 3Stanford University

Boussard, Soline

Introduction: In clinical settings, patients' response to opioids can vary by as much as 40-fold. Common opioids require metabolism by liver enzyme CYP-2D6 and considerable variation exists in the amount of CYP-2D6 produced by individuals. Therefore, pharmacogenomics may shed light on how to address different responses by finding the most effective medication and dosage for each patient. As a first step at identifying opportunities for personalized pain management, we analyzed postoperative pain and opioid prescribing patterns across four common surgeries known for high postoperative pain.

Methods: We used EHRs to identify patients undergoing 4 surgeries (total knee replacement (TKA), thoracotomy, distal radius fracture, and mastectomy). The main outcomes were discharge pain medications and postoperative pain scores. This research was possible through the use of structured EHR data and the mapping of medications to ontologies. Patients were identified using procedural codes; pain scores (ranging from 0 to 10, with 10 being the most severe) were identified from flowsheets within the EHR, and discharge medications were mapped to RxNorm. Data were aggregated to the patient level. Pain scores were averaged across different time points. RStudio was used for statistical computing and graphics. Chi-square, t-tests and analysis of variance were used for statistical testing.

Results: A total of 63,500 patients were included. The mean age was 61.31 (SD: 14.3), 65.3% were female, 62.1% were white and 13.3% were of Hispanic/Latino ethnicity. On average, pain scores were lower at 30 days follow-up compared to pre-operative, and patients received 4.1 different types of opioids during their inpatient stay, with a majority of patients switching between hydrocodone and oxycodone. Total knee replacement represented 61.6%, followed by 20.0% thoracotomy, 16.4% mastectomy and 2.0% distal radius fracture. At discharge, the majority of patients received oxycodone (69.15%) and hydrocodone (15.29%). In mastectomy, 47.89% received hydrocodone and 44.05% received oxycodone. For TKA, 78.20% received oxycodone, followed by 8.90% receiving tramadol. Follow-up pain was similar across the 4 surgeries; however, follow-up pain differed by opioid received, with patients on oxymorphone having the highest follow-up pain (6.24) and patients on propoxyphene having the lowest pain (1.29, p<.0001).

Discussion: In this study that examines post-operative outcomes and prescriptions in a real-world setting, opioid prescribing patterns varied significantly across surgery type. Our data suggest codeine was associated with lower follow-up pain in TKA compared to other opioids. This data from real-world evidence suggests that we can use such methodology to identify a cohort of patients that may be targeted for genotyping for personalized medicine. Targeting patients with poor pain relief from opioids that require CYP-2D6 for activation could identify patients with gene variations that affect opioid metabolism. Future studies could look at which variants could affect patients' metabolism of codeine.

49

REGIONAL HETEROGENEITY IN GENE EXPRESSION, REGULATION AND COHERENCE IN HIPPOCAMPUS AND DORSOLATERAL PREFRONTAL CORTEX ACROSS DEVELOPMENT AND SCHIZOPHRENIA

Leonardo Collado-Torres1, Emily E. Burke1, Amy Peterson1, Joo Heon Shin1, Richard E. Straub1, Anandita Rajpurohit1, Stephen A. Semick1, William S. Ulrich1, BrainSeq Consortium, Cristian Valencia1, Ran Tao1, Amy Deep-Soboslay1, Thomas M. Hyde1, Joel E. Kleinman1, Daniel R. Weinberger1,+, Andrew E. Jaffe1,+

1Lieber Institute for Brain Development, Baltimore, MD, USA

Background: We previously identified widespread genetic, developmental, and schizophrenia-associated (SCZD) changes in polyadenylated RNAs in the dorsolateral prefrontal cortex (DLPFC), but the landscape of hippocampal (HIPPO) expression using RNA sequencing is less well explored.

Methods: We performed RNA-seq using RiboZero on 900 RNA-seq samples across 551 individuals (SCZD N=286) in DLPFC (N=453) and HIPPO (N=447). We quantified expression of multiple feature summarizations of the Gencode v25 reference transcriptome, including genes, exons and splice junctions. Within and across brain regions, we modeled age-related changes in controls using linear splines, integrated genetic data to perform expression quantitative trait loci (eQTL) analyses, and performed differential expression analyses controlling for observed and latent confounders.

Results: We identified widespread developmental regulation between the DLPFC and HIPPO over aging, with 10,839 genes differentially expressed (Bonferroni<0.01) and replicating in BrainSpan (n=79 tissue samples, DLPFC=40, HIPPO=39). Of these genes, 5,982 (55%) contain differentially expressed exons and splice junctions that replicated in BrainSpan. By extending quality surrogate variable analysis (qSVA) to multiple brain regions, we identified 48 and 245 differentially expressed genes (DEG) by SCZD diagnosis (FDR<5%) in HIPPO and DLPFC, respectively, with surprisingly minimal overlap in DEG between the two brain regions. We further identified 205,618 brain region-dependent eQTLs (FDR<1%) and found that 124 GWAS risk loci contain eQTLs in at least one of the regions. We also identify potential molecular correlates of in vivo evidence of altered prefrontal-hippocampal functional coherence in schizophrenia. Through our eQTL browser resource http://eqtl.brainseq.org/ we have made all eQTL sets available for further exploration.

Discussion: We show extensive regional specificity of developmental and genetic regulation, and SCZD-associated expression differences between HIPPO and DLPFC. These results underscore the complexity and regional heterogeneity of the transcriptional correlates of schizophrenia, and suggest future schizophrenia therapeutics may need to target molecular pathologies localized to specific brain regions.

50

FULL-LENGTH SEQUENCE ASSEMBLY AND CHARACTERIZATION OF HIGHLY PURIFIED CIRCRNA ISOFORMS

Supriyo De, Amaresh C. Panda, Myriam Gorospe

Laboratory of Genetics and Genomics, National Institute on Aging IRP, NIH

De, Supriyo

Circular RNAs are a large heterogeneous class of highly stable noncoding RNAs, but they are poorly characterized. Many software tools exist for identifying circular RNAs by finding their circularizing junctions, but very little is known about the sequence of their full length or their isoforms/alternately spliced forms. The assembly and characterization of isoforms is also limited by the lack of methodologies to extract highly pure circRNAs. While exoribonuclease (RNase R) treatment is widely used to degrade linear RNAs and enrich circRNAs from total RNA, it does not efficiently eliminate all linear RNAs. This limitation complicates the assembly process to get full-length circRNAs. Here we describe a novel method for isolating highly pure circRNA populations involving RNase R treatment followed by Polyadenylation and poly(A)+ RNA Depletion (RPAD), which removes linear RNA to near completion. Once the RNA population is highly enriched, sequence assembly algorithms such as Cufflinks can be used to identify the body of the circRNA, while the circularizing/back-spliced junctions can be found using many different software tools such as Circexplorer, CIRI, etc. High-throughput sequencing of RNA prepared using RPAD from human cervical carcinoma HeLa cells and mouse C2C12 myoblasts, followed by this novel analysis pipeline, led to identification of many circRNA isoforms with an identical back-splice sequence (circularizing junction) but with different body sizes and sequences. As one of the main functions of circRNAs is sponging regulatory RNAs and proteins, full-length characterization of circRNA isoforms will be critical for enabling the functional characterization of circRNAs.

Acknowledgement: This research was supported by the Intramural Research Program of the National Institute on Aging, NIH.

51

A COMPREHENSIVE REVIEW AND ASSESSMENT OF EXISTING PATHWAY ANALYSIS APPROACHES

Tuan-Minh Nguyen1, Adib Shafi1, Tin Nguyen2, Sorin Draghici1

1Dept of Computer Science, Wayne State University; 2Dept of Computer Science, University of Nevada

Draghici, Sorin

In many high-throughput experiments, it is crucial to understand the biological mechanisms of genes and their products from expression data. Pathway analysis is a crucial step in any phenotype comparison because it allows us to gain insights into the underlying biological phenomena. Because of the importance of this type of analysis, more than 35 pathway analysis methods have been proposed so far. These fall into two main categories: non-topology-based (non-TB) and topology-based (TB) approaches. Non-TB methods consider pathways as simple gene sets and ignore the position and role of the genes, as well as the direction and type of signals described by the pathway, while TB methods include this additional information in the analysis. Although there are some review papers discussing this topic, no study has systematically assessed the performance of the methods using a large, unbiased collection of available data sets. Furthermore, the majority of pathway analysis approaches rely on the assumption that p-values are uniformly distributed under the null hypothesis, which is not always true. None of the existing reviews takes the performance of the studied methods under the null into account in its comparisons. To provide an accurate and objective assessment so that researchers and biologists can choose a method suitable for their purpose, we provide an extensive analysis of 11 widely used pathway analysis methods from both non-TB and TB groups using 2601 samples from 75 human disease data sets, and of 8 methods using 121 samples from 11 knock-out mouse data sets. In addition, we investigate the extent to which each method is biased under the null hypothesis. Overall, the results show that TB methods perform better than non-TB methods since they take into consideration the topology information and signal propagation. Via permutation and bootstrap, we reach another critical conclusion: most if not all of the listed approaches are biased and produce very skewed results under the null.
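The uniformity assumption discussed in this abstract can be checked empirically. The following is a minimal sketch, not the authors' procedure: it assumes a list of p-values that some pathway analysis method returned on null data (for example, after permuting sample labels) and measures their Kolmogorov-Smirnov distance from the uniform distribution on [0, 1]. The function name and the example values are hypothetical.

```python
def ks_distance_from_uniform(p_values):
    """Return the Kolmogorov-Smirnov distance between the empirical
    distribution of p_values and Uniform(0, 1)."""
    ordered = sorted(p_values)
    n = len(ordered)
    distance = 0.0
    for i, p in enumerate(ordered, start=1):
        # The empirical CDF jumps from (i - 1) / n to i / n at p,
        # while the Uniform(0, 1) CDF evaluated at p is p itself.
        distance = max(distance, i / n - p, p - (i - 1) / n)
    return distance


# Hypothetical null p-values: evenly spread versus piled up near zero.
well_behaved = [0.1, 0.3, 0.5, 0.7, 0.9]
skewed = [0.01, 0.02, 0.03, 0.04, 0.05]

print(ks_distance_from_uniform(well_behaved))
print(ks_distance_from_uniform(skewed))
```

A distance near zero is consistent with uniform null p-values, while a large distance indicates the kind of skew the abstract describes.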

52

A NEW PHYLOGENETIC SAMPLING METHOD USING GENERALIZED-ENSEMBLE ALGORITHM

Tetsu Furukawa, Hiroyuki Toh

Department of Biomedical Chemistry, School of Science and Technology, Kwansei-Gakuin University, Sanda, Hyogo, Japan 669-1337

Furukawa, Tetsu

Bayesian inference has been widely utilized for evolutionary analysis, including phylogenetic tree reconstruction, where Monte Carlo sampling such as Markov chain Monte Carlo (MCMC) or Metropolis-coupled MCMC (MC3) generates a posterior distribution. Monte Carlo sampling is also utilized for molecular simulation of biopolymers like proteins and DNA. One of the representative methods is the replica exchange algorithm, which is equivalent to MC3 in molecular phylogeny. Besides the replica exchange algorithm, several different sampling methods have been developed for molecular simulation, which are collectively termed generalized-ensemble algorithms. In this study, we examined the possibility of applying other generalized-ensemble algorithms to tree reconstruction, in order to develop a more efficient sampling method for molecular phylogeny. The program implementing the other generalized-ensemble algorithm was developed based on the source code of BEAST version 2.5.1. To evaluate the performance, artificial alignments were generated so that the posterior distributions of the corresponding trees are difficult to reproduce by sampling, i.e., distributions with multiple peaks. We applied our program and existing tools to the artificial data. Then, we compared results such as the time required for convergence and the degree to which the posterior distributions were reproduced. The benefits and pitfalls of our program will be discussed based on the comparison.

53

CONVERGENT MECHANISMS PERTURBED BY SCATTERED SNPS SUSCEPTIBLE TO ALZHEIMER'S DISEASE

Jiali Han1,2, Edwin Baldwin1, Jin Zhou3, Fei Yin4,5, Haiquan Li1,6

1University of Arizona, Department of Biosystems Engineering; 2University of Arizona, Department of Systems and Industrial Engineering; 3University of Arizona, Department of Public Health; 4University of Arizona, Department of Pharmacology; 5University of Arizona Center for Innovation in Brain Science; 6University of Arizona Center for Biomedical Informatics and Biostatistics

Han, Jiali

Alzheimer's Disease (AD) is the most prevalent neurodegenerative disorder, affecting approximately 50 million people worldwide. Genome-wide association studies (GWAS) have identified hundreds of single nucleotide polymorphisms (SNPs) associated with AD, while the effect size of each individual SNP is largely modest. The molecular mechanisms underlying these associations are yet to be understood. Our recent genomic analysis focused on unveiling common downstream biological effectors of intergenic SNPs associated with AD, aiming to understand the interactive and synergistic effects that genetic variants across non-coding and intergenic regions play in the pathogenesis of AD. In this study, data from GWAS and expression quantitative trait locus (eQTL) studies by the GTEx project are integrated, and downstream functional similarity between two SNPs is imputed using an enhanced multiscale information theoretic distance model [1]. The significance levels are determined through extensive permutations of the eQTL-derived multiscale network for mRNA overlap, functional similarity and shared biological processes [2]. Convergent molecular mechanisms based on gene ontology are prioritized at FDR<0.05.

The prioritized mechanistic network for AD renders several functional modules perturbed by either cis-eQTL or trans-eQTL elements, corresponding to multiple common mechanisms downstream of distinct eQTLs, some of them cross-chromosome. For instance, SNPs on chromosomes six and one are both associated with antigen processing and presentation via regulating multiple human leukocyte antigen genes (e.g., HLA-DRB1 and HLA-DQA1) and cytokine genes, suggesting the genetic involvement of the immune system and neuroinflammation in the pathogenesis of AD. SNPs on chromosome 17 and chromosome 19 co-regulate genes involved in synaptic transmission, which is essential for neuronal communication and whose dysfunction in AD is known to lead to memory loss. Other than cross-chromosome SNPs, independent intergenic SNPs on the same chromosome also provide insights into AD genetic risks. A pair of SNPs on chromosome 17 is prioritized by our method through their convergent association with the MAPT gene, which encodes tau protein, regulates axon extension, and is known as a risk factor for a variety of neurodegenerative disorders including not only AD but also frontotemporal dementia and Parkinson disease. Another pair of SNPs on chromosome 19 is prioritized by their common association with the ABCA7 gene, which regulates lipid metabolism across cellular membranes and is suggested to be a susceptibility locus for late-onset AD.

This study suggests a new strategy connecting scattered AD-susceptible genetic variants with risk genes and convergent downstream mechanisms implicated in AD pathogenesis. The results will help to understand how genetic variants and underlying functional modules work interactively and systematically toward AD onset and could thus identify genetics-specific molecular targets and inspire new personalized therapeutic strategies.

[1] Li, H., et al. npj Genomic Medicine 1:16006, 2016.
[2] Han, J., et al. PSB, 2018, pp. 524-535.

54

IDENTIFICATION AND EVALUATION OF CO-EXPRESSION GENE NETWORKS FOR PACLITAXEL-INDUCED PERIPHERAL NEUROPATHY IN BREAST CANCER SURVIVORS

Kord M. Kober1, Jon D. Levine2, Judy Mastick1, Bruce Cooper1, Steven Paul1, Christine Miaskowski1

1UCSF School of Nursing, 2UCSF School of Medicine

Kober, Kord

Chronic chemotherapy-induced peripheral neuropathy (CIPN) is the most common and severe adverse drug reaction associated with neurotoxic chemotherapy (CTX), with prevalence rates that range from 30% to 70% in cancer survivors. No pharmacologic interventions are available to prevent CIPN. Lack of knowledge of the fundamental mechanisms that underlie CIPN thwarts our efforts to develop interventions to prevent or treat it. Increased knowledge of CIPN's molecular mechanisms could identify therapeutic targets for this condition. Findings from animal studies suggest that a number of diverse mechanisms are involved in the development of chronic paclitaxel-induced peripheral neuropathy (PIPN), including damage to DRG cell bodies; microtubule-associated toxicity; inflammation; distal axonal injury; damage to the peripheral vasculature; modulation of ion channels; and mitochondrial dysfunction. Taxol (paclitaxel) is a common CTX drug that is associated with the development of CIPN, and PIPN is the dose-limiting toxicity of this drug. The purpose of this pilot study was to evaluate coordinated variation in gene expression in RNA extracted from the peripheral blood of breast cancer survivors and, from the resulting modules, to identify co-expressed genes that are associated with chronic PIPN. Gene expression in peripheral blood was assayed using RNA-seq in a sample of breast cancer (BC) survivors who did (n=25) and did not (n=25) develop PIPN. BC survivors with PIPN were significantly older; more likely to be unemployed; reported lower alcohol use; had a higher BMI and a poorer functional status; and had a higher number of lower extremity sites with loss of light touch, cold, and pain sensations, and higher vibration thresholds. No between-group differences were found in the cumulative dose of paclitaxel received or in the percentage of patients who had a dose reduction or delay due to PIPN. Co-expression network analysis was performed to identify modules of genes with highly correlated expression using the top 5000 most variant genes. Thirteen color-coded modules were detected, ranging in size from 36 to 1653 genes. The eigengene of the "black" module (n=1653 genes) was significantly correlated with the CIPN phenotype (Pearson R2=0.224, p=0.02). GO enrichment was found in inflammation-related terms (e.g., C-C chemokine receptor activity, Chemokine-mediated signaling pathway, T cell co-stimulation). Functional protein association network analysis identified an enrichment of protein-protein interactions (p<0.0002), including highly connected genes that have previously been identified as related to CIPN (i.e., G protein-coupled receptor 55, GPR55, and C-X-C Motif Chemokine Receptor 5, CXCR5). To our knowledge, this is the first study to apply systems biology approaches using circulating blood RNA-seq data in a sample of breast cancer survivors with and without chronic PIPN. We revealed networks and candidate genes associated with chronic PIPN related to inflammation, and suggest genes for validation and as potential therapeutic targets.

55

VARIFI - WEB-BASED AUTOMATIC VARIANT IDENTIFICATION, FILTERING AND ANNOTATION OF AMPLICON SEQUENCING DATA

Milica Krunic1, Peter Venhuizen2, Leonhard Müllauer3, Bettina Kaserer3, Arndt von Haeseler1,4

1Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna, Medical University of Vienna, Dr. Bohrgasse 9, 1030 Vienna, Austria; 2Department of Applied Genetics and Cell Biology, University of Natural Resources and Life Sciences, Muthgasse 18, 1190 Vienna, Austria; 3Institute of Pathology, Medical University Vienna, Währinger Gürtel 18-20, 1090 Vienna, Austria; 4Bioinformatics and Computational Biology, Faculty of Computer Science, University of Vienna, Vienna, Austria

Krunic, Milica

Fast and affordable benchtop sequencers are becoming more important in improving personalized medical treatment. Still, distinguishing genetic variants between healthy and diseased individuals from sequencing errors remains a challenge. Here we present VARIFI, a pipeline for finding reliable genetic variants (SNPs and INDELs). We optimized parameters in VARIFI by analyzing more than 170 amplicon-sequenced cancer samples produced on the Personal Genome Machine (PGM). In contrast to existing pipelines, VARIFI combines different analysis methods and, based on their concordance, assigns a confidence score to every identified variant. Furthermore, VARIFI applies variant filters for biases associated with the sequencing technologies (e.g., incorrectly called homopolymer-associated indels with Ion Torrent). VARIFI automatically extracts variant information from publicly available databases and incorporates methods for variant effect prediction. VARIFI requires only little computational experience and no in-house compute power since the analyses are done on our server. VARIFI is a web-based tool available at varifi.cibiv.univie.ac.at.

56

STATISTICAL INFERENCE RELIEF (STIR) FEATURE SELECTION

Trang T. Le1, Ryan J. Urbanowicz1, Jason H. Moore1, Brett A. McKinney2

1Institute of Biomedical Informatics, Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA; 2Tandy School of Computer Science, Department of Mathematics, University of Tulsa, Tulsa, OK

Le, Trang

Motivation: Identifying relevant features in high-dimensional data can be challenging when their effect on an outcome may be obscured by a complex interaction architecture. Using nearest neighbors, Relief-based algorithms account for statistical interactions when selecting features. However, Relief-based estimators are non-parametric in the statistical sense that they do not have a parameterized model with an underlying probability distribution for the estimator, making it difficult to determine the statistical significance of Relief-based attribute estimates. Thus, a statistical inferential formalism is needed to avoid imposing arbitrary thresholds to select the most important features.

Method: We reconceptualize the Relief-based feature selection algorithm to create a new family of STatistical Inference Relief (STIR) estimators that retains the ability to identify interactions while incorporating sample variance of the nearest neighbor distances into the attribute importance estimation. This variance permits the calculation of statistical significance of features and adjustment for multiple testing of Relief-based scores. Specifically, we develop a pseudo t-test version of Relief-based algorithms for case-control data.

Results: We demonstrate the statistical power and control of type I error of the STIR family of feature selection methods on a panel of simulated data that exhibits properties reflected in real gene expression data, including main effects and network interaction effects. We show that the statistical performance using STIR p-values is the same as using permutation p-values, but STIR is much more computationally efficient. We compare the performance of STIR when the adaptive radius method is used as the nearest neighbor constructor with STIR when the fixed-k nearest neighbor constructor is used. Applying STIR to real RNA-Seq data from a study of major depressive disorder, we found that the 32 significant STIR genes include all 8 significant genes from a standard t-test. STIR genes outside of the intersection with the t-test may be good candidates for interaction effects.

Conclusion: STIR is the first method to use a theoretical distribution to calculate the statistical significance of Relief attribute scores without the computational expense of permutation. The agreement with permutation p-values validates the STIR pseudo t-test and means one can use it instead of costly permutation testing. The STIR formalism generalizes to all Relief-based neighbor-finding algorithms, including MultiSURF. k=m/6 (with m the number of samples) offers a better default than the pervasive use of k=10, which is an arbitrary choice in the early literature. Extensions of STIR will involve multi-class data, quantitative trait data (regression) and correction for covariates. Similarly, we envision regression-STIR to follow a linear model formalism. Future studies will apply STIR to GWAS as well as eQTL and other high-dimensional data to identify interaction effects.
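The idea of a pseudo t-test on nearest-neighbor differences can be illustrated with a short sketch. This is an assumption-laden illustration rather than the STIR implementation: it assumes that, for a single attribute, the absolute differences to nearest misses (neighbors from the other class) and to nearest hits (neighbors from the same class) have already been collected, and it applies the standard pooled-variance two-sample t statistic to them. The function name and the example numbers are hypothetical.

```python
import math


def pooled_t_statistic(miss_diffs, hit_diffs):
    """Two-sample t statistic with pooled variance comparing the mean
    attribute difference to misses against the mean difference to hits."""
    n_miss, n_hit = len(miss_diffs), len(hit_diffs)
    mean_miss = sum(miss_diffs) / n_miss
    mean_hit = sum(hit_diffs) / n_hit
    # Unbiased sample variances of each group of differences.
    var_miss = sum((d - mean_miss) ** 2 for d in miss_diffs) / (n_miss - 1)
    var_hit = sum((d - mean_hit) ** 2 for d in hit_diffs) / (n_hit - 1)
    # Pooled standard deviation across both groups.
    pooled_sd = math.sqrt(
        ((n_miss - 1) * var_miss + (n_hit - 1) * var_hit) / (n_miss + n_hit - 2)
    )
    return (mean_miss - mean_hit) / (pooled_sd * math.sqrt(1 / n_miss + 1 / n_hit))


# Hypothetical differences for one attribute: large across classes,
# small within a class, so the attribute looks informative.
t_value = pooled_t_statistic([0.8, 0.9, 1.0], [0.1, 0.2, 0.3])
print(round(t_value, 3))
```

A large positive statistic means the attribute separates classes more than it varies within them; converting the statistic to a p-value would additionally require a t distribution, which is left out of this sketch.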

57

DEEP LEARNING-BASED LONGITUDINAL HETEROGENEOUS DATA INTEGRATION FRAMEWORK FOR AD-RELEVANT FEATURE EXTRACTION

Garam Lee1, Kwangsik Nho2, Byungkon Kang1, Kyung-Ah Sohn1, Dokyoon Kim3

1Ajou University, 2Indiana University School of Medicine, 3Geisinger

Kim, Dokyoon

Alzheimer's disease (AD) is a progressive neurodegenerative condition marked by a decline in cognitive functions, with no validated disease-modifying treatment. For timely treatment, it is critical to detect AD at an early stage, before clinical manifestation. Mild cognitive impairment (MCI) is an intermediate stage between cognitively normal older adults and AD. To predict conversion from MCI to probable AD, we applied a deep learning approach, a multimodal recurrent neural network. We developed an integrative framework that combines not only cross-sectional neuroimaging biomarkers at baseline but also longitudinal cerebrospinal fluid (CSF) and cognitive performance biomarkers obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. The proposed framework integrated longitudinal multi-domain data with missing values. The Python package LIFAD (Deep learning-based Longitudinal heterogeneous data Integration Framework for AD-relevant feature extraction) provides a pre-constructed deep learning architecture for a classification task. Our results showed that 1) our prediction model for MCI conversion to AD yielded up to 75% accuracy (area under the curve (AUC) = 0.83) when using only a single modality of data separately; and 2) our prediction model achieved the best performance, with 80% accuracy (AUC = 0.86), when incorporating longitudinal multi-domain data. A multi-modal deep learning approach has potential to identify persons at risk of developing AD who might benefit most from a clinical trial, or to serve as a stratification approach within clinical trials.

58

MICROBIOME ANALYSIS OF UNEXPLAINED CASES OF PNEUMONIA IN SOUTH KOREA

Sooyeon Lim, Jae Kyung Lee, Ji Yun Noh, Woo Joo Kim

Department of Internal Medicine, Guro Hospital, Korea University

Lim, Sooyeon

Nasal swab samples were obtained from patients with symptoms of pneumonia through the tertiary hospital-based influenza surveillance system in South Korea during 2011-2017. Although the symptoms were suspected to be due to viral pneumonia, the collected samples tested negative, using the respiratory virus panel, for 16 common respiratory pathogens as well as the following five viruses: Enterovirus D68, WU polyomavirus, KI polyomavirus, Parechovirus types 1, 3, and 6, and Pteropine orthoreovirus. Therefore, 16S rRNA screening was performed to study the microbiome community of the patients. V3 and V4 sequences of 16S rRNA were obtained using the Nextera XT DNA library preparation kit and MiSeq Reagent Kit v3 (Illumina). Microbiome profiles of 92 patient samples were obtained through Illumina MiSeq. The total taxonomic composition of the samples consisted of 99 bacterial genera whose sequences were detected in more than 1% of the samples. Common bacterial pathogens were present either as a single pathogen or in combination with other organisms in the patient samples. Although the samples differed in conditions such as age, gender, location, and season, common dominant genera of bacteria generally known as pathogens were revealed. The most dominant genera were Streptococcus, Corynebacterium, Haemophilus, and Rhizobium. Comparative analysis showed that genus compositions were broadly similar but that microbial composition differed between age groups. We attempted to isolate dominant colonies through media culture for whole genome sequencing; single colonies were isolated and 8 species were identified using Sanger sequencing. After isolating more single colonies, we will focus on whole genome sequencing to investigate the cause of the pneumonia symptoms in detail.

59

POTRA: PATHWAY ANALYSIS OF CANCER GENOMICS DATA IN THE CLOUD

Margaret Linan1,2, Junwen Wang1,2, Valentin Dinu1,2

1Department of Biomedical Informatics, Arizona State University, Scottsdale, Arizona, USA; 2Department of Health Sciences, Mayo Clinic, Scottsdale, Arizona, USA

Dinu, Valentin

We have recently developed PoTRA (Pathways of Topological Rank Analysis), a novel algorithm that uses the Google Search PageRank algorithm to identify biological pathways involved in cancer. The analytical approach is motivated by the observation that loss of connectivity is a common topological trait of gene regulatory networks in cancer. We leveraged the Cancer Genomics Cloud (CGC) environment and applied PoTRA to analyze The Cancer Genome Atlas (TCGA) genomic data, a high-quality publicly available dataset of tumor and matched normal samples. The most influential pathways and most dysregulated pathways in 17 TCGA projects were found, using the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway database. Overall, "Pathways in cancer" is the most commonly dysregulated pathway, and the MAPK signaling pathway is the most influential, while the purine metabolism pathway is the most significantly dysregulated metabolic pathway. Additionally, genomic analysis workflows were created using Docker and Rabix for the detection of mRNA-mediated dysregulated pathways in the open-access TCGA repository with the PoTRA tool in the CGC platform. Our approach illustrates the advantages of employing powerful computational methods to analyze large genomic datasets with the aim of improving our understanding of cancer and identifying better diagnoses and treatments.
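For readers unfamiliar with PageRank, the ranking step that PoTRA builds on can be sketched with plain power iteration. The code below is a generic illustration under stated assumptions, not the PoTRA implementation: the function name, the damping factor of 0.85, the fixed iteration count, and the toy directed network with node names `A`, `B`, `C` and `HUB` are all hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Compute PageRank scores for a directed graph given as (source, target)
    pairs, using plain power iteration."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives the same baseline share of rank.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for source in nodes:
            # A node without outgoing links spreads its rank over all nodes.
            targets = out_links[source] or nodes
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy network in which three nodes all point to one hub.
toy_edges = [("A", "HUB"), ("B", "HUB"), ("C", "HUB"), ("HUB", "A")]
scores = pagerank(toy_edges)
print(max(scores, key=scores.get))
```

In a pathway setting, a node that ranks highly in the network built from normal samples but loses rank in the network built from tumor samples would reflect the loss of connectivity that motivates the approach; how PoTRA turns such changes into pathway-level statistics is not shown here.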

60

EVALUATING CELL LINES AS MODELS FOR METASTATIC CANCER THROUGH INTEGRATIVE ANALYSIS OF OPEN GENOMIC DATA

Ke Liu1, Patrick A. Newbury1, Benjamin S. Glicksberg2, William Zeng2, Eran R. Andrechek3, Bin Chen1

1Department of Pediatrics and Human Development, College of Human Medicine, Michigan State University, Grand Rapids, MI, USA; 2Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA, USA; 3Department of Physiology, Michigan State University, East Lansing, MI, USA

Chen, Bin

Metastasis is the most common cause of cancer-related death and, as such, there is an urgent need to discover new therapies to treat metastasized cancers. Cancer cell lines are widely used models to study cancer biology and test drug candidates. However, it is still unknown to what extent they adequately resemble the disease in patients. The recent accumulation of large-scale genomic data in cell lines, mouse models, and patient tissue samples provides an unprecedented opportunity to evaluate the suitability of cell lines for metastatic cancer research. In this work, we used breast cancer as a case study. A comprehensive comparison of the genetic profiles of 57 breast cancer cell lines with those of metastatic breast cancer samples revealed substantial genetic differences. In addition, we identified cell lines that more closely resemble different subtypes of metastatic breast cancer. Surprisingly, a combined analysis of mutation, copy number variation and gene expression data suggested that MDA-MB-231, the most commonly used triple negative cell line for metastatic breast cancer research, had little genomic similarity with Basal-like metastatic breast cancer samples. We further compared cell lines with organoids, a new type of preclinical model that has become more popular in recent years. We found that organoids outperformed cell lines in resembling the transcriptome of metastatic breast cancer samples. However, additional differential expression analysis suggested that neither type of model could mimic the effects of the tumor microenvironment, and that each had its own bias towards modeling specific biological processes. Our work provides a guide for cell line selection in metastasis-related studies and sheds light on the potential of organoids in translational research.

61

PATHWAY ANALYSIS OF EHR AND NON-EHR-BASED GWAS CONNECTS LIPID METABOLISM TO THE IMMUNE RESPONSE

Jason E. Miller1, Thomas J. Hoffmann2,3, Elizabeth Theusch4, Carlos Iribarren5, Marisa W. Medina4, Neil Risch2,3,5, Ronald M. Krauss4, Marylyn D. Ritchie1

1Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA; 2Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA; 3Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, USA; 4Children’s Hospital Oakland Research Institute, Oakland, CA, USA; 5Division of Research, Kaiser Permanente, Northern California, Oakland, CA, USA

Miller, Jason

Pathway analysis is a commonly used method to interpret genome-wide association study (GWAS) results. Recently it has been illustrated that electronic health record (EHR) data from a single cohort can be used to perform GWAS. However, it is unclear how this new study design might affect replication of pathway-level results when compared to a non-EHR-based GWAS. It is also unclear how an EHR-based study will affect downstream analyses such as the identification of genes that are associated with said pathways. We propose evaluating the pathway-level similarities from analyses of two separate GWAS that used different methodologies to investigate the same traits. Here, we employ the software PARIS (Pathway Analysis by Randomization Incorporating Structure) to compare summary-level results across studies, which makes the comparison more generalizable. PARIS generates randomized collections of features which mimic pathways to calculate empirical p-values. This process reduces type I error and the multiple testing burden. We compared EHR-based to non-EHR-based GWAS results using four different lipid traits: low-density lipoprotein (LDL), high-density lipoprotein (HDL), triglycerides (TG), and total cholesterol (TC). The data came from two GWAS: the Genetic Epidemiology Resource on Adult Health and Aging (GERA), a single-cohort EHR-based GWAS, and the Global Lipids Genetics Consortium (GLGC), which used a meta-analysis study design. KEGG pathways expected to explain variation in lipid values, such as "cholesterol metabolism" and "PPAR signaling pathway", were identified from both studies. Moreover, there was a significant overlap between the pathways identified in the two studies for the same traits (p<1x10^-14). Thus, specific pathways can be replicated across distinct cohorts and study designs. Several pathways made up of genes whose proteins are important for an immune response were identified in both data sets and across multiple lipid traits. To see if lipid-modifying therapy affects the same pathways of interest, we performed pathway analysis of CAP RNA-seq expression from Theusch, E., et al., 2016, which measured expression in immortalized cells pre and post statin exposure. Among the pathways represented in both the PARIS results (p<0.01) and LCL RNA-seq gene set enrichment results (FDR<25%) are the "cholesterol metabolism" (CM) and "Hepatitis C" (HC) pathways. Hepatitis C virus (HCV) infection can cause chronic liver disease and is associated with a host of lipid and lipoprotein metabolic disorders. PARIS can also identify genes that are statistically significant within each pathway. Interestingly, in both the LDL and TC GWAS, the gene that was significantly associated with both the CM and HC pathways was low density lipoprotein receptor or LDLR, a gene that affects both lipid metabolism and HC viral activity. Statins, in combination with other therapies, can increase the efficacy of antiviral therapy by blocking viral replication. Our results highlight the need for further investigation into how genetic variation affects outcomes from the treatment of HCV with statins, particularly with respect to loci associated with lipid traits. In conclusion, pathway-level analysis of GWAS summary-level results can be used to characterize similarities across EHR and non-EHR-based studies and improve biological interpretation.

62

META-ANALYSIS OF HETEROGENEITY AND BATCH EFFECTS IN THE A549 CELL LINE

Abigail Moore, John Castorino

School of Natural Sciences, Hampshire College, Amherst, MA

Moore, Abigail

Meta-analysis of RNA-seq data offers the opportunity to increase reproducibility by integrating data from multiple studies. Such analyses are challenged by heterogeneous cell culture and RNA-seq techniques, which may confound or hide true biological findings. Thus, we sought to identify batch characteristics that most significantly affect gene expression in a cell line common to lung cancer and viral studies. We queried the NCBI GEO for RNA-seq data from the A549 cell line and filtered the results for paired-end data obtained via total RNA extraction. Across eight studies, we downloaded raw RNA-seq data for 23 untreated samples and collected corresponding metadata. Differential expression analysis with Salmon, tximport and edgeR identified 3,802 differentially expressed genes (at least two-fold change, FDR<0.05). Principal variance component analysis revealed that media choice alone explains 54% of expression variation within 139 differentially expressed lung cancer prognostic genes. Our findings highlight the impact of specific batch effects on biologically significant genes. In future work, we hope to extend this analysis to consider single nucleotide variants.

63

HYPERPARAMETER TUNING FOR CHIP-SEQ PEAK CALLING SOFTWARE TOOLS USING PARALLELIZED BAYESIAN OPTIMIZATION

Dongpin Oh, Jinhee Lee, Seonghyeon Kim, Dohyeon Lee, Dongwon Choo, Giltae Song

School of Computer Science and Engineering, Pusan National University

Lee, Jinhee

ChIP-seq is widely used to understand protein-DNA interaction and gene regulation. In ChIP-seq data analysis, identifying peak signals is one of the core computational steps, but most existing software tools still suffer from a large portion of false positive calls owing to sequencing errors and bias, in part caused by copy number variations. ChIP-seq analysis tools require hyperparameters set by users depending on sequencing quality and copy number variation rate. However, it is hard for users to know valid values of the hyperparameters before running the software tools. In addition, more false positive peak calls result for given ChIP-seq data if the hyperparameters of peak calling tools are less than optimal. In this study, we develop a software pipeline for identifying the optimal values of the hyperparameters in major ChIP-seq peak calling tools. First, we collect ChIP-seq data whose peak signals are labeled manually by experts. These data are used as training data in our hyperparameter tuning. Second, we define an objective function to measure the accuracy of peak calling results. Then we learn optimal hyperparameters using these training data and the objective function, based on Bayesian optimization. We use the Matern 5/2 kernel function for the optimization and Markov chain Monte Carlo for parallel processing. We validate our approach using our collection of ChIP-seq data labeled for around 2,000 genomic segments, including peaks or no peaks. We apply our software pipeline to major ChIP-seq peak calling tools such as MACS, SICER, HOMER, and PeakSeq.
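The Matern 5/2 kernel named in this abstract has a closed form that is easy to state in code. The sketch below shows only the kernel itself, in its standard textbook form, and is not taken from the authors' pipeline: the function name, the default length scale and variance, and the example hyperparameter vectors are assumptions for illustration.

```python
import math


def matern52(x1, x2, length_scale=1.0, variance=1.0):
    """Matern 5/2 covariance between two points given as equal-length
    sequences of numbers (for example, two hyperparameter settings)."""
    # Euclidean distance between the two points.
    r = math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))
    s = math.sqrt(5.0) * r / length_scale
    # k(r) = variance * (1 + s + s^2 / 3) * exp(-s), with s = sqrt(5) * r / l.
    return variance * (1.0 + s + s * s / 3.0) * math.exp(-s)


# Hypothetical hyperparameter settings: identical, nearby, and far apart.
print(matern52([0.5, 10.0], [0.5, 10.0]))
print(matern52([0.0], [1.0]))
print(matern52([0.0], [3.0]))
```

In Bayesian optimization this kernel defines the Gaussian process prior over the objective function: settings that are close together receive a covariance near the variance parameter, and the covariance decays smoothly as the settings move apart.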

64

CROSS-STUDY META-ANALYSIS IDENTIFIES ALTERED BACTERIAL STRAINS SEPARATING RESPONDER AND NON-RESPONDER POPULATIONS ACROSS MULTIPLE CHECKPOINT-INHIBITOR THERAPY DATASETS

Jayamary Divya Ravichandar, Erica Rutherford, Yonggan Wu, Thomas Weinmaier, Cheryl-Emiliane Chow, Shoko Iwai, Helena Kiefel, Kareem Graham, Karim Dabbagh, Todd DeSantis

Second Genome

Ravichandar, Jayamary Divya

The gut microbiota has emerged as an important modulator in cancer progression, and a growing body of evidence supports the influence of gut microbiota on response to cancer therapy, especially in the context of checkpoint inhibitor therapy. While several studies present insight into the landscape of microbial shifts modulating response to checkpoint inhibitors, they may be unduly influenced by cohort, sequencing technology, and data analysis methods. Further, individual studies are often under-powered to detect microbes differentially abundant in responder and non-responder populations, which can limit therapeutic development. Key to microbiome-based drug discovery is the identification of proteins with therapeutic potential that are efficacious across cohorts. Herein, existing published datasets in the checkpoint-inhibitor space were mined and integrated via a cross-study meta-analysis to identify bacterial strains separating responder and non-responder populations. We compared the baseline gut microbiota associated with stool samples collected from five discrete cancer patient cohorts undergoing checkpoint-inhibitor therapy. Samples were sequenced on one or more technologies (Illumina 16S NGS, 454 16S NGS, and Illumina shotgun metagenomics) and a total of seven publicly available datasets were analyzed herein. Leveraging our multi-faceted bioinformatics platform, which enables appropriate method-specific quality filtering and statistical testing to identify differentially abundant bacteria at the strain level, we were able to successfully integrate analysis results across multiple microbiome-profiling technologies. We performed a random effects model based meta-analysis and identified strains that were concordantly enriched in responder populations across datasets. In a separate analysis we also applied natural language processing to the text of cancer checkpoint inhibitor studies (available in PubMed) in order to obtain additional insights about the microbiome and strains of interest from publications with no raw data available. The strains identified herein present opportunities for mining proteins with potential to improve response to checkpoint inhibitors. This cross-study meta-analysis demonstrates the power of Second Genome's bioinformatics pipeline to leverage publicly available datasets and systematically integrate microbial shifts not only across samples from multiple cohorts but also across samples sequenced on different technologies. Our in-house strain database that enables taxonomic annotation down to the strain level allowed for comparison of fine-grained bacterial identities across datasets, resolving a key challenge with microbiome meta-analysis. This systematic and statistically driven integration of datasets enabled identification of strains associated with response across multiple responder populations that were not previously reported in the independent analysis of these datasets.
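The abstract does not say which random-effects estimator was used. As one common possibility, the sketch below pools per-dataset effect sizes with the DerSimonian-Laird method; the function name `random_effects_pool` and the toy inputs are assumptions for illustration, not part of the authors' platform.

```python
import math


def random_effects_pool(effects, variances):
    """Pool per-study effect sizes with the DerSimonian-Laird estimator.

    effects   -- one effect size per study (e.g. a log fold change for one strain)
    variances -- the within-study variance of each effect size
    Returns (pooled_effect, standard_error, tau_squared).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]                    # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    c = sw - sum(wi * wi for wi in w) / sw
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]      # random-effects weights
    sw_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sw_star
    return pooled, math.sqrt(1.0 / sw_star), tau2


# Toy example: one strain's effect estimated in three hypothetical datasets.
print(random_effects_pool([0.8, 0.5, 1.1], [0.04, 0.09, 0.06]))
```

A strain would count as concordantly enriched when the pooled effect is positive and its confidence interval, built from the returned standard error, excludes zero.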

65

A HYPOTHESIS OF THE STABILIZING ROLE OF ALU EXPANSION VIA HOMOLOGY DIRECTED REPAIR OF SPONTANEOUS DNA DOUBLE STRANDED BREAKS

Tanmoy Roychowdhury, Alexej Abyzov

Mayo Clinic

Abyzov, Alexej

Structural variations (SVs) in the human genome originate from different mechanisms related to DNA repair, replication errors, and retrotransposition. Our analyses of 26,927 SVs from the 1000 Genomes Project revealed differential distributions and consequences of SVs of different origin, e.g., deletions from non-allelic homologous recombination (NAHR) are more prone to disrupt chromatin organization while processed pseudogenes can create accessible chromatin. Spontaneous double stranded breaks (DSBs) are the best predictor of enrichment of NAHR deletions in open chromatin. This evidence, along with strong physical interaction of NAHR breakpoints belonging to the same deletion, suggests that the majority of NAHR deletions are non-meiotic, i.e., originate from errors during homology directed repair (HDR) of spontaneous DSBs. In turn, the origin of the spontaneous DSBs is associated with transcription factor binding in accessible chromatin, revealing the vulnerability of functional, open chromatin. The chromatin itself is enriched with repeats, particularly Alu elements that provide the homology required to maintain stability via HDR. Additionally, we observed a striking difference between distributions of fixed and variable Alus across genome compartments. Through co-localization of fixed Alus and NAHR deletions in open chromatin, we hypothesize that old Alu expansion in the hominid lineage had a stabilizing role on the human genome.

66

STATISTICAL LEARNING WITH HIGH-DIMENSIONAL MASS CYTOMETRY DATA

Pratyaydipta Rudra1, Elena Hsieh2, Debashis Ghosh2

1Oklahoma State University, 2University of Colorado Denver

Rudra, Pratyaydipta

Recent developments in single-cell based technologies, such as mass cytometry (CyTOF), have led to the need for computational and analytic approaches that can accommodate the high dimensionality and single-cell granularity. The analysis of CyTOF data can elucidate novel disease biomarkers and mechanisms of the underlying immunopathology, leading to improved treatments and prognostic measures. The use of single-cell technologies allows for consideration of expression from both a spatial and temporal framework. In spite of the promising nature of these platforms, much work remains in order to be able to meaningfully interpret the data in the context of biological questions. While end-to-end reproducible methods exist for fluorescence flow cytometry data analysis, they do not scale well for CyTOF data, which have much higher dimensionality. The data are often clustered into cell sub-populations first, which can then be used to answer scientific questions regarding the abundance of cell types and expressions of specific parameters (e.g. surface markers, signaling proteins, cytokines) across groups, such as disease and control groups, or stimulation regimes. The statistical questions about the tree-structured cell population data can be visualized in two layers. First, it is clinically interesting to know if the abundance of the cell subpopulations is different across two or more groups and/or conditions. Given the proportion of cell types for each sample, the next question is whether there is any differential expression of signaling proteins or cytokines (functional measurements of the cell populations studied). Modeling data with multiple layers of correlation using a classical parametric model often becomes a challenging task. The classical parametric models also have limiting distributional assumptions such as normality, which may not be true for cytometry data. In order to tackle this, we developed a new statistical learning methodology based on the kernel distance covariance framework to compare the cell type composition across different disease groups and stimulation conditions. High-dimensional statistical learning using a kernel machine regression is also developed to test the difference in cytokine expression levels across different cell types and different conditions. The methods are applied to a high-dimensional dataset we collected containing different subgroups of populations including Systemic Lupus Erythematosus patients and healthy control subjects. The samples from the peripheral blood of the subjects were treated using three different stimulation methods. Preliminary analysis of the data revealed clinically relevant patterns such as differential cell type abundance between the disease and the control group, and also differential expression of several cytokines. For example, the expression of the cytokines MCP1, Mip1b and IL-1RA was found to be different among CD14-high monocytes across the two groups. An extensive simulation study to compare our statistical method with the existing approaches is currently being conducted.

67

HARDWARE ACCELERATION OF APPROXIMATE STRING MATCHING FOR BOTH SHORT AND LONG READ MAPPING

Damla Senol Cali1, Lavanya Subramanian2, Zülal Bingöl3, Jeremie S. Kim1,4, Rachata Ausavarungnirun1, Anant V. Nori2, Gurpreet S. Kalsi2, Sreenivas Subramoney2, Saugata Ghose1, Can Alkan3, Onur Mutlu1,4

1Carnegie Mellon University, 2Intel Labs, 3Bilkent University, 4ETH Zurich

Kim, Jeremie

High throughput sequencing (HTS) technology enables fast and inexpensive generation of billions of DNA sequences (i.e., reads) from a genome [1,2]. To quickly and accurately process the plethora of reads, we need new computational techniques. Analyzing HTS data requires finding the original locations of each read via an approximate string matching process against a long reference genome. Approximate string matching is typically performed with an expensive dynamic programming algorithm, which consumes over 90% of the first step's execution time. Many prior studies [3,4] have identified this bottleneck in mapping and have proposed numerous methods for accelerating this expensive step on a wide array of computational platforms. Our goal in this work is to provide a fast and efficient implementation of approximate string matching towards enabling faster read mapping. We choose to accelerate Bitap [5,6] due to its ability to perform approximate string matching with fast and simple bitwise operations that can be highly parallelized for high throughput. We modified the algorithm to enable searching longer patterns and to remove the data dependency between the iterations and provide parallelism for the large number of iterations. Unfortunately, in our study of Bitap on existing systems, we find that CPUs and GPUs alone are both limited by their respective architectures and thus cannot fully utilize the available hardware for maximal efficiency. Specifically, we find that the CPU implementation of Bitap is bottlenecked by computation, since the working set fits within the L1 cache and the limited number of cores prevents further parallel speedup. The GPU implementation of Bitap is bottlenecked by the limited amount of private memory and destructive interference of threads while accessing the shared memory. In order to overcome the imbalance in each of the above systems, we propose a custom accelerator for Bitap with characteristics that fall between the CPU and GPU. This achieves a finer balance in compute resources and memory for higher performance in approximate string matching. We also explore the design space of various accelerators, including processing in memory.

REFERENCES
[1] Alkan, Can, et al. "Limitations of Next-generation Genome Sequence Assembly," Nature Methods, 2011.
[2] Van Dijk, Erwin L., et al. "Ten Years of Next-generation Sequencing Technology," Trends in Genetics, 2014.
[3] Alser, Mohammed, et al. "GateKeeper: A New Hardware Architecture for Accelerating Pre-alignment in DNA Short Read Mapping," Bioinformatics, 2017.
[4] Kim, Jeremie S., et al. "GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping using Processing-in-Memory Technologies," BMC Genomics, 2018.
[5] Baeza-Yates, Ricardo, et al. "A New Approach to Text Searching," Communications of the ACM, 1992.
[6] Wu, Sun, et al. "Fast Text Search Allowing Errors," Communications of the ACM, 1992.
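To make the "fast and simple bitwise operations" of Bitap concrete, here is a minimal software sketch of the textbook algorithm with up to `max_errors` edits (substitutions, insertions, deletions). It is not the authors' modified algorithm or their accelerator design; the function name `bitap_search` and the convention of returning match end positions are assumptions made for this illustration.

```python
def bitap_search(text, pattern, max_errors=0):
    """Return the end indices in `text` where `pattern` matches with <= max_errors edits.

    Bit j of state[d] is set when pattern[:j + 1] matches some suffix of the text
    read so far with at most d edits.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    # masks[ch] has bit j set wherever pattern[j] == ch.
    masks = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << j)
    full = (1 << m) - 1
    accept = 1 << (m - 1)
    # Before any text is read, a prefix of length <= d matches via d deletions.
    state = [((1 << d) - 1) & full for d in range(max_errors + 1)]
    ends = []
    for pos, ch in enumerate(text):
        mask = masks.get(ch, 0)
        prev = state[:]
        state[0] = ((prev[0] << 1) | 1) & mask
        for d in range(1, max_errors + 1):
            match = ((prev[d] << 1) | 1) & mask
            insertion = prev[d - 1]  # extra character in the text
            # Substitution uses the old d-1 state, deletion the updated one.
            subst_or_del = ((prev[d - 1] | state[d - 1]) << 1) | 1
            state[d] = (match | insertion | subst_or_del) & full
        if state[max_errors] & accept:
            ends.append(pos)
    return ends


print(bitap_search("ACGTTACGTA", "ACGT", 0))
print(bitap_search("ACGTTACCTA", "ACGT", 1))
```

Each text character costs a handful of shifts, ANDs and ORs per error level. That makes the inner loop cheap in hardware, and the dependence of each state on the previous iteration is the kind of data dependency the abstract says the authors removed.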

68

TRANSITION OF REGULATORY FORCE TOWARD THE GENE EXPRESSIONS DURING OSTEOBLAST CELL DIFFERENTIATION

Yoichi Takenaka

Kansai University

Takenaka, Yoichi

Understanding the dynamics of the cell differentiation system is one of the big issues in biology and medicine. It helps to acquire cells of diseased organs from pluripotent stem cells such as ES cells or iPS cells. To analyze the dynamics, the time-series gene expression profiles of cell lines from various organisms have been measured, and these measurements have made the movement of gene expression clear. However, the dynamics of the system, such as the gene regulation mechanism, are not yet well understood.

In the poster, the author shows the dynamics of gene regulations during the osteoblast cell differentiation process from mesenchymal stem cells. There are many genes and gene regulations that are known to be active during the process. However, the activity time of each regulation and the strength of the activated regulations have not been reported. The author proposes a method to elucidate the transitions between the activation and inactivation of gene regulations at the temporal resolution of single time points. The method measures the strength of the gene regulations at each time point in a leave-one-time-point-out manner. Then it decomposes the time series of the gene expression data into partial series using an information criterion. Finally, it determines whether each gene regulation of each partial time series is activated or inactivated.

The gene expression profile of the osteoblast cell differentiation process includes 65 time points ranging from minus 6 hours to 192 hours, where hour 0 is the time the cell differentiation process starts. The profile was downloaded from the Genome Network Platform of the National Institute of Genetics, Japan. The gene regulatory network that is activated at at least one time point during the differentiation was composed from three reviewed papers. It includes 19 genes and 22 regulations, where Runx2, the key transcription factor associated with osteoblast differentiation, is located at the center of the network. Osx (transcription factor Sp7), which serves as a marker for osteoblast differentiation, is downstream of Runx2.

The result shows there are four distinct periods during the osteoblast cell differentiation, and each period indicates when the expressions of genes are strongly controlled.

Before the cell differentiation process starts, Osx, BMP2, DLX5 and HDAC3 are the most strongly controlled among all the 65 time points. Next, EP300 is strongly controlled in the first period. Then, Creb, HDAC3, HDAC4, HDAC5 and Osx are strongly controlled. In the final period, Runx2, Bglap, DLX5, HDAC7 and SMAD6 are strongly controlled. The analysis gives a hint for controlling the cell differentiation process.

69

METHYLATION PROFILES OF MELANOMA TO PREDICT TILS

Yihsuan Tsai1, Nana Nikolaishvili Feinberg1, Kathleen Conway2, Sharon N. Edmiston1, Nancy E. Thomas3, Joel S. Parker4

1Lineberger Comprehensive Cancer Center (LCCC), University of North Carolina at Chapel Hill; 2Department of Epidemiology, School of Public Health, Department of Dermatology, School of Medicine, Lineberger Comprehensive Cancer Center (LCCC), University of North Carolina at Chapel Hill; 3Department of Dermatology, School of Medicine, Lineberger Comprehensive Cancer Center (LCCC), University of North Carolina at Chapel Hill; 4Lineberger Comprehensive Cancer Center (LCCC), Department of Genetics, School of Medicine

Tsai, Yihsuan

Correlations between tumor infiltrating lymphocytes (TILs) and prolonged survival have been reported in many cancers including melanoma. However, current TIL assessment by pathologists reviewing the slide sections is not always ideal. Inter-observer agreement between pathologists may be low if the assessment is quantitative. To achieve higher agreement, the estimates may be translated to categories. Here we propose to train an epigenomics model to estimate the T-cell populations in melanoma samples using immunofluorescence (IF) images of CD3 and CD8 T-cells, which provides a more objective estimation of TILs.

In previous work, we generated methylation profiles for 89 melanoma and 78 nevi samples. To have a gold standard of TIL estimate, 80 out of the 89 melanoma samples were stained with IF to image CD3, CD8, S100 (melanoma marker) and a nuclear counterstain. We defined the fraction of CD3 and/or CD8 positive cells as the T-cell fraction and found that its estimate from the IF image has the most significant association with patient survival. Therefore, an elastic net model was built using features from the methylation dataset with T-cell fraction estimates from the IF image as response. Monte-Carlo cross validation was performed on 2/3 of the samples to tune the parameters. We identified 121 CpGs in the final model to estimate T-cell fraction, which gave us the highest correlation, with Pearson r = 0.87 in validation and r = 0.91 in all samples. We also compared this method with two other methods. In a naïve method, we identified CpGs with high methylation level in external lymphocyte samples and low in our nevi samples. These probes represent a lymphocyte methylation signature on an unmethylated nevi background. Therefore, we calculated the mode of the kernel-smoothed DNA methylation distribution at these sites for each sample as a surrogate for lymphocyte fraction for that sample. This method gave a correlation relative to the gold standard of R = 0.64. Another method uses reference-based cell deconvolution algorithms, where a pre-built methylation reference was used to compute the fractions of each cell type via three different algorithms. While all three algorithms gave similar results, Robust Partial Correlations (RPC) provides the highest correlation with the gold standard (R = 0.58).

We then applied our final model (121 methylation markers) to an external dataset, TCGA-SKCM, to estimate the T-cell fractions. Since there is no gold standard for the TCGA-SKCM dataset, we used survival as a surrogate. We found the T-cell fraction estimate from our model had a strong survival association (Cox p-value = 3.85e-05). We will look at the correlation of our estimation with expression of T-cell gene modules next.

In summary, the predicted T-cell fraction from our methylation markers has very high correlation with the estimates from IF images and is also highly correlated with patient survival.

70

HIGH-THROUGHPUT GENE TO KNOWLEDGE MAPPING THROUGH MASSIVE INTEGRATION OF PUBLIC SEQUENCING DATA

Brian Tsui, Hannah Carter

Department of Medicine, University of California, San Diego

Tsui, Brian Y.

The Sequence Read Archive contains more than one million runs of publicly available sequencing data. However, the lack of consistently preprocessed summary and molecular quantification data (for example, gene expression quantification for RNA-seq) for each sequencing run hinders efficient Big Data interpolation. Here, we introduce Skymap, a standalone database that offers a single, multi-species data matrix incorporating all public sequencing studies. The data matrix contains several omic layers, including expression quantification, allelic read counts, microbe read counts, and ChIP-seq. We reprocessed petabytes of sequencing data to generate the data matrix for each data type. We also offer a reprocessed biological metadata file that describes the relationships between the sequencing runs and the associated keywords, extracted from over 3 million free-text annotations using natural language processing. The processed data can fit onto a single hard drive (<500 GB). At https://github.com/brianyiktaktsui/Skymap, we showcase how one can (1) retrieve and analyze the SNPs and expression of a genetic variant across >250k runs in less than a minute and (2) increase the temporal resolution for tracking gene expression in the mouse developmental hierarchy.

71

MANTA-RAE, PREDICTING THE IMPACT OF GENOME VARIANTS ON THE TRANSCRIPTION FACTOR BINDING POTENTIAL OF REGULATORY ELEMENTS

Robin van der Lee, Phillip A. Richmond, Oriol Fornes, Wyeth W. Wasserman

Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, Canada

van der Lee, Robin

Interpreting the functional impact and pathogenicity of noncoding variants remains challenging. Increasing evidence suggests an important role for alterations that impact cis-regulatory elements and transcription factor (TF) binding sites (TFBSs). We are developing MANTA-RAE, a tool for Mutational ANalysis of Tfbs Alterations by Reconstruction of Altered regulatory Elements. MANTA-RAE will predict the effects of variants on TFBSs in regulatory elements in a three-step approach: (i) reconstructing reference and alternative genotypes based on user-supplied sets of genomic variants and regulatory elements, (ii) predicting TFBSs through sequence scanning with curated TF binding models from JASPAR, and (iii) delta regulatory capacity analysis by comparing the TFBS potential of the reference and alternative sequences. MANTA-RAE will have the capacity to evaluate (i) both losses and gains of TFBSs and (ii) changes beyond single nucleotide variants, including small insertions, deletions, and larger copy number changes. Envisioned applications include prioritization of variants from rare disease and cancer genomes. These features should contribute to richer detection of regulation-altering noncoding variants that may contribute to disease.
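Steps (ii) and (iii) of the approach described above can be pictured with a small scoring sketch. Everything here is hypothetical: the toy three-column weight matrix, the function names `best_pwm_score` and `delta_score`, and the use of the best window score as the "TFBS potential" are assumptions for illustration and do not describe MANTA-RAE's actual implementation or any JASPAR profile.

```python
def best_pwm_score(sequence, pwm):
    """Return the highest sum of per-position weights over all windows of the sequence.

    pwm is a list of dicts, one per motif position, mapping a base to its weight.
    """
    width = len(pwm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(
        sum(pwm[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    )


def delta_score(ref_seq, alt_seq, pwm):
    """Difference in best motif score between alternative and reference sequences."""
    return best_pwm_score(alt_seq, pwm) - best_pwm_score(ref_seq, pwm)


# Hypothetical toy motif that prefers the triplet ACG.
toy_pwm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]
# A single-nucleotide variant (C to T) that weakens the best site.
print(delta_score("TTACGTT", "TTATGTT", toy_pwm))
```

A negative delta suggests a loss of binding potential and a positive delta a gain. Because the reference and alternative sequences are scored independently, the two may differ in length, so insertions and deletions fit the same comparison.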

72

USING QUANTITATIVE PHOSPHOPROTEOMICS TO UNDERSTAND FUNCTIONAL SELECTIVITY OF RECEPTOR TYROSINE KINASES

J. Watson, C. Francavilla, J. M. Schwartz

Faculty of Biology, Medicine and Health, University of Manchester

Watson, Joanne

Cell signalling is the process of translating extracellular messages, or signals, to the inside of the cell in order to coordinate cellular activity. Cells receive signals from the external environment in a myriad of ways, including by the binding of extracellular proteins, called ligands, to receptors on the surface of the cell. Upon ligand binding, the signal is transmitted across the cell surface by the receptor and the signal propagates through the cell, primarily by the post-translational modification of proteins. For the receptor tyrosine kinase (RTK) family, this process is mediated by phosphorylation, a modification which is added to serine, threonine or tyrosine residues of proteins by the activity of kinases and removed by phosphatases. The addition of phosphoryl groups is associated with activation of protein function. Ligand binding induces RTK dimerization and activation of kinase activity, allowing full activation of the receptor. This initiates a sequential cascade of protein phosphorylation, ultimately regulating transcription factor activity to modulate cellular behavior. An unanswered question in the field is how different ligands binding to the same receptor induce distinct signaling cascades, defined by changes in phosphorylation dynamics and consequent cellular behavior, a concept known as functional selectivity. This is demonstrated by fibroblast growth factor (FGF) receptor 2b; when stimulated by either FGF7 or FGF10, an increase in proliferation or migration, respectively, is observed. Quantitative phosphoproteomics is a powerful method for comparing on a global scale the signaling cascades inducing these different behaviors. This comparison will allow us to define patterns of phosphorylation associated with signalling by different ligands, and use this to identify key phosphorylation sites associated with particular cell behaviors. We have developed a workflow to interrogate temporal phosphoproteomics datasets to directly compare the phosphorylation dynamics of cells stimulated by different ligands. As proteins may have multiple phosphorylation sites which can have independent effects and regulation, our approach considers data on the level of both the phosphorylation site and associated protein. Initial clustering of phosphorylated sites with similar dynamics over time is followed by protein-level analysis of functional similarity, using connectivity in graph databases, enrichment for ontological terms, and roles in well-studied signalling pathways (extracted from KEGG). Subsequent steps in the workflow aim to move the analysis from the protein to the phosphorylated sites. By integrating network-based analyses with phosphoproteomics data, we will develop novel methods for understanding and visualizing the role of phosphorylation in functional selectivity.

73

ANERIS APPLIED: SPARK-ENABLED ANALYTICS FOR FULL-SCALE AND REPRODUCIBLE ANNOTATION-BASED GENOMIC STUDIES

Nicholas Wheeler, Jeremy Fondran, Penny Benchek, Jonathan Haines, William S. Bush

Case Western Reserve University

Wheeler, Nicholas

Modern genomic studies are rapidly growing in scale, and the analytical approaches used to analyze genomic data are increasing in complexity. Genomic data management poses logistic and computational challenges, and analyses are increasingly reliant on genomic annotation resources that create their own data management and versioning issues. As a result, genomic datasets are increasingly handled in ways that limit the rigor and reproducibility of many analyses. In this work, we describe an analysis framework based on Spark infrastructure that provides management, rapid access, and flexible analysis of genomic data. By storing large-scale genomic and variant annotation resources alongside genomic data in a distributed system, we provide efficient methods for testing a variety of biologically driven hypotheses for rare variants. Using the well-established Spark framework and analyses designed using Jupyter notebooks, we provide tools that improve processing speed, reduce user-driven data partitioning, and enhance the reproducibility of large-scale genomic studies.

74

PUTTING RELICANTHUS IN ITS PLACE: IMPACT OF MIXTURE MODEL CHOICE ON PHYLOGENETIC RECONSTRUCTION

Madelyne Xiao1, Mercer R. Brugler2, Estefania Rodriguez1

1Department of Invertebrate Zoology, American Museum of Natural History, Central Park West at 79th Street, New York, NY 10024; 2Biological Sciences Department, NYC College of Technology (CUNY), 285 Jay Street, Brooklyn, NY 11201

Xiao, Madelyne

First described in 2006, Relicanthus daphneae is a deep-sea anthozoan that lives on the ocean floor near hydrothermal vents in the East Pacific. It was originally classified as an anemone until a phylogenetic analysis in 2014 called this classification into question. The tree resulting from a maximum likelihood analysis for the Order Actiniaria (anemones) placed Relicanthus outside of Actiniaria; a recent analysis of Relicanthus' mitochondrial gene order, however, suggests its membership among the anemones. An ongoing study seeks to relate the choice of mixture model (e.g., maximum likelihood, maximum parsimony, Bayesian inference) to the resulting phylogenetic tree, taking into account the robustness of the dataset in question (number of genes, specimens, etc.). In particular, we are interested in the impact of mixture model choice on the placement of Relicanthus with respect to the actiniarians.

75

RATIONAL DESIGN OF NOVEL SKP2 INHIBITORS USING DEEP NEURAL NETWORKS

Shuxing Zhang, Beibei Huang, Lon W. Fong

Intelligent Molecular Discovery Laboratory, Department of Experimental Therapeutics, MD Anderson Cancer Center, Houston, TX 77054

Zhang, Shuxing

Deep learning techniques have recently gained more and more attention and show significant promise in generating predictive models for pharmaceutical research. In the present study, we attempt to develop a deep neural network method to design novel therapeutic agents for triple-negative breast cancer (TNBC) by targeting a crucial E3 ligase, Skp2. TNBC represents about 20% of breast cancer cases. It is highly aggressive with poor clinical outcome, and no targeted agents have been shown to be clinically effective in treating TNBC. Skp2 is an F-box protein, constituting one of the four subunits of the Skp1-Cullin-1 (Cul-1)-F-Box (SCF) ubiquitin E3 ligase complex. Earlier studies showed that Skp2 regulates cell cycle progression and proliferation by targeting ubiquitination and degradation of its substrates such as the cell cycle inhibitor p27. Our in-house data also revealed that Skp2 was overexpressed in TNBC and correlated with poor prognosis. In addition, we revealed that genetic Skp2 inactivation also triggered a massive cellular senescence and/or apoptosis response in a p19Arf/p53-independent, but p27-dependent manner. Taken together, our results suggest that targeting Skp2 may represent a general "pro-senescence/apoptosis" and "anti-glycolysis" approach and is a promising therapeutic strategy for TNBC development and metastasis. Herein we developed a novel deep neural network (DNN) method to predict TNBC cell responses to drugs based solely on their chemical features. In particular, a cost function was employed to suppress overfitting. We also adopted an "early stopping" strategy to further reduce overfitting and improve the accuracy of our models. Currently the software has been integrated with a genetic algorithm-based variable selection approach and implemented as part of our DL4DR package. We observed that DL4DR could handle big datasets efficiently, significantly outperforming other methods in model-building and prediction and obtaining better results in big data analysis. When employed to predict drug responses of several highly aggressive TNBC cell lines, DL4DR produced robust and accurate predictions. Therefore, we applied these TNBC models to rationally design new small molecule inhibitors by targeting Skp2. After screening millions of chemical compounds and designing novel structures based on our lead compound ZL25, we conducted a series of biochemical and cellular studies. These experimental examinations demonstrate that the top-ranked molecules indeed inhibit Skp2 E3 ubiquitination functions significantly and kill TNBC cells effectively. Hence DL4DR has been used for our lead optimization of Skp2 inhibitors, and we anticipate that it can be employed as a general tool for hit identification and rational lead design for cancer therapeutics development.

76

PATTERN RECOGNITION IN BIOMEDICAL DATA: CHALLENGES IN PUTTING BIG DATA TO WORK

POSTER PRESENTATIONS

77

ODAL: A ONE-SHOT DISTRIBUTED ALGORITHM TO PERFORM LOGISTIC REGRESSIONS ON ELECTRONIC HEALTH RECORDS DATA FROM MULTIPLE CLINICAL SITES

Rui Duan, Mary Regina Boland, Jason H. Moore, Yong Chen

Department of Biostatistics, Epidemiology & Informatics, University of Pennsylvania

Chen, Yong

Electronic Health Records (EHR) contain extensive information on various health outcomes and risk factors, and therefore have been broadly used in healthcare research. Integrating EHR data from multiple clinical sites can accelerate knowledge discovery and risk prediction by providing a larger sample size in a more general population, which potentially reduces clinical bias and improves estimation and prediction accuracy. To overcome the barrier of patient-level data sharing, distributed algorithms are developed to conduct statistical analyses across multiple sites through sharing only aggregated information. Current distributed algorithms often require iterative information evaluation and transfer across sites, which can potentially lead to a high communication cost in practical settings. In this study, we propose a privacy-preserving and communication-efficient distributed algorithm for logistic regression without requiring iterative communications across sites. Our simulation study showed that our algorithm reached accuracy comparable to that of the oracle estimator, where data are pooled together. We applied our algorithm to EHR data from the University of Pennsylvania health system to evaluate the risks of fetal loss due to various medication exposures.

78

PLATYPUS: A MULTIPLE-VIEW LEARNING PREDICTIVE FRAMEWORK FOR CANCER DRUG SENSITIVITY PREDICTION

Kiley Graim1, Verena Friedl2, Kathleen E. Houlahan3, Joshua M. Stuart3

1Flatiron Institute & Princeton University, 2University of California Santa Cruz, 3Ontario Institute of Cancer Research

Friedl, Verena

Cancer is a complex collection of diseases that are to some degree unique to each patient. Precision oncology aims to identify the best drug treatment regime using molecular data on tumor samples. While omics-level data is becoming more widely available for tumor specimens, the datasets upon which computational learning methods can be trained vary in coverage from sample to sample and from data type to data type. Methods that can "connect the dots" to leverage more of the information provided by these studies could offer major advantages for maximizing predictive potential. We introduce a multi-view machine-learning strategy called PLATYPUS that builds "views" from multiple data sources that are all used as features for predicting patient outcomes. We show that a learning strategy that finds agreement across the views on unlabeled data increases the performance of the learning methods over any single view. We illustrate the power of the approach by deriving signatures for drug sensitivity in a large cancer cell line database. Code and additional information are available from the PLATYPUS website https://sysbiowiki.soe.ucsc.edu/platypus.

79

A SOFTWARE PIPELINE FOR DETERMINING FINE-SCALE TEMPORAL GENOME VARIATION PATTERNS IN EVOLVING POPULATIONS USING A NON-PARAMETRIC STATISTICAL TEST

Minjung Kwak1, Seokwoo Kang2, Dongwon Choo2, Dohyeon Lee2, Jinhee Lee2, Seonghyeon Kim2, Giltae Song2

1Yeungnam University, 2Pusan National University

Song, Giltae

Abnormal variations are frequent in clonal genome evolution of cancers. Such aberrant variations often function as a driver in cancer cell growth. The fundamental evolutionary dynamics underlying these variations in tumor metastasis are still understudied owing to their genetic complexity. Recently, whole genome sequencing has made it possible to determine genome variations in short-term evolution of cell populations. This approach has been applied to evolving populations of model organisms including yeast. It is substantial progress in evolutionary genomics to examine sequence changes at such fine-scale resolution. However, existing statistical tests for analyzing temporal changes in variation across multiple time points are limited in their ability to identify the full spectrum of intermediate changes. We designed a new statistical approach based on the Kolmogorov-Smirnov test and integrated it into a software tool for determining the variation patterns at fine-scale temporal resolution in experimental evolution studies. We validated our method using simulation data that mimic the evolution of fruit fly populations. We compared the results of our method with those of existing methods such as the Cochran-Mantel-Haenszel (CMH) test and the beta-binomial Gaussian process (BBGP) method. We analyzed yeast (Saccharomyces cerevisiae) W303 strain genomes from 40 populations at 12 time points using our software pipeline. Our toolset can also be applied to identify abnormal variation changes in other evolving populations.
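The abstract does not describe how the Kolmogorov-Smirnov test is adapted to time-series allele data, so the sketch below shows only the underlying two-sample statistic: the largest gap between two empirical distribution functions. The function name `ks_statistic` and the example allele-frequency values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)| over all x."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    n, m = len(a_sorted), len(b_sorted)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in set(a_sorted) | set(b_sorted):
        cdf_a = bisect_right(a_sorted, x) / n
        cdf_b = bisect_right(b_sorted, x) / m
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical allele frequencies for one variant at an early and a late time point,
# each measured across replicate populations.
early = [0.02, 0.05, 0.04, 0.03, 0.06]
late = [0.40, 0.55, 0.35, 0.60, 0.48]
print(ks_statistic(early, late))
```

Because the statistic depends only on the ordering of observations, it makes no assumption about the shape of the frequency distribution, which is the non-parametric property the title refers to.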

80

A DEEP LEARNING APPROACH TO IDENTIFYING THE CELLULAR COMPOSITION OF SOLID TISSUE WITH DNA METHYLATION DATA

Meghan E. Muse1, Curtis L. Petersen1, Carmen J. Marsit2, Diane Gilbert-Diamond1, Brock C. Christensen1

1Dartmouth College, 2Emory University

Muse, Meghan

DNA methylation is involved in the establishment of cellular identity, and measured profiles of DNA methylation can be leveraged to deconvolute the underlying cellular composition of a tissue sample. Currently, both reference-based and reference-free methods exist to estimate the relative proportion of inferred cell types in solid tissue using DNA methylation data. However, establishing DNA methylation libraries for reference-based deconvolution in solid tissues is challenging, and use of reference-free approaches to estimate putative cell type proportions is computationally intensive, particularly as sample size increases. As observed patterns in DNA methylation can be most strongly explained by the relative proportion of cell types in a tissue sample, we investigated the utility of implementing an unsupervised variational autoencoder (VAE) approach to learn a defined number of latent dimensions in DNA methylation data and tested their relationship with inferred cell type proportions from a reference-free approach. We implement the Tybalt model developed by Way et al. to learn latent representations of DNA methylation data measured on the Illumina 450K array in 334 placental samples. We compare the results of this method to those from a well-established reference-free method for inferring the relative proportions of putative cell types. We considered models that learned 10 to 100 latent dimensions and selected the model in which the greatest number of putative cell types identified by the reference-free method had moderate correlation (r2 > 0.5) with at least one latent dimension. This resulted in the selection of a model learning 10 latent dimensions. In this model, learned latent dimensions had moderate correlation with 5 of the 9 putative placental cell types identified by the reference-free method and strong correlation (r2 > 0.7) with 2 putative placental cell types. To better understand the underlying biology represented by these latent dimensions, we assess the CpG loci most strongly correlated with the activations of these 5 latent dimensions as a means of identifying genes that are representative of cellular identity.

81

DIRECTLY MEASURING THE RATE AND DYNAMICS OF HUMAN MUTATION BY SEQUENCING LARGE, MULTI-GENERATIONAL PEDIGREES

Thomas A. Sasani, Brent S. Pedersen, Mark Leppert, Ray White, Lisa Baird, Aaron R. Quinlan, Lynn B. Jorde

Department of Human Genetics, University of Utah

Quinlan, Aaron

Developing an accurate estimate of the human germline mutation rate is critical to our understanding of evolution, demography, and genetic disease. Early phylogenetic analyses inferred mutation rates from the observed sequence divergence between humans and related primate species at particular genes and pseudogenes. However, as whole genome sequencing has become ubiquitous, these estimates have been refined using pedigree-based approaches. By identifying mutations present in offspring that are absent from their parents (de novo mutations), it is possible to more accurately approximate the human germline mutation rate. To obtain a precise, unbiased estimate of the mutation rate in humans, we performed deep whole-genome sequencing on blood-derived DNA from 34 of the original three-generation CEPH families from Utah, comprising a total of 604 individuals. These families, which each contain grandparents (P0 generation), parents (F1), and their children (F2), are considerably larger than any used in prior estimates of the human mutation rate, and offer unique power to detect and validate de novo mutation. With a median of 8 F2 individuals per pedigree, we were able to biologically validate putative de novo mutations in the F1 generation by assessing their transmission to a third generation. Using this dataset, we have generated a high-confidence estimate of the human mutation rate (1.31 × 10⁻⁸/bp/generation), observed a significant parental age effect on the rate of de novo mutation, and identified wide variability in family-specific age effects across CEPH pedigrees. To our knowledge, this study represents the first example of a longitudinal analysis of the effect of parental age within individual families. Additionally, we have identified recurrent de novo variants present in multiple F2 offspring, which are likely the result of mosaicism in the parental germline. Finally, we have trained a classification model on the high-quality, transmitted de novo variants in our dataset, and used this model to identify de novo mutations in a large cohort of children from the Simons Foundation Autism Research Initiative (SFARI). Combining the de novo mutations observed in 34 Utah families with the SFARI callset, we have generated a dense genomic map of spontaneous human mutation. We observe regional enrichment of de novo variation in the human genome, and explore the role of sequence context, as well as molecular processes like recombination and gene conversion, on the rate of human mutation.
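The abstract does not spell out how the per-base rate is computed. A common pedigree-based estimator divides the number of de novo mutations by twice the number of callable base pairs (two haploid genomes per child). The sketch below assumes that estimator; the function name and the example numbers are hypothetical and are not taken from the study.

```python
def germline_mutation_rate(dnm_counts, callable_bp):
    """Per-base, per-generation mutation rate pooled over children.

    dnm_counts:  de novo mutation count for each child
    callable_bp: haploid callable base pairs for the same children
    The factor of 2 accounts for the two haploid genomes each child inherits.
    """
    if len(dnm_counts) != len(callable_bp):
        raise ValueError("need one callable-genome size per child")
    total_bp = 2 * sum(callable_bp)
    if total_bp <= 0:
        raise ValueError("callable genome size must be positive")
    return sum(dnm_counts) / total_bp


# Hypothetical child: 70 de novo mutations over 2.7 Gb of callable sequence.
example_rate = germline_mutation_rate([70], [2.7e9])
```

With these made-up inputs the estimate is 70 / 5.4e9, on the same order of magnitude as the rate reported in the abstract.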

82

AVAILABLE PROTEIN 3D STRUCTURES DO NOT REFLECT HUMAN GENETIC AND FUNCTIONAL DIVERSITY

Gregory Sliwoski, Neel Patel, R. Michael Sivley, Charles R. Sanders, Jens Meiler, William S. Bush, John A. Capra

Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA; Center for Structural Biology, Vanderbilt University, Nashville, TN, USA; Institute for Computational Biology, Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, USA; Department of Biochemistry, Vanderbilt University, Nashville, TN, USA; Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA; Department of Chemistry, Vanderbilt University, Nashville, TN, USA; Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA; Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, USA

Bush, William

Genomic databases and clinical trials are substantially biased towards European ancestry populations, and this bias significantly contributes to health disparities. Structural biology has an essential role in investigating protein function and clinical variant interpretation, providing powerful tools for investigating the impact of genetic variants on protein structure and function. However, studies that analyze the 3D structure of proteins typically consider a single canonical amino acid sequence as representative of the protein. Here, we evaluate the potential for this simplification to bias results toward different populations by evaluating how well 66,971 experimentally characterized human protein 3D structures represent the sequence diversity of the proteins they model. Thousands of protein structures have unrepresented alternative sequences commonly found in human populations, and African ancestry individuals' sequences are the least likely to be represented by available structures. Because sequence variability is often limited to a few positions within a protein, we evaluate the likelihood of these small changes to influence protein function. Combining existing annotations and computational modeling, we identify thousands of proteins for which use of a single structure as representative of "wildtype" may bias results against certain populations or individuals. Variants segregating in human populations, but unrepresented in structures, are observed across functional sites involved in stability (134 disulfide bond cysteines), regulation (94 phosphorylation sites), DNA binding (322 residues), small molecule binding (1,463 residues with 362 within drug binding sites), and protein-protein interfaces (6,144 residues). We computationally model more than 700 unrepresented variants' effects on protein stability and protein-protein interaction. Changes in predicted protein stability are found for 28% (156) of the 556 variants, with stabilizing (41) and destabilizing (115) effects predicted. Of 161 protein-interface variants modeled, 25% (41) are predicted to impact protein-protein binding. These variants in human populations have potential to impact the study of their protein's structure and function. With the widespread use of protein structures in basic science and clinical variant interpretation, human protein sequence and structural diversity must be considered to enable accurate and reproducible conclusions from structural analyses.
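One way to picture the "unrepresented variant" check is to compare each population variant against the residue actually present in a solved structure. The sketch below is only an illustration under stated assumptions: the function name, the tuple layout, and the 1% frequency cutoff for "commonly found" are hypothetical, and the abstract does not describe the authors' pipeline at this level of detail.

```python
def unrepresented_variants(structure_seq, first_position, variants, min_frequency=0.01):
    """List common variants whose alternate residue is absent from a structure.

    structure_seq:  residues resolved in the structure, as a string
    first_position: protein position of structure_seq[0] (1-based numbering)
    variants:       iterable of (position, alt_residue, allele_frequency)
    min_frequency:  assumed cutoff for a "common" variant
    """
    found = []
    for position, alt, frequency in variants:
        index = position - first_position
        if not 0 <= index < len(structure_seq):
            continue  # position not covered by this structure
        modeled = structure_seq[index]
        if frequency >= min_frequency and modeled != alt:
            found.append((position, modeled, alt, frequency))
    return found
```

A variant is reported only when the structure covers its position, models a different residue there, and the variant is at or above the assumed frequency cutoff.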

83

SEMANTIC WORKFLOWS FOR BENCHMARK CHALLENGES: ENHANCING COMPARABILITY, REUSABILITY AND REPRODUCIBILITY

Arunima Srivastava1, Ravali Adusumilli2, Hunter Boyce2, Daniel Garijo3, Varun Ratnakar3, Rajiv Mayani3, Thomas Yu4, Raghu Machiraju1, Yolanda Gil3, Parag Mallick2

1The Ohio State University, 2Stanford University, 3University of Southern California, 4Sage Bionetworks

Srivastava, Arunima

Benchmark challenges, such as the Critical Assessment of Structure Prediction (CASP) and Dialogue for Reverse Engineering Assessments and Methods (DREAM), have been instrumental in driving the development of bioinformatics methods. Typically, challenges are posted, and then competitors perform a prediction based upon blinded test data. Challengers then submit their answers to a central server where they are scored. Recent efforts to automate these challenges have been enabled by systems in which challengers submit Docker containers, units of software that package up code and all of its dependencies, to be run on the cloud. Despite their incredible value for providing an unbiased test-bed for the bioinformatics community, there remain opportunities to further enhance the potential impact of benchmark challenges. Specifically, current approaches only evaluate end-to-end performance; it is nearly impossible to directly compare methodologies or parameters. Furthermore, the scientific community cannot easily reuse challengers' approaches, due to lack of specifics, ambiguity in tools and parameters, as well as problems in sharing and maintenance. Lastly, the intuition behind why particular steps are used is not captured, as the proposed workflows are not explicitly defined, making it cumbersome to understand the flow and utilization of data. Here we introduce an approach to overcome these limitations based upon the WINGS semantic workflow system. Specifically, WINGS enables researchers to submit complete semantic workflows as challenge submissions. By submitting entries as workflows, it then becomes possible to compare not just the results and performance of a challenger, but also the methodology employed. This is particularly important when dozens of challenge entries may use nearly identical tools, but with only subtle changes in parameters (and radical differences in results). WINGS uses a component-driven workflow design and offers intelligent parameter and data selection by reasoning about data characteristics. This proves to be especially critical in bioinformatics workflows where using default or incorrect parameter values is prone to drastically altering results. Different challenge entries may be readily compared through the use of abstract workflows, which also facilitate reuse. WINGS is housed on a cloud-based setup, which stores data, dependencies and workflows for easy sharing and utility. It also has the ability to scale workflow executions using distributed computing through the Pegasus workflow execution system. We demonstrate the application of this architecture to the DREAM proteogenomic challenge.

84

PRECISION MEDICINE: IMPROVING HEALTH THROUGH HIGH-RESOLUTION ANALYSIS OF PERSONAL DATA

POSTER PRESENTATIONS

85

CLASS PRIOR ESTIMATION AND QUANTIFICATION OF THE LOSS AND GAIN OF RESIDUE FUNCTION UPON MUTATION

Shantanu Jain1, Jose Lugo-Martinez2, Martha White3, Michael W. Trosset4, Predrag Radivojac1

1Northeastern University, 2Carnegie Mellon University, 3University of Alberta, 4Indiana University

Jain, Shantanu

Standard algorithms for binary classification assume access to labeled data from both the positive and the negative class. However, in many biological problems, labeled examples from one of the classes (say, negatives) are not available. In this scenario, a positive-unlabeled learner, which relies on positive and unlabeled examples only, is used. Surprisingly, this strategy leads to an optimal score function. However, picking an optimal threshold to construct the final classifier requires knowledge of the class priors, the proportions of positives and negatives in the unlabeled data. I will 1) present a nonparametric algorithm for estimation of the class priors based on a mixture model formulation, 2) elucidate the assumptions necessary for the algorithm, and 3) derive a class-prior-preserving univariate transform for dimensionality reduction and thereby obtain a practical algorithm for multivariate data. Moreover, I will also demonstrate how the posterior can be estimated using the estimate of the class priors. I will further extend these results to a more general setting where some of the examples labeled as positive are in fact negative. I will present experimental results demonstrating the efficacy of our algorithm, comparing it with state-of-the-art methods and other baseline methods on many real and synthetic datasets. Lastly, I will present a biological application of this work to establish the loss and gain of residue function as a common mechanism for inherited diseases.
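To illustrate why the class prior is needed to turn a positive-unlabeled score into a posterior, the sketch below applies the standard density-ratio identity. Assume a classifier trained to separate labeled positives from unlabeled examples outputs tau(x) = P(labeled | x), that c is the fraction of labeled examples in that training set, and that alpha is the proportion of positives in the unlabeled data. Then P(y = 1 | x) = alpha · ((1 − c) / c) · tau(x) / (1 − tau(x)). The function name is hypothetical, and this identity is shown for illustration; it is not necessarily the estimator presented in the talk.

```python
def pu_posterior(tau, alpha, c):
    """Posterior P(y=1 | x) for an unlabeled example in positive-unlabeled learning.

    tau:   P(labeled | x) from a labeled-vs-unlabeled classifier
    alpha: class prior, the proportion of positives among unlabeled data
    c:     fraction of labeled examples in the classifier's training set
    """
    if not 0.0 < c < 1.0:
        raise ValueError("c must lie strictly between 0 and 1")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if tau >= 1.0:
        return 1.0  # the density ratio diverges; cap at certainty
    ratio = (1.0 - c) / c * tau / (1.0 - tau)  # estimates f_pos(x) / f_unlabeled(x)
    # Estimation noise can push the product outside [0, 1], so clip it.
    return min(1.0, max(0.0, alpha * ratio))
```

The clipping step is a practical choice for noisy estimates of tau and alpha rather than part of the identity itself.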

86

PREDICTION OF TIME TO INSULIN USING CLINICAL AND GENETIC BIOMARKERS IN TYPE 2 DIABETES PATIENTS

Rikke Linnemann Nielsen1, Louise Donnelly2, Agnes Martine Nielsen3, Konstantinos Tsirigos1, Kaixin Zhou2, Bjarne Ersboell3, Line Clemmensen3, Ewan Pearson2, Ramneek Gupta1

1Department of Bio and Health Informatics, Technical University of Denmark; 2Medical Research Institute, University of Dundee, United Kingdom; 3Department of Applied Mathematics and Computer Science, Technical University of Denmark

Nielsen, Rikke Linnemann

Type II diabetes (T2D) is a complex metabolic disorder where the risk of fast or slow disease progression is highly dependent on the individual. Therefore, it is useful to identify predictive biomarkers for diabetes progression and relevant patient subgroup characteristics that may assist clinical decisions in T2D treatment management. In this study, we obtained electronic medical records from a cohort-based population in Tayside, UK, registered from December 1994 to September 2015. Using life-style data, anthropometry, biochemical data, drug-prescription data and genetic features from electronic medical records on 6871 T2D patients, artificial neural network models (ANN) were trained with two-layer cross-validation to classify T2D patients' progression given as patients' time to insulin (TTI). TTI was defined as the first day of insulin treatment or as the clinical need for insulin (HbA1c > 8.5% treated with two or more non-insulin diabetes therapies). Prediction targets were TTI within year 1, 3 or 5 from the time of diagnosis. Genetic variants were selected by prior knowledge on T2D and glycemic trait predisposition SNPs from ~80M imputed SNPs. Prediction models were investigated for understanding which biomarkers were most predictive of progression. ANNs with all data except genetic variants predicted TTI for year 1 (0.92±0.02, 0.83±0.04, 0.86±0.04 for AUC, sensitivity and specificity, respectively), year 3 (0.82±0.03, 0.71±0.05, 0.78±0.04) and year 5 (0.78±0.02, 0.66±0.02, 0.76±0.02). Most important features included HbA1c, GAD antibody concentration and the type of diabetes therapy patients were receiving at the time of confirmed diagnosis. Integration of genetic variants, using a forward selection strategy, resulted in a slightly improved performance in all three models: year 1 (0.94±0.01, 0.83±0.03, 0.90±0.01), year 3 (0.85±0.02, 0.72±0.05, 0.80±0.02), and year 5 (0.80±0.03, 0.68±0.04, 0.78±0.02). We are currently examining the robustness of the selected SNPs by building an ensemble of multiple models with different features and investigating if the genetic features are relevant to specific patient subgroups, as well as carrying out further longitudinal work with the phenotype to include more information about a given patient using longitudinal patient information across irregularly sampled time points.
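The TTI definition and the three prediction targets can be written down as a small labeling routine. This is a sketch under assumptions: the record layout (visit tuples), the function names, and the use of 365.25 days per year are hypothetical, and censoring (patients with too little follow-up) is deliberately not handled here.

```python
from datetime import date


def time_to_insulin(diagnosis, insulin_start, visits):
    """Days from diagnosis to TTI, or None if neither criterion is met.

    insulin_start: date of first insulin treatment, or None
    visits: iterable of (visit_date, hba1c_percent, n_non_insulin_therapies)
    Clinical need for insulin: HbA1c > 8.5% while on two or more
    non-insulin diabetes therapies.
    """
    candidates = []
    if insulin_start is not None:
        candidates.append(insulin_start)
    need_dates = [d for d, hba1c, n_therapies in visits
                  if hba1c > 8.5 and n_therapies >= 2]
    if need_dates:
        candidates.append(min(need_dates))
    if not candidates:
        return None
    return (min(candidates) - diagnosis).days


def tti_labels(tti_days, horizons_years=(1, 3, 5)):
    """Binary targets: did TTI occur within each horizon after diagnosis?"""
    return {h: tti_days is not None and tti_days <= h * 365.25
            for h in horizons_years}
```

For a hypothetical patient diagnosed on 2000-01-01 whose first qualifying visit falls on 2002-06-01, the year 1 label is negative and the year 3 and year 5 labels are positive.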

87

PATHOGENICITY AND FUNCTIONAL IMPACT OF INSERTION/DELETION AND STOP GAIN VARIATION IN THE HUMAN GENOME

Kymberleigh A. Pagel1, Danny Antaki2, Matthew Mort3, David N. Cooper3, Jonathan Sebat2, Lilia M. Iakoucheva2, Sean D. Mooney4, Predrag Radivojac5

1Indiana University, 2University of California San Diego, 3Cardiff University, 4University of Washington, 5Northeastern University

Radivojac, Predrag

An individual human exome may contain hundreds of protein-coding insertion/deletions (indels) and dozens of protein truncating variants. Accurate differentiation between phenotypically neutral and disease-causing genetic variation remains an open problem, particularly among the excess of indel variants brought about by recent developments in sequencing technologies. Indel and protein truncating variants exhibit diverse impact on protein sequence, from a single residue to the deletion of entire functional domains. We present machine learning methods to predict the pathogenicity and the types of functional residues impacted by loss-of-function and indel variation. The models show good predictive performance and the potential to identify effects upon residues predicted to affect structural and functional features, including secondary structure, intrinsic disorder, metal and macromolecular binding, post-translational modifications, and catalytic residues. We identify structural and functional mechanisms that are impacted preferentially by germline variation from the Human Gene Mutation Database, recurrent somatic variation in COSMIC, and de novo variation from individuals with neurodevelopmental disorders. Collectively, the pathogenicity prediction and predicted functional effects provide a framework to facilitate the interrogation of indel and protein truncating variants.

88

DETECTING POTENTIAL PLEIOTROPY ACROSS CARDIOVASCULAR AND NEUROLOGICAL DISEASES USING UNIVARIATE, BIVARIATE, AND MULTIVARIATE METHODS ON 43,870 INDIVIDUALS FROM THE EMERGE NETWORK

Xinyuan Zhang1, Yogasudha Veturi1, Shefali S. Verma1, William Bone1, Anurag Verma1, Anastasia M. Lucas1, Scott Hebbring2, Joshua C. Denny3, Ian Stanaway4, Gail P. Jarvik4, David Crosslin4, Eric B. Larson5, Laura Rasmussen-Torvik6, Sarah A. Pendergrass7, Jordan W. Smoller8, Hakon Hakonarson9, Patrick Sleiman9, Chunhua Weng10, David Fasel10, Wei-Qi Wei3, Iftikhar Kullo11, Daniel Schaid11, Wendy K. Chung10, Marylyn D. Ritchie1

1University of Pennsylvania, 2Marshfield Clinic, 3Vanderbilt University, 4University of Washington, 5Kaiser Permanente Washington Health Research Institute, 6Northwestern University, 7Geisinger Health System, 8Massachusetts General Hospital, 9Children's Hospital of Philadelphia, 10Columbia University, 11Mayo Clinic

Zhang, Xinyuan

The link between cardiovascular diseases and neurological disorders has been widely observed in the aging population. Disease prevention and treatment rely on understanding the potential genetic nexus of multiple diseases in these categories. In this study, we were interested in detecting pleiotropy, or the phenomenon in which a genetic variant influences more than one phenotype. Marker-phenotype association approaches can be grouped into univariate, bivariate, and multivariate categories based on the number of phenotypes considered at one time. Here we applied one statistical method per category followed by an eQTL colocalization analysis to identify potential pleiotropic variants that contribute to the link between cardiovascular and neurological diseases. We performed our analyses on ~530,000 common SNPs coupled with 65 electronic health record (EHR)-based phenotypes in 43,870 unrelated European adults from the Electronic Medical Records and Genomics (eMERGE) network. There were 31 variants identified by all three methods that showed significant associations across late-onset cardiac and neurologic diseases. We further investigated functional implications of gene expression on the detected "lead SNPs" via colocalization analysis, providing a deeper understanding of the discovered associations. In summary, we present the framework and landscape for detecting potential pleiotropy using univariate, bivariate, multivariate, and colocalization methods. Further exploration of these potentially pleiotropic genetic variants will work toward understanding disease-causing mechanisms across cardiovascular and neurological diseases and may assist in considering disease prevention as well as drug repositioning in future research.
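The step of keeping only variants flagged by all three method categories amounts to a set intersection over per-method significant hits. The sketch below assumes each method yields one p-value per variant and uses the conventional genome-wide threshold of 5e-8; the abstract states neither the thresholds nor the data layout, so both are assumptions, as are the function names.

```python
def significant(pvalues, threshold):
    """Variants whose p-value passes the threshold for one method."""
    return {variant for variant, p in pvalues.items() if p < threshold}


def shared_significant_variants(results, threshold=5e-8):
    """Variants significant under every method.

    results: dict mapping a method label (e.g. "univariate") to a dict
             of variant identifier -> p-value.
    """
    hit_sets = [significant(pvalues, threshold) for pvalues in results.values()]
    if not hit_sets:
        return set()
    return set.intersection(*hit_sets)
```

In practice each method may call for its own significance threshold; a single shared cutoff is used here only to keep the example short.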

89

PHARMGKB: THE API AND INFOBUTTONS

Michelle Whirl-Carrillo1, Ryan M. Whaley1, Mark Woon1, Russ B. Altman2, Teri E. Klein3

1Department of Biomedical Data Science, Stanford University; 2Department of Bioengineering, Medicine and Genetics, Stanford University; 3Department of Biomedical Data Science and Medicine, Stanford University

Alena, Orlenko

PharmGKB is the largest publicly available resource for pharmacogenomics (PGx) discovery and implementation. Its mission is to collect, curate, integrate and disseminate knowledge about how human genetic variation influences drug response. PharmGKB knowledge is defined by a data model, stored in a database, and accessed through the Application Programming Interface (API). The API supplies data to the www.pharmgkb.org website, which is the most common way for people to query and view the knowledge content of PharmGKB. Additionally, the PharmGKB API supports the InfoButton specification, which is used on the ClinGen website as well as by others in their EHR systems. The Infobutton Implementation Guide provides a standard mechanism for EHR systems to submit knowledge requests to knowledge resources over the HTTP protocol for point-of-care decision support. PharmGKB provides this as part of its standard API, using RXCUIs (RxNorm concept unique identifiers) and normalization of drug names, and returns HTML, with plans to support JSON and XML (https://api.pharmgkb.org/infobutton.html). For InfoButtons, the EHR displays a button for the user to click that will query PharmGKB and display information directly in the EHR application. Inside the application, a list of drug identifiers (RXCUIs) is created and then submitted to the InfoButton service's URL. The URL then returns a report in HTML that is displayed to the EHR user directly in the interface. The PharmGKB InfoButton implementation displays dosing guideline annotations, drug label annotations, and top-level clinical annotations that are relevant to the drug identifiers provided by the user. We monitor the API request logs to assess usage.
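The request an EHR sends can be sketched as a URL carrying the drug code and its code system. The parameter names below (`mainSearchCriteria.v.c`, `mainSearchCriteria.v.cs`, `knowledgeResponseType`) come from the HL7 URL-based Infobutton implementation guide, and 2.16.840.1.113883.6.88 is the RxNorm code-system OID. The base URL is a placeholder, because the abstract gives only the documentation page and not the service endpoint; the RXCUI in the example is illustrative, and sending one request per identifier is an assumption about how a client might batch a drug list.

```python
from urllib.parse import urlencode

RXNORM_OID = "2.16.840.1.113883.6.88"  # HL7 OID identifying RxNorm


def infobutton_urls(base_url, rxcuis):
    """Build one Infobutton knowledge-request URL per RXCUI.

    base_url is supplied by the caller; no endpoint is assumed here.
    """
    urls = []
    for rxcui in rxcuis:
        params = {
            "mainSearchCriteria.v.c": str(rxcui),    # the drug code
            "mainSearchCriteria.v.cs": RXNORM_OID,   # its code system
            "knowledgeResponseType": "text/html",    # ask for an HTML report
        }
        urls.append(base_url + "?" + urlencode(params))
    return urls


# Placeholder endpoint and an illustrative RXCUI; neither is from the abstract.
example_urls = infobutton_urls("https://example.org/infobutton", ["11289"])
```

The code only constructs URLs and performs no network access; the real endpoint and any additional required parameters should be taken from the PharmGKB documentation page cited in the abstract.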

90

SINGLE CELL ANALYSIS – WHAT IS IN THE FUTURE?

POSTER PRESENTATIONS

91

INTRATUMOR HETEROGENEITY (ITH) METRIC OF CIRCULATING TUMOR CELL (CTC)-DERIVED XENOGRAFT MODELS IN SMALL CELL LUNG CANCER.

Yuanxin Xi1, C. Allison Stewart2, Carl M. Gay2, Hai Tran2, Bonnie Glisson2, John V. Heymach2, Paul Robson3, Lauren A. Byers2, Jing Wang1

1Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA; 2Department of Thoracic/Head & Neck Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA; 3The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA

Xi, Yuanxin

Small cell lung cancer (SCLC) is an aggressive malignancy characterized by rapid onset of platinum-resistance. Once considered a homogeneous disease, recent analyses of SCLC have shown intra-tumoral heterogeneity (ITH) associated with treatment-resistance. To further investigate the contribution of ITH to clinical outcomes in SCLC, we profiled single-cell RNAseq expression of circulating tumor cell (CTC)-derived xenograft (CDX) models from SCLC patients that recapitulate patient tumor genomics and response to platinum chemotherapy. Characterizing the heterogeneity of tumor cell subpopulations remains a bioinformatics challenge in analyzing single-cell RNAseq data for CDX models, mostly due to the lack of an accurate method to quantify the complexity of tumor cell expression patterns at single cell resolution and discover the correlations with different tumor development or treatment response mechanisms. In this study, we developed a variance-based metric to measure the overall heterogeneity of tumor cell populations based on single cell RNAseq expression profiles. We applied this metric to the Chromium 10x single cell RNAseq data of 4 SCLC CDX models that have different platinum treatment responses, identified a global increase of intra-tumor heterogeneity in platinum-resistant models compared with platinum-sensitive models, and defined variable gene expression as a reliable hallmark of increasing therapeutic resistance in SCLC. Further gene set enrichment analysis (GSEA) of the treatment-naïve and relapsed samples revealed that the increased ITH metric was associated with multiple concurrent resistance mechanisms, suggesting that resistance to molecularly targeted therapies does not follow a predictable, reproducible pathway within the same CDX model. These results showed that the variance-based ITH metric successfully characterized the resistance-associated heterogeneity increases in SCLC tumor cells, and more broadly, it provides a general-purpose quantitative measurement of tumor cell subpopulation heterogeneity in single cell analysis.
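The abstract does not give the formula for the variance-based metric. One simple stand-in, shown below purely as an assumption, is the mean across genes of the population variance of (log-normalized) expression across cells; the function name and the input layout are hypothetical, and the authors' metric may differ in normalization, gene selection, and weighting.

```python
from statistics import mean, pvariance


def ith_score(expression):
    """Mean per-gene variance across cells, as a toy heterogeneity score.

    expression: list of cells, each a list of per-gene expression values
                (assumed already normalized and log-transformed).
    """
    if not expression:
        raise ValueError("need at least one cell")
    per_gene = list(zip(*expression))  # transpose to genes x cells
    return mean([pvariance(values) for values in per_gene])
```

Under this toy definition a population of identical cells scores 0, and the score grows as expression diverges between cells, which is the qualitative behaviour the abstract describes for resistant versus sensitive models.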

92

WHEN BIOLOGY GETS PERSONAL: HIDDEN CHALLENGES OF PRIVACY AND ETHICS IN BIOLOGICAL BIG DATA

POSTER PRESENTATIONS

93

QUANTIFYING THE IDENTIFIABILITY OF INDIVIDUALS USING A SPARSE SET OF SNPS

Prashant S. Emani, Gamze Gursoy, Mark B. Gerstein

Department of Molecular Biophysics and Biochemistry, Yale University

Emani, Prashant

The recent revolution in high-throughput genomics has led to the proliferation of publicly available datasets and databases enabling queries on individual genotypes, whether in the form of reference genotypes, single nucleotide polymorphism (SNP) "beacons" or functional genomics data with significant identifying-information leakage. It is therefore of interest to quantify the power of a sparse set of SNPs to reveal the identity of an individual, as this would help determine the privacy risks of making particular datasets accessible to the research community. Such an evaluation would enable a principled cost-benefit analysis to determine the right balance of public and private data accessibility. We present a tool for such quantification based on well-established Hidden Markov Models (HMMs) of chromosomal recombination (Li and Stephens, 2003): the central idea is to explore the state space of reference haplotypes from a database, and find the trajectory through this space that best describes observed genotypes. The tool enables simple SNP-based kinship analysis by the identification of queried individuals as a "mosaic", or piecewise combination, of the input reference haplotypes, while allowing for genotyping error and de novo mutation. The output includes the best-fit reference haplotype trajectories, which for a small set of input SNPs, could result in several equal-probability possibilities. However, even in this case, inferences could be made on the membership of an individual in certain haplotype communities based on their enrichment within the best-fit trajectories. This approach parallels linkage disequilibrium- (LD-) based methods, but avoids any assumptions of population homogeneity as it does not require explicit calculation of allele frequencies or SNP correlations. It is, of course, dependent on the availability of a sufficiently rich database to ensure that the queried individual is at least related. This limitation is fast becoming a non-issue, however, with the constant expansion of population-level genotype databases. The results of representative simulations using the 1000 Genomes reference dataset with randomly chosen common SNPs (allele frequency > 0.05) from a single chromosome are: searching for a genotyped individual among 100 phased genotypes (= 200 reference haplotypes) yielded accurate discovery with as few as 12 SNPs; including a mutation rate of 0.1–0.2 increased the number of SNPs required for reliable identification to ~25; simulations of mosaic samples composed of two reference individuals, each contributing half of the SNPs, suggested that ~30 SNPs could be sufficient to identify the two constituent individuals. These numbers would likely be improved upon when all chromosomes are combined. In summary, we provide a tool that can serve to identify observed genotypes either known to be members of a database, or related to individuals within the database, under varying conditions of mutation and recombination rates with no assumptions about the population-specific allele frequencies of SNPs.
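The "best trajectory through reference haplotypes" idea can be illustrated with a Viterbi pass over a Li and Stephens style copying model. The sketch below is deliberately simplified and makes several assumptions that are not in the abstract: the query is a single haplotype rather than a diploid genotype, the switch (recombination) probability is constant between adjacent SNPs, and mismatches are emitted with a fixed error probability that lumps together genotyping error and mutation. All names and default values are hypothetical.

```python
from math import log


def best_copying_path(query, refs, switch=0.01, error=0.001):
    """Most likely sequence of copied reference haplotypes (Viterbi).

    query:  list of alleles (0/1) for the queried haplotype
    refs:   list of reference haplotypes, each the same length as query
    switch: probability of a recombination event between adjacent SNPs (0 < switch < 1)
    error:  probability that the query differs from the copied allele (0 < error < 1)
    Returns a list giving the index of the copied reference at each SNP.
    """
    n, m = len(refs), len(query)
    stay = log(1.0 - switch + switch / n)  # keep copying the same haplotype
    move = log(switch / n)                 # jump to any one particular haplotype

    def emit(k, i):
        return log(1.0 - error) if refs[k][i] == query[i] else log(error)

    score = [log(1.0 / n) + emit(k, 0) for k in range(n)]
    back = []
    for i in range(1, m):
        best_prev = max(range(n), key=score.__getitem__)
        new_score, pointers = [], []
        for k in range(n):
            # The best predecessor is either k itself or the global best state.
            if score[k] + stay >= score[best_prev] + move:
                new_score.append(score[k] + stay + emit(k, i))
                pointers.append(k)
            else:
                new_score.append(score[best_prev] + move + emit(k, i))
                pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    k = max(range(n), key=score.__getitem__)
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return path
```

Because all switches share one probability, only the current state and the globally best state need to be compared at each SNP, which keeps the pass linear in the number of reference haplotypes. When several trajectories are equally probable, as the abstract notes can happen with few SNPs, this sketch returns just one of them.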

94

TRANSCRIPTOMIC SUMMARY SPLICING DATA MAY LEAK PERSONAL PRIVATE INFORMATION BY COMPUTATIONAL LINKAGE TO THE GENOMIC VARIANTS

Zhiqiang Hu1, Mark B. Gerstein2, Steven E. Brenner1

1University of California, Berkeley, 2Yale University

Brenner, Steven

Sharing genomes without personal identifiers has been common practice in biological and medical research. However, recent studies revealed the risk of re-identifying people from their genomes, or attached quasi-identifiers, such as sex, birth date, and zip code. Moreover, consumer databases now contain genetic data for millions of individuals; a recent study suggested that most Americans have detectable family relationships in these databases, allowing their identification using demographic identifiers. The additional availability of an individual's RNA-seq data has implications for privacy, as it may be linked to the genome, potentially allowing the person's privacy to be breached. For example, sex and ethnicity information may be inferred directly from a genome, and the study may provide a zip code. This genome could be linked to RNA-seq data from a diabetes study with attached birth dates and income. These combined quasi-identifiers may uniquely identify the person, and the study reveals the person's diabetes disease status.

RNA-seq reads contain genetic variants, and thus can be directly linked to the genome. To avoid this risk, some researchers now release gene expression, isoform expression and exon read count data instead of the raw sequencing reads.

However, gene expression can also be linked to the genome based on expression QTLs (eQTLs). Using a Bayesian framework, we found that it is feasible to predict genomic variants from summarized splicing data. Based on GTEx splicing QTL (sQTL) data, using relative isoform expression from 15 genes, we could identify the target genome within a pool containing hundreds of individuals with >90% accuracy. We could also link RNA-seq data from a certain tissue or cell type to the genome using parameters trained from a similar tissue, indicating that parameters trained on major tissues may enable the linkage of RNA-seq from all types of human samples to the genome. By quantitatively measuring the information leakage from each sQTL, we found that it is possible to identify the target genome of an RNA-seq dataset from millions of individuals using more sQTLs. Researchers have proposed to eliminate the risk of eQTL-based linking attacks by adding noise to the gene expressions, based on the observation that only a few genes enable linkage. However, our framework suggested that there are now many more such genes than previously reported. We find that expression data enables the re-identification of a target genome from a pool containing billions of genomes. Our result implies that mitigation of the linking risk by adding noise would severely compromise the biological utility of the data, since the data will no longer be biologically meaningful when over half of gene expressions are modified. Our study also implies that other kinds of "omic" data, including DNA modification and protein metabolite levels, may also leak genome privacy.
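A linking attack of this kind can be pictured as Bayesian scoring of candidate genomes against observed isoform ratios. The sketch below assumes a Gaussian model in which each sQTL genotype (0, 1 or 2 copies of the alternate allele) shifts the expected relative isoform expression, with a shared standard deviation and a uniform prior over candidates. The model, parameter values, and names are all assumptions for illustration; the abstract does not specify the authors' framework at this level.

```python
from math import exp, log, pi


def log_normal_pdf(x, mu, sd):
    """Log density of a normal distribution."""
    return -0.5 * log(2.0 * pi * sd * sd) - (x - mu) ** 2 / (2.0 * sd * sd)


def link_posteriors(ratios, sqtl_means, sd, genomes):
    """Posterior probability that each candidate genome produced the RNA-seq data.

    ratios:     observed relative isoform expression, one value per sQTL gene
    sqtl_means: per sQTL, expected ratio for genotypes (0, 1, 2)
    sd:         assumed standard deviation of the ratio around its mean
    genomes:    dict of candidate name -> list of genotypes at each sQTL
    A uniform prior over candidates is assumed.
    """
    log_like = {
        name: sum(
            log_normal_pdf(r, means[g], sd)
            for r, means, g in zip(ratios, sqtl_means, genotypes)
        )
        for name, genotypes in genomes.items()
    }
    top = max(log_like.values())  # subtract the maximum for numerical stability
    weights = {name: exp(value - top) for name, value in log_like.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}
```

Each additional informative sQTL multiplies the likelihood ratio between the true genome and its competitors, which is the intuition behind using more sQTLs to scale from hundreds of candidates to much larger pools.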

95

WORKSHOP: MERGING HETEROGENEOUS DATA TO ENABLE KNOWLEDGE DISCOVERY

POSTER PRESENTATION

96

TO SEARCH A HETNET... HOW ARE TWO NODES CONNECTED?

Daniel Himmelstein1, Michael Zietz1, Kyle Kloster2, Michael Nagle3, Blair Sullivan2, Casey S. Greene1

1University of Pennsylvania, 2North Carolina State University, 3Pfizer Inc.

Himmelstein, Daniel

Networks with multiple node and relationship types, called hetnets, provide an ideal data structure to integrate biomedical knowledge. One example, Hetionet, has 47 thousand nodes of 11 types and 2.25 million relationships of 24 types covering diseases, small molecule drugs, and the entities in between, which range from molecular (e.g. genes & pathways) to organismal (e.g. side effects & symptoms). We are building a search engine for hetnet connectivity on the Hetionet network. We want to provide users with an immediate answer to the question, "how are these two nodes connected?" We approach this problem by identifying types of paths where a source and target node are connected more than expected by chance (i.e. based on their degrees alone). While still a work in progress on GitHub (https://github.com/greenelab/hetmech), the project is nearing a prototype web application. Reaching this stage required several methodological advances. First, we implemented efficient path counting algorithms in Python based on matrix multiplication. A new HetMat data structure provides efficient on-disk storage of hetnets, optimized for matrix operations and caching. We designed a novel gamma-hurdle method for assessing the null distribution of a degree-weighted path count (DWPC) for a given pair of source-target node degrees. Using these techniques, we computed measures of connectivity between all node-pairs for the 2,205 types of paths (metapaths) with length ≤ 3 in Hetionet v1.0 (available at https://doi.org/cww7). Now, we aim to expose the hidden information these measures capture: namely, how are two nodes related in terms of metapaths, individual paths, and intermediate nodes. Stop by our poster to learn more and discuss how this search engine can help you peruse biomedical knowledge or interpret your computational predictions.
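Path counting by matrix multiplication, and its degree-weighted variant, can be illustrated on a toy metapath. The sketch below is written in plain Python for self-containment and is not the hetmech implementation; the function names, the toy adjacency matrices, and the damping exponent of 0.5 are assumptions. Each edge is down-weighted by the degrees of its endpoints raised to the negative damping exponent, and the weighted matrices are multiplied along the metapath.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def degree_weight(adj, damping=0.5):
    """Scale each edge by (source degree * target degree) ** -damping."""
    row_deg = [sum(row) for row in adj]
    col_deg = [sum(col) for col in zip(*adj)]
    return [
        [
            (row_deg[i] * col_deg[j]) ** -damping if adj[i][j] else 0.0
            for j in range(len(adj[i]))
        ]
        for i in range(len(adj))
    ]


def dwpc(metapath_adjacencies, damping=0.5):
    """Degree-weighted path counts along a metapath of distinct node types."""
    result = degree_weight(metapath_adjacencies[0], damping)
    for adj in metapath_adjacencies[1:]:
        result = matmul(result, degree_weight(adj, damping))
    return result


# Toy hetnet: 2 compounds, 3 genes, 2 diseases.
compound_binds_gene = [[1, 1, 0], [0, 1, 1]]
gene_associates_disease = [[1, 0], [1, 1], [0, 1]]
path_counts = matmul(compound_binds_gene, gene_associates_disease)
toy_dwpc = dwpc([compound_binds_gene, gene_associates_disease])
```

Matrix products count walks, which coincide with paths only when no node can repeat, as in this Compound-Gene-Disease example; metapaths that revisit a node type need an extra correction that is omitted here. With the damping exponent set to 0 the DWPC reduces to the plain path count.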

97

WORKSHOP: TEXT MINING AND MACHINE LEARNING FOR PRECISION MEDICINE

POSTER PRESENTATION

98

LITVAR: MINING GENOMIC VARIANTS FROM BIOMEDICAL LITERATURE FOR DATABASE CURATION AND PRECISION MEDICINE

Alexis Allot, Yifan Peng, Chih-Hsuan Wei, Kyubum Lee, Lon Phan, Zhiyong Lu

National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894

Lu, Zhiyong

The identification and interpretation of genomic variants play a key role in the diagnosis of genetic diseases and related research in the era of precision medicine. To stay up to date, researchers must process an ever-increasing amount of new publications. This task is complicated by two factors. First, authors use multiple abbreviations to refer to the same variant. For example, "A146T", "c.436G>A", and "Ala146Thr" all refer to the same variant rs121913527. Second, the same abbreviation (e.g., p.Ala94Thr) can refer to different variants in different genes. A simple search on PubMed would thus return only a subset of all relevant articles for the variant of interest, while returning many articles that are irrelevant.
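The first complication, several spellings for one protein-level change, can be illustrated with a small normalizer that maps three-letter amino acid mentions onto the one-letter form. This is only a sketch of the naming problem, not LitVar's method: the function name is hypothetical, DNA-level mentions such as "c.436G>A" are left unresolved because mapping them requires a transcript, and the gene ambiguity described above is not addressed at all.

```python
import re

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_LETTER = set(THREE_TO_ONE.values())


def normalize_protein_variant(mention):
    """Return a one-letter substitution such as 'A146T', or None if unrecognized."""
    text = mention.strip()
    if text.startswith("p."):
        text = text[2:]
    match = re.fullmatch(r"([A-Za-z]{3})(\d+)([A-Za-z]{3})", text)
    if match:
        ref = THREE_TO_ONE.get(match.group(1).capitalize())
        alt = THREE_TO_ONE.get(match.group(3).capitalize())
        if ref and alt:
            return f"{ref}{match.group(2)}{alt}"
        return None
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", text)
    if match and match.group(1) in ONE_LETTER and match.group(3) in ONE_LETTER:
        return text
    return None
```

Two mentions that normalize to the same string may still belong to different genes, so a real system also needs the gene context to decide which variant is meant.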

To help scientists, healthcare professionals, and database curators find the most up-to-date published variant research, we have developed LitVar, a novel web server for linking genomic variant data in the literature with an intuitive UI [1]. Specifically, it employs a suite of state-of-the-art entity recognition tools as its backend processing method. LitVar combines robust and advanced text mining with data integrations from PubMed (>28 million abstracts) and PubMed Central Subset (>2.7 million full-length articles) to improve both sensitivity and specificity. As of May 2018, there are more than 2 million unique variants in our system, associated with hundreds of thousands of publications from PubMed and PMC Open Access Subset. Compared with PubMed, LitVar achieved an increase in sensitivity and specificity. For example, with a search of "rs113488022", no results can be found in PubMed, but over 6,000 articles are returned by LitVar. On the other hand, a search for "H199R" on PubMed will return articles where this variant occurs in either the gene LIN28B (PMID: 22964795) or CFTR (PMID: 15084222), while the disambiguation process of LitVar will allow the user to select precisely the variant (and gene) of interest.

To further assist users, LitVar allows matching publications to be filtered by journal, type, date or part of publication. Moreover, publications' popularity in time can be visualised as a zoomable histogram. In addition to the website, LitVar provides REST APIs to allow users to disambiguate a textual query into a list of top matching variants, or perform large-scale analysis, by retrieving publications linked to hundreds of supplied dbSNP identifiers in one query.

LitVar is now integrated in dbSNP. The newly added link allows users not only to view more publications than with the link to PubMed, but also to assess the context (sentence and related diseases, chemicals and other variants) in which the variant appears in each publication.

LitVar is publicly available at https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/LitVar.

[1] Allot, A., Peng, Y., Wei, C.H., Lee, K., Phan, L. and Lu, Z. (2018) LitVar: a semantic search engine for linking genomic variant data in PubMed and PMC. Nucleic Acids Res.

99

AUTHOR INDEX

A

Abyzov, Alexej · 65
Adusumilli, Ravali · 11, 83
Alkan, Can · 67
Allot, Alexis · 98
Altman, Russ B. · 89
Anand, Shankara · 31
Andrechek, Eran R. · 60
Antaki, Danny · 87
Ausavarungnirun, Rachata · 67
Azizi, Shekoofeh · 34

B

Bae, Ho · 32
Baird, Lisa · 81
Baldwin, Edwin · 53
Beam, Andrew L. · 33
Beaulieu-Jones, Brett K. · 33
Bedi, Rishi · 45
Benchek, Penny · 73
Berger, Bonnie · 29
Berghout, Joanne · 20
Best, Aaron · 7
Bielinski, Suzette J. · 46
Bingöl, Zülal · 67
Black III, John Logan · 46
Bobak, Carly · 47
Bobe, Jason R. · 28
Boerwinkle, Eric · 46
Boland, Mary Regina · 3, 77
Bone, William · 21, 88
Boussard, Soline M. · 48
Boyce, Hunter · 11, 83
Bradford, Yuki · 39
BrainSeq Consortium · 49
Brenner, Steven E. · 94
Brugler, Mercer R. · 74
Burke, Emily E. · 49
Bush, William S. · 73, 82
Byers, Lauren A. · 91

C

Capra, John A. · 82
Carter, Hannah · 10, 36, 70
Castorino, John · 62
Chen, Bin · 17, 60
Chen, Rachel · 7, 27
Chen, Yang · 23
Chen, Yong · 3, 77
Cheng, Li-Fang · 14
Choo, Dongwon · 63, 79
Chow, Cheryl-Emiliane · 64
Chrisman, Brianna Sierra · 19
Christensen, Brock C. · 47, 80
Chung, Wendy K. · 21, 88
Clemmensen, Line · 86
Cohen, William W. · 37
Collado-Torres, Leonardo · 49
Conway, Kathleen · 69
Cooper, Bruce · 54
Cooper, David N. · 87
Coukos, George · 10
Crosslin, David · 21, 88
Cule, Madeleine · 15

D

Dabbagh, Karim · 64
De Freitas, Jessica K. · 28
De, Supriyo · 50
Deep-Soboslay, Amy · 49
DeJongh, Matthew · 7
Denny, Joshua C. · 21, 88
DePristo, Mark · 15
DeSantis, Todd · 64
Ding, Daisy Yi · 2
Dinu, Valentin · 59
Doerr, Megan · 43
Donnelly, Louise · 86
Dow, Michelle · 36
Draghici, Sorin · 51
Duan, Rui · 3, 77
Dudley, Joel T. · 28

E

Edmiston, Sharon N. · 69
Emani, Prashant S. · 93
Engelhardt, Barbara E. · 14
Ersboell, Bjarne · 86

F

Fan, Jungwei · 20
Fasel, David · 21, 88
Feinberg, Nana Nikolaishvili · 69
Fondran, Jeremy · 73
Fong, Lon W. · 75
Fornes, Oriol · 71
Fraenkel, Ernest · 41
Francavilla, C. · 72
Friedl, Verena · 5, 78
Friend, Derek · 27
Furukawa, Tetsu · 52

G

Garijo, Daniel · 11, 83
Gasdaska, Angela · 27
Gay, Carl M. · 91
Genolet, Raphael · 10
Gerstein, Mark B. · 93, 94
Gfeller, David · 10
Ghose, Saugata · 67
Ghosh, Debashis · 66
Gibbs, Richard A. · 46
Gil, Yolanda · 11, 83
Gilbert-Diamond, Diane · 80
Glanville, Jacob · 45
Glicksberg, Benjamin S. · 28, 60
Glisson, Bonnie · 91
Gold, Maxwell P. · 41
Gonzalez-Hernandez, Graciela · 9
Gordon, Max · 4
Gorospe, Myriam · 50
Graham, Kareem · 64
Graim, Kiley · 5, 78
Grayson, Shira · 43
Greene, Casey S. · 24, 96
Greenside, Peyton · 15
Gupta, Ramneek · 86
Gursoy, Gamze · 93

H

Haas, David W. · 39
Haines, Jonathan · 73
Hakonarson, Hakon · 21, 88
Han, Jiali · 53
Han, Wontack · 16
Harari, Alexandre · 10
Harris, Kimberley J. · 46
Hebbring, Scott · 21, 88
Henry, Christopher · 7
Hernandez-Boussard, Tina · 48
Heymach, John V. · 91
Hill, Jane E. · 47
Himmelstein, Daniel · 96
Ho, Irvin · 18
Hoffmann, Thomas J. · 61
Houlahan, Kathleen E. · 5, 78
Hovde, Rachel · 45
Hsieh, Elena · 66
Hu, Qiwen · 24
Hu, Zhiqiang · 94
Hu, Zhiyue Tom · 17
Huang, Beibei · 75
Huang, Haiyan · 17
Huang, Kun · 25
Hyde, Thomas M. · 49

I

Iakoucheva, Lilia M. · 87
Iribarren, Carlos · 61
Iwai, Shoko · 64

J

Jaffe, Andrew E. · 49
Jain, Shantanu · 35, 85
Jarvik, Gail P. · 21, 88
Jiang, Yuexu · 6
Jin, Qiao · 37
Johnson, Kipp W. · 28
Johnson, Travis · 25
Jorde, Lynn B. · 81
Jung, Jae-Yoon · 19
Jung, Kenneth · 2

K

Kale, Dave C. · 2
Kalesinskas, Laurynas · 31
Kalsi, Gurpreet S. · 67
Kang, Byungkon · 57
Kang, Seokwoo · 79
Kaserer, Bettina · 55
Khan, Aly A. · 18
Kiefel, Helena · 64
Kim, Dokyoon · 57
Kim, Jeremie S. · 67
Kim, Seonghyeon · 63, 79
Kim, Woo Joo · 58
Klein, Teri E. · 48, 89
Kleinman, Joel E. · 49
Kloster, Kyle · 96
Kober, Kord M. · 54
Kohane, Isaac S. · 33
Krauss, Ronald M. · 61
Krunic, Milica · 55
Kullo, Iftikhar · 21, 88
Kwak, Minjung · 79
Kwon, Sunyoung · 32

L

Larson, Eric B. · 21, 88
Lau, Denise · 18
Le, Trang T. · 56
Lee, Byunghan · 32
Lee, Dohyeon · 63, 79
Lee, Garam · 57
Lee, Jae Kyung · 58
Lee, Jinhee · 63, 79
Lee, Kyubum · 98
LeNail, Alexander · 41
Leppert, Mark · 81
Levine, Jon D. · 54
Li, Binglan · 39
Li, Haiquan · 20, 53
Li, Jianrong · 20
Li, Kevin · 7
Li, Qike · 20
Lim, Sooyeon · 58
Linan, Margaret · 59
Lindsey, William · 7, 27
Liu, Zheng · 8
Liu, Ke · 60
Liu, Xiang · 37
Lu, Zhiyong · 98
Lucas, Anastasia M. · 21, 39, 88
Lugo-Martinez, Jose · 85
Lussier, Yves A. · 20

M

Machiraju, Raghu · 11, 83
Magge, Arjun · 9
Mallick, Parag · 11, 83
Marsit, Carmen J. · 80
Mastick, Judy · 54
Mayani, Rajiv · 11, 83
McKinney, Brett A. · 56
Medina, Marisa W. · 61
Meiler, Jens · 82
Miaskowski, Christine · 54
Miller, Jason E. · 61
Mooney, Sean D. · 87
Moore, Abigail · 62
Moore, Jason H. · 3, 56, 77
Moore, Sarah · 43
Mort, Matthew · 87
Mousavi, Parvin · 34
Müllauer, Leonhard · 55
Muse, Meghan E. · 47, 80
Mutlu, Onur · 67

N

Nagle, Michael · 96
Newbury, Patrick A. · 17, 60
Nguyen, Tin · 51
Nguyen, Tuan-Minh · 51
Nho, Kwangsik · 57
Nielsen, Agnes Martine · 86
Nielsen, Rikke Linnemann · 86
Noh, Ji Yun · 58
Nori, Anant V. · 67

O

O'Malley, A. James · 47
Oh, Dongpin · 63
Ouyang, Zhengqing · 23

P

Pagel, Kymberleigh A. · 87
Panda, Amaresh C. · 50
Parker, Joel S. · 69
Paskov, Kelley Marie · 19
Patel, Neel · 82
Paul, Steven · 54
Pearson, Ewan · 86
Pedersen, Brent S. · 81
Pendergrass, Sarah A. · 21, 88
Peng, Yifan · 98
Petersen, Curtis L. · 80
Peterson, Amy · 49
Peterson, Sandra E. · 46
Pfohl, Stephen · 2
Phan, Lon · 98
Poplin, Ryan · 15
Prasad, Niranjani · 14
Pyke, Rachel M. · 10
Pyman, Blake · 34

Q

Quinlan, Aaron R. · 81

R

Radivojac, Predrag · 35, 85, 87
Rajpurohit, Anandita · 49
Ramola, Rashika · 35
Ramsey, Stephen A. · 8
Rasmussen-Torvik, Laura · 21, 88
Ratnakar, Varun · 11, 83
Ravichandar, Jayamary Divya · 64
Reiman, Derek · 18
Renwick, Neil · 34
Richmond, Phillip A. · 71
Risch, Neil · 61
Ritchie, Marylyn D. · 21, 39, 48, 61, 88
Robson, Paul · 91
Rodriguez, Estefania · 74
Roychowdhury, Tanmoy · 65
Rudra, Pratyaydipta · 66
Rutherford, Erica · 64

S

Sahinalp, Cenk · 29
Salit, Marc · 15
Sanders, Charles R. · 82
Sarker, Abeed · 9
Sasani, Thomas A. · 81
Schaid, Daniel · 21, 88
Scherer, Steven · 46
Schwartz, J. M. · 72
Scotch, Matthew · 9
Sebat, Jonathan · 87
Sedghi, Alireza · 34
Semick, Stephen A. · 49
Senol Cali, Damla · 67
Sha, Lingdao · 18
Shafi, Adib · 51
Shah, Nigam H. · 2
Shin, Joo Heon · 49
Sicotte, Hugues · 46
Simmons, Sean · 29
Simpson, Chloe · 2
Sivley, R. Michael · 82
Skola, Dylan · 36
Sleiman, Patrick · 21, 88
Sliwoski, Gregory · 82
Smail, Craig · 31
Smoller, Jordan W. · 21, 88
Sohn, Kyung-Ah · 57
Song, Giltae · 63, 79
Srivastava, Arunima · 11, 83
Stanaway, Ian · 21, 88
Stewart, C. Allison · 91
Stockham, Nate Tyler · 19
Straub, Richard E. · 49
Stuart, Joshua M. · 5, 78
Subramanian, Lavanya · 67
Subramoney, Sreenivas · 67
Sullivan, Blair · 96
Suver, Christine · 43

T

Takenaka, Yoichi · 68
Tan, Timothy · 18
Tanigawa, Yosuke · 31
Tao, Ran · 49
Tao, Yifeng · 37
Theusch, Elizabeth · 61
Thomas, Nancy E. · 69
Tintle, Nathan · 7, 27
Titus, Alexander J. · 47
Toh, Hiroyuki · 52
Tran, Hai · 91
Trosset, Michael W. · 85
Tsai, Yihsuan · 69
Tsirigos, Konstantinos · 86
Tsui, Brian · 36, 70
Tyryshkin, Kathrin · 34

U

Ulrich, William S. · 49
Urbanowicz, Ryan J. · 56

V

Valencia, Cristian · 49
van der Lee, Robin · 71
Varma, Maya · 19
Venhuizen, Peter · 55
Verma, Anurag · 21, 39, 88
Verma, Shefali S. · 21, 39, 88
Veturi, Yogasudha · 21, 39, 88
Vitali, Francesca · 20
von Haeseler, Arndt · 55

W

Wagner, Jennifer · 43
Wall, Dennis Paul · 19
Wang, Duolin · 6
Wang, Haohan · 12, 37
Wang, Jing · 91
Wang, Junwen · 59
Wang, Liewei · 46
Wang, Tongxin · 25
Washington, Peter Yigitcan · 19
Wasserman, Wyeth W. · 71
Watson, J. · 72
Weeder, Benjamin · 8
Wei, Chih-Hsuan · 98
Wei, Qi · 8
Wei, Wei-Qi · 21, 88
Weinberger, Daniel R. · 49
Weinmaier, Thomas · 64
Weinshilboum, Richard · 46
Weissenbacher, Davy · 9
Weng, Chunhua · 21, 88
Westra, Jason · 27
Whaley, Ryan M. · 89
Wheeler, Nicholas · 73
Whirl-Carrillo, Michelle · 48, 89
White, Martha · 85
White, Ray · 81
Wilbanks, John · 43
Williams, Cranos · 4
Woon, Mark · 89
Wu, Yonggan · 64
Wu, Zhenglin · 12

X

Xi, Yuanxin · 91
Xiao, Madelyne · 74
Xing, Eric P. · 12, 37
Xu, Dong · 6

Y

Yao, Yao · 8
Ye, Wenting · 37
Ye, Yuting · 17
Ye, Yuzhen · 16
Yin, Fei · 53
Yoon, Sungroh · 32
Yu, Thomas · 11, 83

Z

Zawistowski, Matthew · 27
Zeng, William · 60
Zhang, Jie · 25
Zhang, Shuxing · 75
Zhang, Xinyuan · 21, 88
Zhang, Yuping · 23
Zhou, Jin · 53
Zhou, Kaixin · 86
Zietz, Michael · 96
Zook, Justin · 15
