PACIFIC SYMPOSIUM ON BIOCOMPUTING 2019
ABSTRACT BOOK

Poster Presenters: Poster space is assigned by abstract page number. Please find the page that your abstract is on and put your poster on the poster board with the corresponding number (e.g., if your abstract is on page 50, put your poster on board #50).

Proceedings papers with oral presentations #2-29 are not assigned poster space.

Abstracts are organized first by session, then the last name of the first author. Presenting authors’ names are in bold text.
TABLE OF CONTENTS

PROCEEDINGS PAPERS WITH ORAL PRESENTATION

PATTERN RECOGNITION IN BIOMEDICAL DATA: CHALLENGES IN PUTTING BIG DATA TO WORK ..... 1
THE EFFECTIVENESS OF MULTITASK LEARNING FOR PHENOTYPING WITH ELECTRONIC HEALTH RECORDS DATA ..... 2
Daisy Yi Ding, Chloe Simpson, Stephen Pfohl, Dave C. Kale, Kenneth Jung, Nigam H. Shah ..... 2
ODAL: A ONE-SHOT DISTRIBUTED ALGORITHM TO PERFORM LOGISTIC REGRESSIONS ON ELECTRONIC HEALTH RECORDS DATA FROM MULTIPLE CLINICAL SITES ..... 3
Rui Duan, Mary Regina Boland, Jason H. Moore, Yong Chen ..... 3
PVC DETECTION USING A CONVOLUTIONAL AUTOENCODER AND RANDOM FOREST CLASSIFIER ..... 4
Max Gordon, Cranos Williams ..... 4
PLATYPUS: A MULTIPLE-VIEW LEARNING PREDICTIVE FRAMEWORK FOR CANCER DRUG SENSITIVITY PREDICTION ..... 5
Kiley Graim, Verena Friedl, Kathleen E. Houlahan, Joshua M. Stuart ..... 5
DEEPDOM: PREDICTING PROTEIN DOMAIN BOUNDARY FROM SEQUENCE ALONE USING STACKED BIDIRECTIONAL LSTM ..... 6
Yuexu Jiang, Duolin Wang, Dong Xu ..... 6
IMPLEMENTING AND EVALUATING A GAUSSIAN MIXTURE FRAMEWORK FOR IDENTIFYING GENE FUNCTION FROM TNSEQ DATA ..... 7
Kevin Li, Rachel Chen, William Lindsey, Aaron Best, Matthew DeJongh, Christopher Henry, Nathan Tintle ..... 7
RES2S2AM: DEEP RESIDUAL NETWORK-BASED MODEL FOR IDENTIFYING FUNCTIONAL NONCODING SNPS IN TRAIT-ASSOCIATED REGIONS ..... 8
Zheng Liu, Yao Yao, Qi Wei, Benjamin Weeder, Stephen A. Ramsey ..... 8
BI-DIRECTIONAL RECURRENT NEURAL NETWORK MODELS FOR GEOGRAPHIC LOCATION EXTRACTION IN BIOMEDICAL LITERATURE ..... 9
Arjun Magge, Davy Weissenbacher, Abeed Sarker, Matthew Scotch, Graciela Gonzalez-Hernandez ..... 9
COMPUTATIONAL KIR COPY NUMBER DISCOVERY REVEALS INTERACTION BETWEEN INHIBITORY RECEPTOR BURDEN AND SURVIVAL ..... 10
Rachel M. Pyke, Raphael Genolet, Alexandre Harari, George Coukos, David Gfeller, Hannah Carter ..... 10
SEMANTIC WORKFLOWS FOR BENCHMARK CHALLENGES: ENHANCING COMPARABILITY, REUSABILITY AND REPRODUCIBILITY ..... 11
Arunima Srivastava, Ravali Adusumilli, Hunter Boyce, Daniel Garijo, Varun Ratnakar, Rajiv Mayani, Thomas Yu, Raghu Machiraju, Yolanda Gil, Parag Mallick ..... 11
REMOVING CONFOUNDING FACTORS ASSOCIATED WEIGHTS IN DEEP NEURAL NETWORKS IMPROVES THE PREDICTION ACCURACY FOR HEALTHCARE APPLICATIONS ..... 12
Haohan Wang, Zhenglin Wu, Eric P. Xing ..... 12

PRECISION MEDICINE: IMPROVING HEALTH THROUGH HIGH-RESOLUTION ANALYSIS OF PERSONAL DATA ..... 13
AN OPTIMAL POLICY FOR PATIENT LABORATORY TESTS IN INTENSIVE CARE UNITS ..... 14
Li-Fang Cheng, Niranjani Prasad, Barbara E. Engelhardt ..... 14
CROWDVARIANT: A CROWDSOURCING APPROACH TO CLASSIFY COPY NUMBER VARIANTS ..... 15
Peyton Greenside, Justin Zook, Marc Salit, Ryan Poplin, Madeleine Cule, Mark DePristo ..... 15
A REPOSITORY OF MICROBIAL MARKER GENES RELATED TO HUMAN HEALTH AND DISEASES FOR HOST PHENOTYPE PREDICTION USING MICROBIOME DATA ..... 16
Wontack Han, Yuzhen Ye ..... 16
AICM: A GENUINE FRAMEWORK FOR CORRECTING INCONSISTENCY BETWEEN LARGE PHARMACOGENOMICS DATASETS ..... 17
Zhiyue Tom Hu, Yuting Ye, Patrick A. Newbury, Haiyan Huang, Bin Chen ..... 17
INTEGRATING RNA EXPRESSION AND VISUAL FEATURES FOR IMMUNE INFILTRATE PREDICTION ..... 18
Derek Reiman, Lingdao Sha, Irvin Ho, Timothy Tan, Denise Lau, Aly A. Khan ..... 18
OUTGROUP MACHINE LEARNING APPROACH IDENTIFIES SINGLE NUCLEOTIDE VARIANTS IN NONCODING DNA ASSOCIATED WITH AUTISM SPECTRUM DISORDER ..... 19
Maya Varma, Kelley Marie Paskov, Jae-Yoon Jung, Brianna Sierra Chrisman, Nate Tyler Stockham, Peter Yigitcan Washington, Dennis Paul Wall ..... 19
PRECISION DRUG REPURPOSING VIA CONVERGENT EQTL-BASED MOLECULES AND PATHWAY TARGETING INDEPENDENT DISEASE-ASSOCIATED POLYMORPHISMS ..... 20
Francesca Vitali, Joanne Berghout, Jungwei Fan, Jianrong Li, Qike Li, Haiquan Li, Yves A. Lussier ..... 20
DETECTING POTENTIAL PLEIOTROPY ACROSS CARDIOVASCULAR AND NEUROLOGICAL DISEASES USING UNIVARIATE, BIVARIATE, AND MULTIVARIATE METHODS ON 43,870 INDIVIDUALS FROM THE EMERGE NETWORK ..... 21
Xinyuan Zhang, Yogasudha Veturi, Shefali S. Verma, William Bone, Anurag Verma, Anastasia M. Lucas, Scott Hebbring, Joshua C. Denny, Ian Stanaway, Gail P. Jarvik, David Crosslin, Eric B. Larson, Laura Rasmussen-Torvik, Sarah A. Pendergrass, Jordan W. Smoller, Hakon Hakonarson, Patrick Sleiman, Chunhua Weng, David Fasel, Wei-Qi Wei, Iftikhar Kullo, Daniel Schaid, Wendy K. Chung, Marylyn D. Ritchie ..... 21

SINGLE CELL ANALYSIS – WHAT IS THE FUTURE? ..... 22
LISA: ACCURATE RECONSTRUCTION OF CELL TRAJECTORY AND PSEUDO-TIME FOR MASSIVE SINGLE CELL RNA-SEQ DATA ..... 23
Yang Chen, Yuping Zhang, Zhengqing Ouyang ..... 23
PARAMETER TUNING IS A KEY PART OF DIMENSIONALITY REDUCTION VIA DEEP VARIATIONAL AUTOENCODERS FOR SINGLE CELL RNA TRANSCRIPTOMICS ..... 24
Qiwen Hu, Casey S. Greene ..... 24
TOPOLOGICAL METHODS FOR VISUALIZATION AND ANALYSIS OF HIGH DIMENSIONAL SINGLE-CELL RNA SEQUENCING DATA ..... 25
Tongxin Wang, Travis Johnson, Jie Zhang, Kun Huang ..... 25

WHEN BIOLOGY GETS PERSONAL: HIDDEN CHALLENGES OF PRIVACY AND ETHICS IN BIOLOGICAL BIG DATA ..... 26
LEVERAGING SUMMARY STATISTICS TO MAKE INFERENCES ABOUT COMPLEX PHENOTYPES IN LARGE BIOBANKS ..... 27
Angela Gasdaska, Derek Friend, Rachel Chen, Jason Westra, Matthew Zawistowski, William Lindsey, Nathan Tintle ..... 27
EVALUATION OF PATIENT RE-IDENTIFICATION USING LABORATORY TEST ORDERS AND MITIGATION VIA LATENT SPACE VARIABLES ..... 28
Kipp W. Johnson, Jessica K. De Freitas, Benjamin S. Glicksberg, Jason R. Bobe, Joel T. Dudley ..... 28
PROTECTING GENOMIC DATA PRIVACY WITH PROBABILISTIC MODELING ..... 29
Sean Simmons, Bonnie Berger, Cenk Sahinalp ..... 29

PROCEEDINGS PAPERS WITH POSTER PRESENTATIONS

PATTERN RECOGNITION IN BIOMEDICAL DATA: CHALLENGES IN PUTTING BIG DATA TO WORK ..... 30
SNPS2CHIP: LATENT FACTORS OF CHIP-SEQ TO INFER FUNCTIONS OF NON-CODING SNPS ..... 31
Shankara Anand, Laurynas Kalesinskas, Craig Smail, Yosuke Tanigawa ..... 31
DNA STEGANALYSIS USING DEEP RECURRENT NEURAL NETWORKS ..... 32
Ho Bae, Byunghan Lee, Sunyoung Kwon, Sungroh Yoon ..... 32
LEARNING CONTEXTUAL HIERARCHICAL STRUCTURE OF MEDICAL CONCEPTS WITH POINCAIRÉ EMBEDDINGS TO CLARIFY PHENOTYPES ..... 33
Brett K. Beaulieu-Jones, Isaac S. Kohane, Andrew L. Beam ..... 33
EXPLORING MICRORNA REGULATION OF CANCER WITH CONTEXT-AWARE DEEP CANCER CLASSIFIER ..... 34
Blake Pyman, Alireza Sedghi, Shekoofeh Azizi, Kathrin Tyryshkin, Neil Renwick, Parvin Mousavi ..... 34
ESTIMATING CLASSIFICATION ACCURACY IN POSITIVE-UNLABELED LEARNING: CHARACTERIZATION AND CORRECTION STRATEGIES ..... 35
Rashika Ramola, Shantanu Jain, Predrag Radivojac ..... 35
EXTRACTING ALLELIC READ COUNTS FROM 250,000 HUMAN SEQUENCING RUNS IN SEQUENCE READ ARCHIVE ..... 36
Brian Tsui, Michelle Dow, Dylan Skola, Hannah Carter ..... 36
AUTOMATIC HUMAN-LIKE MINING AND CONSTRUCTING RELIABLE GENETIC ASSOCIATION DATABASE WITH DEEP REINFORCEMENT LEARNING ..... 37
Haohan Wang, Xiang Liu, Yifeng Tao, Wenting Ye, Qiao Jin, William W. Cohen, Eric P. Xing ..... 37

PRECISION MEDICINE: IMPROVING HEALTH THROUGH HIGH-RESOLUTION ANALYSIS OF PERSONAL DATA ..... 38
INFLUENCE OF TISSUE CONTEXT ON GENE PRIORITIZATION FOR PREDICTED TRANSCRIPTOME-WIDE ASSOCIATION STUDIES ..... 39
Binglan Li, Yogasudha Veturi, Yuki Bradford, Shefali S. Verma, Anurag Verma, Anastasia M. Lucas, David W. Haas, Marylyn D. Ritchie ..... 39

SINGLE CELL ANALYSIS – WHAT IS THE FUTURE? ..... 40
SHALLOW SPARSELY-CONNECTED AUTOENCODERS FOR GENE SET PROJECTION ..... 41
Maxwell P. Gold, Alexander LeNail, Ernest Fraenkel ..... 41

WHEN BIOLOGY GETS PERSONAL: HIDDEN CHALLENGES OF PRIVACY AND ETHICS IN BIOLOGICAL BIG DATA ..... 42
IMPLEMENTING A UNIVERSAL INFORMED CONSENT PROCESS FOR THE ALL OF US RESEARCH PROGRAM ..... 43
Megan Doerr, Shira Grayson, Sarah Moore, Christine Suver, John Wilbanks, Jennifer Wagner ..... 43

POSTER PRESENTATIONS

GENERAL ..... 44
A CONVOLUTIONAL NEURAL NET PREDICTS BINDING PROPERTIES OF AN ANTIBODY LIBRARY ..... 45
Rishi Bedi, Rachel Hovde, Jacob Glanville ..... 45
CNVAR: A SOFTWARE TOOL FOR GENOTYPING CYP2D6 USING SHORT READ NEXT GENERATION SEQUENCING TECHNOLOGY ..... 46
John Logan Black III MD, Hugues Sicotte PhD, Sandra E. Peterson, Kimberley J. Harris, Liewei Wang MD PhD, Steven Scherer PhD, Eric Boerwinkle PhD, Richard A. Gibbs PhD, Suzette J. Bielinski PhD, Richard Weinshilboum MD ..... 46
NETWORK ANALYSIS OF DISTINCT COHORTS ALLOWS FOR THE COMPARISON OF KEY BIOLOGICAL FUNCTIONS RELATED TO TB PATHOGENESIS ..... 47
Carly Bobak, Meghan E. Muse, Alexander J. Titus, Brock C. Christensen, A. James O'Malley, Jane E. Hill ..... 47
VARIATION IN OPIOID PRESCRIBING PATTERNS IN SURGICAL POPULATIONS ..... 48
Soline M. Boussard, Marylyn D. Ritchie, Michelle Whirl-Carrillo, Tina Hernandez-Boussard, Teri E. Klein ..... 48
REGIONAL HETEROGENEITY IN GENE EXPRESSION, REGULATION AND COHERENCE IN HIPPOCAMPUS AND DORSOLATERAL PREFRONTAL CORTEX ACROSS DEVELOPMENT AND SCHIZOPHRENIA ..... 49
Leonardo Collado-Torres, Emily E. Burke, Amy Peterson, Joo Heon Shin, Richard E. Straub, Anandita Rajpurohit, Stephen A. Semick, William S. Ulrich, BrainSeq Consortium, Cristian Valencia, Ran Tao, Amy Deep-Soboslay, Thomas M. Hyde, Joel E. Kleinman, Daniel R. Weinberger, Andrew E. Jaffe ..... 49
FULL-LENGTH SEQUENCE ASSEMBLY AND CHARACTERIZATION OF HIGHLY PURIFIED CIRCRNA ISOFORMS ..... 50
Supriyo De, Amaresh C. Panda, Myriam Gorospe ..... 50
A COMPREHENSIVE REVIEW AND ASSESSMENT OF EXISTING PATHWAY ANALYSIS APPROACHES ..... 51
Tuan-Minh Nguyen, Adib Shafi, Tin Nguyen, Sorin Draghici ..... 51
A NEW PHYLOGENETIC SAMPLING METHOD USING GENERALIZED-ENSEMBLE ALGORITHM ..... 52
Tetsu Furukawa, Hiroyuki Toh ..... 52
CONVERGENT MECHANISMS PERTURBED BY SCATTERED SNPS SUSCEPTIBLE TO ALZHEIMER'S DISEASE ..... 53
Jiali Han, Edwin Baldwin, Jin Zhou, Fei Yin, Haiquan Li ..... 53
IDENTIFICATION AND EVALUATION OF CO-EXPRESSION GENE NETWORKS FOR PACLITAXEL-INDUCED PERIPHERAL NEUROPATHY IN BREAST CANCER SURVIVORS ..... 54
Kord M. Kober, Jon D. Levine, Judy Mastick, Bruce Cooper, Steven Paul, Christine Miaskowski ..... 54
VARIFI - WEB-BASED AUTOMATIC VARIANT IDENTIFICATION, FILTERING AND ANNOTATION OF AMPLICON SEQUENCING DATA ..... 55
Milica Krunic, Peter Venhuizen, Leonhard Müllauer, Bettina Kaserer, Arndt von Haeseler ..... 55
STATISTICAL INFERENCE RELIEF (STIR) FEATURE SELECTION ..... 56
Trang T. Le, Ryan J. Urbanowicz, Jason H. Moore, Brett A. McKinney ..... 56
DEEP LEARNING-BASED LONGITUDINAL HETEROGENEOUS DATA INTEGRATION FRAMEWORK FOR AD-RELEVANT FEATURE EXTRACTION ..... 57
Garam Lee, Kwangsik Nho, Byungkon Kang, Kyung-Ah Sohn, Dokyoon Kim ..... 57
MICROBIOME ANALYSIS OF UNEXPLAINED CASES OF PNEUMONIA IN SOUTH KOREA ..... 58
Sooyeon Lim, Jae Kyung Lee, Ji Yun Noh, Woo Joo Kim ..... 58
POTRA: PATHWAY ANALYSIS OF CANCER GENOMICS DATA IN THE CLOUD ..... 59
Margaret Linan, Junwen Wang, Valentin Dinu ..... 59
EVALUATING CELL LINES AS MODELS FOR METASTATIC CANCER THROUGH INTEGRATIVE ANALYSIS OF OPEN GENOMIC DATA ..... 60
Ke Liu, Patrick A. Newbury, Benjamin S. Glicksberg, William Zeng, Eran R. Andrechek, Bin Chen ..... 60
PATHWAY ANALYSIS OF EHR AND NON-EHR-BASED GWAS CONNECTS LIPID METABOLISM TO THE IMMUNE RESPONSE ..... 61
Jason E. Miller, Thomas J. Hoffmann, Elizabeth Theusch, Carlos Iribarren, Marisa W. Medina, Neil Risch, Ronald M. Krauss, Marylyn D. Ritchie ..... 61
META-ANALYSIS OF HETEROGENEITY AND BATCH EFFECTS IN THE A549 CELL LINE ..... 62
Abigail Moore, John Castorino ..... 62
HYPERPARAMETER TUNING FOR CHIP-SEQ PEAK CALLING SOFTWARE TOOLS USING PARALLELIZED BAYESIAN OPTIMIZATION ..... 63
Dongpin Oh, Jinhee Lee, Seonghyeon Kim, Dohyeon Lee, Dongwon Choo, Giltae Song ..... 63
CROSS-STUDY META-ANALYSIS IDENTIFIES ALTERED BACTERIAL STRAINS SEPARATING RESPONDER AND NON-RESPONDER POPULATIONS ACROSS MULTIPLE CHECKPOINT-INHIBITOR THERAPY DATASETS ..... 64
Jayamary Divya Ravichandar, Erica Rutherford, Yonggan Wu, Thomas Weinmaier, Cheryl-Emiliane Chow, Shoko Iwai, Helena Kiefel, Kareem Graham, Karim Dabbagh, Todd DeSantis ..... 64
A HYPOTHESIS OF THE STABILIZING ROLE OF ALU EXPANSION VIA HOMOLOGY DIRECTED REPAIR OF SPONTANEOUS DNA DOUBLE STRANDED BREAKS ..... 65
Tanmoy Roychowdhury, Alexej Abyzov ..... 65
STATISTICAL LEARNING WITH HIGH-DIMENSIONAL MASS CYTOMETRY DATA ..... 66
Pratyaydipta Rudra, Elena Hsieh, Debashis Ghosh ..... 66
HARDWARE ACCELERATION OF APPROXIMATE STRING MATCHING FOR BOTH SHORT AND LONG READ MAPPING ..... 67
Damla Senol Cali, Lavanya Subramanian, Zülal Bingöl, Jeremie S. Kim, Rachata Ausavarungnirun, Anant V. Nori, Gurpreet S. Kalsi, Sreenivas Subramoney, Saugata Ghose, Can Alkan, Onur Mutlu ..... 67
TRANSITION OF REGULATORY FORCE TOWARD THE GENE EXPRESSIONS DURING OSTEOBLAST CELL DIFFERENTIATION ..... 68
Yoichi Takenaka ..... 68
METHYLATION PROFILES OF MELANOMA TO PREDICT TILS ..... 69
Yihsuan Tsai, Nana Nikolaishvili Feinberg, Kathleen Conway, Sharon N. Edmiston, Nancy E. Thomas, Joel S. Parker ..... 69
HIGH-THROUGHPUT GENE TO KNOWLEDGE MAPPING THROUGH MASSIVE INTEGRATION OF PUBLIC SEQUENCING DATA ..... 70
Brian Tsui, Hannah Carter ..... 70
MANTA-RAE, PREDICTING THE IMPACT OF GENOME VARIANTS ON THE TRANSCRIPTION FACTOR BINDING POTENTIAL OF REGULATORY ELEMENTS ..... 71
Robin van der Lee, Phillip A. Richmond, Oriol Fornes, Wyeth W. Wasserman ..... 71
USING QUANTITATIVE PHOSPHOPROTEOMICS TO UNDERSTAND FUNCTIONAL SELECTIVITY OF RECEPTOR TYROSINE KINASES ..... 72
J. Watson, C. Francavilla, J. M. Schwartz ..... 72
ANERIS APPLIED: SPARK-ENABLED ANALYTICS FOR FULL-SCALE AND REPRODUCIBLE ANNOTATION-BASED GENOMIC STUDIES ..... 73
Nicholas Wheeler, Jeremy Fondran, Penny Benchek, Jonathan Haines, William S. Bush ..... 73
PUTTING RELICANTHUS IN ITS PLACE: IMPACT OF MIXTURE MODEL CHOICE ON PHYLOGENETIC RECONSTRUCTION ..... 74
Madelyne Xiao, Mercer R. Brugler, Estefania Rodriguez ..... 74
RATIONAL DESIGN OF NOVEL SKP2 INHIBITORS USING DEEP NEURAL NETWORKS ..... 75
Shuxing Zhang, Beibei Huang, Lon W. Fong ..... 75

PATTERN RECOGNITION IN BIOMEDICAL DATA: CHALLENGES IN PUTTING BIG DATA TO WORK ..... 76
ODAL: A ONE-SHOT DISTRIBUTED ALGORITHM TO PERFORM LOGISTIC REGRESSIONS ON ELECTRONIC HEALTH RECORDS DATA FROM MULTIPLE CLINICAL SITES ..... 77
Rui Duan, Mary Regina Boland, Jason H. Moore, Yong Chen ..... 77
PLATYPUS: A MULTIPLE-VIEW LEARNING PREDICTIVE FRAMEWORK FOR CANCER DRUG SENSITIVITY PREDICTION ..... 78
Kiley Graim, Verena Friedl, Kathleen E. Houlahan, Joshua M. Stuart ..... 78
A SOFTWARE PIPELINE FOR DETERMINING FINE-SCALE TEMPORAL GENOME VARIATION PATTERNS IN EVOLVING POPULATIONS USING A NON-PARAMETRIC STATISTICAL TEST ..... 79
Minjung Kwak, Seokwoo Kang, Dongwon Choo, Dohyeon Lee, Jinhee Lee, Seonghyeon Kim, Giltae Song ..... 79
A DEEP LEARNING APPROACH TO IDENTIFYING THE CELLULAR COMPOSITION OF SOLID TISSUE WITH DNA METHYLATION DATA ..... 80
Meghan E. Muse, Curtis L. Petersen, Carmen J. Marsit, Diane Gilbert-Diamond, Brock C. Christensen ..... 80
DIRECTLY MEASURING THE RATE AND DYNAMICS HUMAN MUTATION BY SEQUENCING LARGE, MULTI-GENERATIONAL PEDIGREES ..... 81
Thomas A. Sasani, Brent S. Pedersen, Mark Leppert, Ray White, Lisa Baird, Aaron R. Quinlan, Lynn B. Jorde ..... 81
AVAILABLE PROTEIN 3D STRUCTURES DO NOT REFLECT HUMAN GENETIC AND FUNCTIONAL DIVERSITY ..... 82
Gregory Sliwoski, Neel Patel, R. Michael Sivley, Charles R. Sanders, Jens Meiler, William S. Bush, John A. Capra ..... 82
SEMANTIC WORKFLOWS FOR BENCHMARK CHALLENGES: ENHANCING COMPARABILITY, REUSABILITY AND REPRODUCIBILITY ..... 83
Arunima Srivastava, Ravali Adusumilli, Hunter Boyce, Daniel Garijo, Varun Ratnakar, Rajiv Mayani, Thomas Yu, Raghu Machiraju, Yolanda Gil, Parag Mallick ..... 83

PRECISION MEDICINE: IMPROVING HEALTH THROUGH HIGH-RESOLUTION ANALYSIS OF PERSONAL DATA ..... 84
CLASS PRIOR ESTIMATION AND QUANTIFICATION OF THE LOSS AND GAIN OF RESIDUE FUNCTION UPON MUTATION ..... 85
Shantanu Jain, Jose Lugo-Martinez, Martha White, Michael W. Trosset, Predrag Radivojac ..... 85
PREDICTION OF TIME TO INSULIN USING CLINICAL AND GENETIC BIOMARKERS IN TYPE 2 DIABETES PATIENTS ..... 86
Rikke Linnemann Nielsen, Louise Donnelly, Agnes Martine Nielsen, Konstantinos Tsirigos, Kaixin Zhou, Bjarne Ersboell, Line Clemmensen, Ewan Pearson, Ramneek Gupta ..... 86
PATHOGENICITY AND FUNCTIONAL IMPACT OF INSERTION/DELETION AND STOP GAIN VARIATION IN THE HUMAN GENOME ..... 87
Kymberleigh A. Pagel, Danny Antaki, Matthew Mort, David N. Cooper, Jonathan Sebat, Lilia M. Iakoucheva, Sean D. Mooney, Predrag Radivojac ..... 87
DETECTING POTENTIAL PLEIOTROPY ACROSS CARDIOVASCULAR AND NEUROLOGICAL DISEASES USING UNIVARIATE, BIVARIATE, AND MULTIVARIATE METHODS ON 43,870 INDIVIDUALS FROM THE EMERGE NETWORK ..... 88
Xinyuan Zhang, Yogasudha Veturi, Shefali S. Verma, William Bone, Anurag Verma, Anastasia M. Lucas, Scott Hebbring, Joshua C. Denny, Ian Stanaway, Gail P. Jarvik, David Crosslin, Eric B. Larson, Laura Rasmussen-Torvik, Sarah A. Pendergrass, Jordan W. Smoller, Hakon Hakonarson, Patrick Sleiman, Chunhua Weng, David Fasel, Wei-Qi Wei, Iftikhar Kullo, Daniel Schaid, Wendy K. Chung, Marylyn D. Ritchie ..... 88
PHARMGKB: THE API AND INFOBUTTONS ..... 89
Michelle Whirl-Carrillo, Ryan M. Whaley, Mark Woon, Russ B. Altman, Teri E. Klein ..... 89

SINGLE CELL ANALYSIS – WHAT IS IN THE FUTURE? ..... 90
INTRATUMOR HETEROGENEITY (ITH) METRIC OF CIRCULATING TUMOR CELL (CTC)-DERIVED XENOGRAFT MODELS IN SMALL CELL LUNG CANCER ..... 91
Yuanxin Xi, C. Allison Stewart, Carl M. Gay, Hai Tran, Bonnie Glisson, John V. Heymach, Paul Robson, Lauren A. Byers, Jing Wang ..... 91

WHEN BIOLOGY GETS PERSONAL: HIDDEN CHALLENGES OF PRIVACY AND ETHICS IN BIOLOGICAL BIG DATA ..... 92
QUANTIFYING THE IDENTIFIABILITY OF INDIVIDUALS USING A SPARSE SET OF SNPS ..... 93
Prashant S. Emani, Gamze Gursoy, Mark B. Gerstein ..... 93
TRANSCRIPTOMIC SUMMARY SPLICING DATA MAY LEAK PERSONAL PRIVATE INFORMATION BY COMPUTATIONAL LINKAGE TO THE GENOMIC VARIANTS ..... 94
Zhiqiang Hu, Mark B. Gerstein, Steven E. Brenner ..... 94

WORKSHOPS

MERGING HETEROGENEOUS DATA TO ENABLE KNOWLEDGE DISCOVERY ..... 95
TO SEARCH A HETNET... HOW ARE TWO NODES CONNECTED? ..... 96
Daniel Himmelstein, Michael Zietz, Kyle Kloster, Michael Nagle, Blair Sullivan, Casey S. Greene ..... 96

TEXT MINING AND MACHINE LEARNING FOR PRECISION MEDICINE ..... 97
LITVAR: MINING GENOMIC VARIANTS FROM BIOMEDICAL LITERATURE FOR DATABASE CURATION AND PRECISION MEDICINE ..... 98
Alexis Allot, Yifan Peng, Chih-Hsuan Wei, Kyubum Lee, Lon Phan, Zhiyong Lu ..... 98

AUTHOR INDEX ..... 99
1
PATTERN RECOGNITION IN BIOMEDICAL DATA: CHALLENGES IN PUTTING BIG DATA TO WORK
PROCEEDINGS PAPERS WITH ORAL PRESENTATIONS
2
THE EFFECTIVENESS OF MULTITASK LEARNING FOR PHENOTYPING WITH ELECTRONIC HEALTH RECORDS DATA
Daisy Yi Ding1, Chloe Simpson1, Stephen Pfohl1, Dave C. Kale2, Kenneth Jung1, Nigam H. Shah1
1Stanford University, 2University of Southern California
Ding, Daisy Yi
Electronic phenotyping is the task of ascertaining whether an individual has a medical condition of interest by analyzing their medical record and is foundational in clinical informatics. Increasingly, electronic phenotyping is performed via supervised learning. We investigate the effectiveness of multitask learning for phenotyping using electronic health records (EHR) data. Multitask learning aims to improve model performance on a target task by jointly learning additional auxiliary tasks and has been used in disparate areas of machine learning. However, its utility when applied to EHR data has not been established, and prior work suggests that its benefits are inconsistent. We present experiments that elucidate when multitask learning with neural nets improves performance for phenotyping using EHR data relative to neural nets trained for a single phenotype and to well-tuned baselines. We find that multitask neural nets consistently outperform single-task neural nets for rare phenotypes but underperform for relatively more common phenotypes. The effect size increases as more auxiliary tasks are added. Moreover, multitask learning reduces the sensitivity of neural nets to hyperparameter settings for rare phenotypes. Last, we quantify phenotype complexity and find that neural nets trained with or without multitask learning do not improve on simple baselines unless the phenotypes are sufficiently complex.
3
ODAL: A ONE-SHOT DISTRIBUTED ALGORITHM TO PERFORM LOGISTIC REGRESSIONS ON ELECTRONIC HEALTH RECORDS DATA FROM MULTIPLE CLINICAL SITES
Rui Duan, Mary Regina Boland, Jason H. Moore, Yong Chen
Department of Biostatistics, Epidemiology & Informatics, University of Pennsylvania
Chen, Yong
Electronic Health Records (EHR) contain extensive information on various health outcomes and risk factors, and therefore have been broadly used in healthcare research. Integrating EHR data from multiple clinical sites can accelerate knowledge discovery and risk prediction by providing a larger sample size in a more general population, which potentially reduces clinical bias and improves estimation and prediction accuracy. To overcome the barrier of patient-level data sharing, distributed algorithms are developed to conduct statistical analyses across multiple sites through sharing only aggregated information. Current distributed algorithms often require iterative information evaluation and transfer across sites, which can potentially lead to a high communication cost in practical settings. In this study, we propose a privacy-preserving and communication-efficient distributed algorithm for logistic regression without requiring iterative communications across sites. Our simulation study showed that our algorithm reached accuracy comparable to that of the oracle estimator, where data are pooled together. We applied our algorithm to EHR data from the University of Pennsylvania health system to evaluate the risks of fetal loss due to various medication exposures.
4
PVC DETECTION USING A CONVOLUTIONAL AUTOENCODER AND RANDOM FOREST CLASSIFIER
Max Gordon, Cranos Williams
North Carolina State University
Gordon, Max
The accurate detection of premature ventricular contractions (PVCs) is an important task in cardiac care for some patients. In some cases, the usefulness to physicians of detecting PVCs stems from their long-term correlations with dangerous heart conditions. In other cases, their potential as a precursor to serious cardiac events may make their detection a useful early warning mechanism. In many of these applications, the long-term nature of the monitoring required and the infrequency of PVCs make manual observation for PVCs impractical. Existing methods of automated PVC detection suffer from drawbacks such as the need to use difficult-to-extract morphological features, domain-specific features, or large numbers of estimated parameters. In particular, systems using large numbers of trained parameters have the potential to require large amounts of training data and computation and may have issues generalizing due to their potential to overfit. To address some of these drawbacks, we developed a novel PVC detection algorithm based around a convolutional autoencoder and validated our method using the MIT-BIH arrhythmia database.
5
PLATYPUS: A MULTIPLE–VIEW LEARNING PREDICTIVE FRAMEWORK FOR CANCER DRUG SENSITIVITY PREDICTION
Kiley Graim, Verena Friedl, Kathleen E. Houlahan, Joshua M. Stuart
Dept. of Biomolecular Engineering University of California Santa Cruz, Flatiron Institute and Princeton University, Ontario Institute of Cancer Research and University of Toronto
Graim, Kiley
Cancer is a complex collection of diseases that are to some degree unique to each patient. Precision oncology aims to identify the best drug treatment regime using molecular data on tumor samples. While omics-level data is becoming more widely available for tumor specimens, the datasets upon which computational learning methods can be trained vary in coverage from sample to sample and from data type to data type. Methods that can ‘connect the dots’ to leverage more of the information provided by these studies could offer major advantages for maximizing predictive potential. We introduce a multi-view machine-learning strategy called PLATYPUS that builds ‘views’ from multiple data sources that are all used as features for predicting patient outcomes. We show that a learning strategy that finds agreement across the views on unlabeled data increases the performance of the learning methods over any single view. We illustrate the power of the approach by deriving signatures for drug sensitivity in a large cancer cell line database. Code and additional information are available from the PLATYPUS website https://sysbiowiki.soe.ucsc.edu/platypus.
6
DEEPDOM: PREDICTING PROTEIN DOMAIN BOUNDARY FROM SEQUENCE ALONE USING STACKED BIDIRECTIONAL LSTM
Yuexu Jiang, Duolin Wang, Dong Xu
Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, Missouri 65211, USA Email: [email protected]
Jiang, Yuexu
Protein domain boundary prediction is usually an early step to understand protein function and structure. Most current computational domain boundary prediction methods suffer from low accuracy and limitations in handling multi-domain types, or cannot be applied at all to certain targets such as proteins with discontinuous domains. We developed an ab-initio protein domain predictor using a stacked bidirectional LSTM model in deep learning. Our model is trained on a large number of protein sequences without using feature engineering such as sequence profiles. Hence, prediction using our method is much faster than with others, and the trained model can be applied to any type of target protein without constraint. We evaluated DeepDom by a 10-fold cross validation and also by applying it to targets in different categories from CASP8 and CASP9. The comparison with other methods has shown that DeepDom outperforms most of the current ab-initio methods and even achieves better results than the top-level template-based method in certain cases. The code of DeepDom and the test data we used in CASP8 and CASP9 can be accessed through GitHub at https://github.com/yuexujiang/DeepDom.
7
IMPLEMENTING AND EVALUATING A GAUSSIAN MIXTURE FRAMEWORK FOR IDENTIFYING GENE FUNCTION FROM TNSEQ DATA
Kevin Li1, Rachel Chen2, William Lindsey3, Aaron Best4, Matthew DeJongh4, Christopher Henry5, Nathan Tintle3
1Columbia University, 2North Carolina State University, 3Dordt College, 4Hope College, 5Argonne Laboratory
Li, Kevin
The rapid acceleration of microbial genome sequencing increases opportunities to understand bacterial gene function. Unfortunately, only a small proportion of genes have been studied. Recently, TnSeq has been proposed as a cost-effective, highly reliable approach to predict gene functions as a response to changes in a cell’s fitness before and after genomic changes. However, major questions remain about how to best determine whether an observed quantitative change in fitness represents a meaningful change. To address this limitation, we develop a Gaussian mixture model framework for classifying gene function from TnSeq experiments. In order to implement the mixture model, we present the Expectation-Maximization algorithm and a hierarchical Bayesian model sampled using Stan’s Hamiltonian Monte-Carlo sampler. We compare these implementations against the frequentist method used in current TnSeq literature. From simulations and real data produced by E. coli TnSeq experiments, we show that the Bayesian implementation of the Gaussian mixture framework provides the most consistent classification results.
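As an illustration of the Expectation-Maximization idea named in this abstract, the following is a minimal sketch and not the authors' implementation. It assumes a one-dimensional Gaussian mixture fitted to made-up fitness-change values; the function names, starting values, and data are all hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def em_gmm(xs, means, sds, weights, n_iter=100):
    """Fit a 1-D Gaussian mixture by EM (hypothetical helper, toy-scale only)."""
    means, sds, weights = list(means), list(sds), list(weights)
    k = len(means)
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation.
        resp = []
        for x in xs:
            dens = [weights[j] * normal_pdf(x, means[j], sds[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and spread of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sds[j] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsing component
    return means, sds, weights, resp

# Made-up fitness changes: four genes with a clear drop, six near zero.
xs = [-2.1, -1.9, -2.0, -2.2, 0.1, -0.1, 0.0, 0.05, 0.2, -0.05]
means, sds, weights, resp = em_gmm(xs, [-1.0, 0.5], [1.0, 1.0], [0.5, 0.5])
labels = [max(range(2), key=lambda j: r[j]) for r in resp]
```

Each gene is assigned to the component with the highest posterior probability. A hierarchical Bayesian fit, as described in the abstract, would replace these point estimates with posterior samples.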
8
RES2S2AM: DEEP RESIDUAL NETWORK-BASED MODEL FOR IDENTIFYING FUNCTIONAL NONCODING SNPS IN TRAIT-ASSOCIATED REGIONS
Zheng Liu, Yao Yao, Qi Wei, Benjamin Weeder, Stephen A. Ramsey
Oregon State University
Liu, Zheng
Noncoding single nucleotide polymorphisms (SNPs) and their target genes are important components of the heritability of diseases and other polygenic traits. Identifying these SNPs and target genes could potentially reveal new molecular mechanisms and advance precision medicine. For polygenic traits, genome-wide association studies (GWAS) are preferred tools for identifying trait-associated regions. However, identifying causal noncoding SNPs within such regions is a difficult problem in computational biology. The DNA sequence context of a noncoding SNP is well-established as an important source of information that is beneficial for discriminating functional from nonfunctional noncoding SNPs. We describe the use of a deep residual network (ResNet)-based model, entitled Res2s2aM, that fuses flanking DNA sequence information with additional SNP annotation information to discriminate functional from nonfunctional noncoding SNPs. On a ground-truth set of disease-associated SNPs compiled from the Genome-wide Repository of Associations between SNPs and Phenotypes (GRASP) database, Res2s2aM improves the prediction accuracy of functional SNPs significantly in comparison to models based only on sequence information as well as a leading tool for post-GWAS noncoding SNP prioritization (RegulomeDB).
9
BI-DIRECTIONAL RECURRENT NEURAL NETWORK MODELS FOR GEOGRAPHIC LOCATION EXTRACTION IN BIOMEDICAL LITERATURE
Arjun Magge1, Davy Weissenbacher2, Abeed Sarker2, Matthew Scotch1, Graciela Gonzalez-Hernandez2
1Arizona State University, 2University of Pennsylvania
Magge, Arjun
Phylogeography research involving virus spread and tree reconstruction relies on accurate geographic locations of infected hosts. The insufficient level of geographic information in nucleotide sequence repositories such as GenBank motivates the use of natural language processing methods for extracting geographic location names (toponyms) from the scientific article associated with the sequence, and disambiguating the locations to their co-ordinates. In this paper, we present an extensive study of multiple recurrent neural network architectures for the task of extracting geographic locations and their effective contribution to the disambiguation task using population heuristics. The methods presented in this paper achieve a strict detection F-1 score of 0.94, disambiguation accuracy of 91% and an overall resolution F-1 score of 0.88, which are significantly higher than those of previously developed methods, improving our capability to find the location of infected hosts and enrich metadata information.
10
COMPUTATIONAL KIR COPY NUMBER DISCOVERY REVEALS INTERACTION BETWEEN INHIBITORY RECEPTOR BURDEN AND SURVIVAL
Rachel M. Pyke1, Raphael Genolet2, Alexandre Harari2, George Coukos2, David Gfeller2, Hannah Carter1
1University of California - San Diego, 2Ludwig Institute for Cancer Research - University of Lausanne
Pyke, Rachel M.
Natural killer (NK) cells have increasingly become a target of interest for immunotherapies. NK cells express killer immunoglobulin-like receptors (KIRs), which play a vital role in immune response to tumors by detecting cellular abnormalities. The genomic region encoding the 16 KIR genes displays high polymorphic variability in human populations, making it difficult to resolve individual genotypes based on next generation sequencing data. As a result, the impact of polymorphic KIR variation on cancer phenotypes has been understudied. Currently, labor-intensive, experimental techniques are used to determine an individual’s KIR gene copy number profile. Here, we develop an algorithm to determine the germline copy number of KIR genes from whole exome sequencing data and apply it to a cohort of nearly 5000 cancer patients. We use a k-mer based approach to capture sequences unique to specific genes, count their occurrences in the set of reads derived from an individual and compare the individual’s k-mer distribution to that of the population. Copy number results demonstrate high concordance with population copy number expectations. Our method reveals that the burden of inhibitory KIR genes is associated with survival in two tumor types, highlighting the potential importance of KIR variation in understanding tumor development and response to immunotherapy.
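The k-mer counting step described in this abstract can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the authors' algorithm: the gene sequences, reads, the value of k, and the helper names are hypothetical, and comparison to the population is reduced here to a ratio against the population median.

```python
from collections import Counter
from statistics import median

def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def gene_unique_kmers(genes, k):
    """Map each gene name to the k-mers that occur in that gene and in no other."""
    owners = {}
    for name, seq in genes.items():
        for km in set(kmers(seq, k)):
            owners.setdefault(km, set()).add(name)
    unique = {name: set() for name in genes}
    for km, names in owners.items():
        if len(names) == 1:
            unique[next(iter(names))].add(km)
    return unique

def count_unique_kmers(reads, unique, k):
    """Count, per gene, how many k-mers in the reads match that gene's unique k-mers."""
    lookup = {km: name for name, kms in unique.items() for km in kms}
    counts = Counter({name: 0 for name in unique})
    for read in reads:
        for km in kmers(read, k):
            if km in lookup:
                counts[lookup[km]] += 1
    return counts

def relative_to_population(individual_count, population_counts):
    """Ratio of one individual's count to the population median (assumed normalization)."""
    return individual_count / median(population_counts)

# Toy sequences (hypothetical); real KIR genes are far longer and more similar.
genes = {"geneA": "ACGTACGGA", "geneB": "ACGTTTTCA"}
unique = gene_unique_kmers(genes, 4)
counts = count_unique_kmers(["GTACGG", "GTACGG", "CGTTTT"], unique, 4)
ratio = relative_to_population(counts["geneA"], [3, 3, 6, 3])
```

In this toy, a ratio near 2 suggests roughly twice the typical signal for geneA. A real pipeline would also need to account for sequencing depth and capture efficiency, which this sketch ignores.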
11
SEMANTIC WORKFLOWS FOR BENCHMARK CHALLENGES: ENHANCING COMPARABILITY, REUSABILITY AND REPRODUCIBILITY
Arunima Srivastava1, Ravali Adusumilli2, Hunter Boyce2, Daniel Garijo3, Varun Ratnakar3, Rajiv Mayani3, Thomas Yu4, Raghu Machiraju1, Yolanda Gil3, Parag Mallick2
1The Ohio State University, 2Stanford University, 3University of Southern California, 4Sage Bionetworks
Srivastava, Arunima
Benchmark challenges, such as the Critical Assessment of Structure Prediction (CASP) and Dialogue for Reverse Engineering Assessments and Methods (DREAM), have been instrumental in driving the development of bioinformatics methods. Typically, challenges are posted, and then competitors perform a prediction based upon blinded test data. Challengers then submit their answers to a central server where they are scored. Recent efforts to automate these challenges have been enabled by systems in which challengers submit Docker containers, a unit of software that packages up code and all of its dependencies, to be run on the cloud. Despite their incredible value for providing an unbiased test-bed for the bioinformatics community, there remain opportunities to further enhance the potential impact of benchmark challenges. Specifically, current approaches only evaluate end-to-end performance; it is nearly impossible to directly compare methodologies or parameters. Furthermore, the scientific community cannot easily reuse challengers’ approaches, due to lack of specifics, ambiguity in tools and parameters as well as problems in sharing and maintenance. Lastly, the intuition behind why particular steps are used is not captured, as the proposed workflows are not explicitly defined, making it cumbersome to understand the flow and utilization of data. Here we introduce an approach to overcome these limitations based upon the WINGS semantic workflow system. Specifically, WINGS enables researchers to submit complete semantic workflows as challenge submissions. By submitting entries as workflows, it then becomes possible to compare not just the results and performance of a challenger, but also the methodology employed. This is particularly important when dozens of challenge entries may use nearly identical tools, but with only subtle changes in parameters (and radical differences in results). WINGS uses a component driven workflow design and offers intelligent parameter and data selection by reasoning about data characteristics. This proves to be especially critical in bioinformatics workflows where using default or incorrect parameter values is prone to drastically altering results. Different challenge entries may be readily compared through the use of abstract workflows, which also facilitate reuse. WINGS is housed on a cloud based setup, which stores data, dependencies and workflows for easy sharing and utility. It also has the ability to scale workflow executions using distributed computing through the Pegasus workflow execution system. We demonstrate the application of this architecture to the DREAM proteogenomic challenge.
12
REMOVING CONFOUNDING FACTORS ASSOCIATED WEIGHTS IN DEEP NEURAL NETWORKS IMPROVES THE PREDICTION ACCURACY FOR HEALTHCARE APPLICATIONS
Haohan Wang1, Zhenglin Wu2, Eric P. Xing3
1Carnegie Mellon University, 2University of Illinois Urbana-Champaign, 3Carnegie Mellon University
Wang, Haohan
The proliferation of healthcare data has brought opportunities to apply data-driven approaches, such as machine learning methods, to assist diagnosis. Recently, many deep learning methods have shown impressive success in predicting disease status from raw input data. However, the "black-box" nature of deep learning and the high-reliability requirement of biomedical applications have created new challenges regarding the existence of confounding factors. In this paper, with a brief argument that inappropriate handling of confounding factors will lead to models' sub-optimal performance in real-world applications, we present an efficient method that can remove the influences of confounding factors such as age or gender to improve the across-cohort prediction accuracy of neural networks. One distinct advantage of our method is that it only requires minimal changes to the baseline model's architecture, so that it can be plugged into most existing neural networks. We conduct experiments across CT-scan, MRA, and EEG brainwave data with convolutional neural networks and LSTM to verify the efficiency of our method.
13
PRECISION MEDICINE: IMPROVING HEALTH THROUGH HIGH-RESOLUTION ANALYSIS OF PERSONAL DATA
PROCEEDINGS PAPERS WITH ORAL PRESENTATIONS
14
AN OPTIMAL POLICY FOR PATIENT LABORATORY TESTS IN INTENSIVE CARE UNITS
Li-Fang Cheng, Niranjani Prasad, Barbara E. Engelhardt
Princeton University
Prasad, Niranjani
Laboratory testing is an integral tool in the management of patient care in hospitals, particularly in intensive care units (ICUs). There exists an inherent trade-off in the selection and timing of lab tests between considerations of the expected utility in clinical decision-making of a given test at a specific time, and the associated cost or risk it poses to the patient. In this work, we introduce a framework that learns policies for ordering lab tests that optimize for this trade-off. Our approach uses batch off-policy reinforcement learning with a composite reward function based on clinical imperatives, applied to data that include examples of clinicians ordering labs for patients. To this end, we develop and extend principles of Pareto optimality to improve the selection of actions based on multiple reward function components while respecting typical procedural considerations and prioritization of clinical goals in the ICU. Our experiments show that we can estimate a policy that reduces the frequency of lab tests and optimizes timing to minimize information redundancy. We also find that the estimated policies typically suggest ordering lab tests well ahead of critical onsets (such as mechanical ventilation or dialysis) that depend on the lab results. We evaluate our approach by quantifying how these policies may initiate earlier onset of treatment.
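To make the notion of Pareto optimality over multiple reward components concrete, here is a small hedged sketch. It is not the authors' method: the action names and the two reward components (information gained, and negated cost to the patient) are invented for illustration, and the sketch shows only the basic dominance filter rather than the extended principles the abstract refers to.

```python
def dominates(a, b):
    """True if reward vector a is at least as good as b in every component
    and strictly better in at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_optimal_actions(action_values):
    """Return the actions whose reward vectors no other action dominates."""
    return [
        action
        for action, value in action_values.items()
        if not any(
            dominates(other_value, value)
            for other, other_value in action_values.items()
            if other != action
        )
    ]

# Hypothetical estimated values: (information gain, negative cost).
action_values = {
    "order_now": (0.8, -0.5),
    "wait": (0.3, 0.0),
    "order_twice": (0.7, -0.9),
}
candidates = pareto_optimal_actions(action_values)
```

Here "order_twice" is dropped because "order_now" is better on both components, while "order_now" and "wait" remain because each wins on one component. Choosing between the remaining candidates would require a further prioritization rule.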
15
CROWDVARIANT: A CROWDSOURCING APPROACH TO CLASSIFY COPY NUMBER VARIANTS
Peyton Greenside1, Justin Zook2, Marc Salit3, Ryan Poplin4, Madeleine Cule5, Mark DePristo4
1Stanford University, 2National Institute of Standards and Technology (NIST), 3National Institute of Standards and Technology (NIST) / Joint Initiative for Metrology in Biology (JIMB), 4Google Inc. / Verily Life Sciences, 5Calico / Verily Life Sciences
Greenside, Peyton
Copy number variants (CNVs) are an important type of genetic variation that play a causal role in many diseases. The ability to identify high quality CNVs is of substantial clinical relevance. However, CNVs are notoriously difficult to identify accurately from array-based methods and next-generation sequencing (NGS) data, particularly for small (<10 kbp) CNVs. Manual curation by experts widely remains the gold standard but cannot scale with the pace of sequencing, particularly in fast-growing clinical applications. We present the first proof-of-principle study demonstrating high throughput manual curation of putative CNVs by non-experts. We developed a crowdsourcing framework, called CrowdVariant, that leverages Google's high-throughput crowdsourcing platform to create a high confidence set of deletions for NA24385 (NIST HG002/RM 8391), an Ashkenazim reference sample developed in partnership with the Genome In A Bottle (GIAB) Consortium. We show that non-experts tend to agree both with each other and with experts on putative CNVs. We show that crowdsourced non-expert classifications can be used to accurately assign copy number status to putative CNV calls and identify 1,781 high confidence deletions in a reference sample. Multiple lines of evidence suggest these calls are a substantial improvement over existing CNV callsets and can also be useful in benchmarking and improving CNV calling algorithms. Our crowdsourcing methodology takes the first step toward showing the clinical potential for manual curation of CNVs at scale and can further guide other crowdsourcing genomics applications.
16
A REPOSITORY OF MICROBIAL MARKER GENES RELATED TO HUMAN HEALTH AND DISEASES FOR HOST PHENOTYPE PREDICTION USING MICROBIOME DATA
Wontack Han, Yuzhen Ye
Indiana University
Han, Wontack
Microbiome research is going through an evolutionary transition from focusing on the characterization of reference microbiomes associated with different environments/hosts to translational applications, including using the microbiome for disease diagnosis, improving the efficacy of cancer treatments, and prevention of diseases (e.g., using probiotics). Microbial markers have been identified from microbiome data derived from cohorts of patients with different diseases, treatment responsiveness, etc., and often predictors based on these markers were built for predicting host phenotype given a microbiome dataset (e.g., to predict if a person has type 2 diabetes given his or her microbiome data). Unfortunately, these microbial markers and predictors are often not published and so are not reusable by others. In this paper, we report the curation of a repository of microbial marker genes and predictors built from these markers for microbiome-based prediction of host phenotype, and a computational pipeline called Mi2P (from Microbiome to Phenotype) for using the repository. As an initial effort, we focus on microbial marker genes related to two diseases, type 2 diabetes and liver cirrhosis, and immunotherapy efficacy for two types of cancer, non-small-cell lung cancer (NSCLC) and renal cell carcinoma (RCC). We characterized the marker genes from metagenomic data using our recently developed subtractive assembly approach. We showed that predictors built from these microbial marker genes can provide fast and reasonably accurate prediction of host phenotype given microbiome data. As understanding and making use of microbiome data (our second genome) is becoming vital as we move forward in this age of precision health and precision medicine, we believe that such a repository will be useful for enabling translational applications of microbiome data.
17
AICM: A GENUINE FRAMEWORK FOR CORRECTING INCONSISTENCY BETWEEN LARGE PHARMACOGENOMICS DATASETS
Zhiyue Tom Hu1, Yuting Ye1, Patrick A. Newbury2, Haiyan Huang2,3,4, Bin Chen5
1University of California Berkeley, Department of Biostatistics; 2University of California Berkeley, Department of Pediatrics and Human Development; 3Michigan State University, Department of Statistics; 4University of California Berkeley, Department of Pharmacology and Toxicology; 5Michigan State University
Hu, Zhiyue
The inconsistency of open pharmacogenomics datasets produced by different studies limits the usage of such datasets in many tasks, such as biomarker discovery. Investigation of multiple pharmacogenomics datasets confirmed that the pairwise sensitivity data correlation between drugs, or rows, across different studies (drug-wise) is relatively low, while the pairwise sensitivity data correlation between cell-lines, or columns, across different studies (cell-wise) is considerably strong. This common interesting observation across multiple pharmacogenomics datasets suggests the existence of subtle consistency among the different studies (i.e., strong cell-wise correlation). However, significant noise is also present (i.e., weak drug-wise correlation) and has prevented researchers from comfortably using the data directly. Motivated by this observation, we propose a novel framework for addressing the inconsistency between large-scale pharmacogenomics datasets. Our method can significantly boost the drug-wise correlation and can be easily applied to re-summarized and normalized datasets proposed by others. We also investigate our algorithm based on many different criteria to demonstrate that the corrected datasets are not only consistent, but also biologically meaningful. Eventually, we propose to extend our main algorithm into a framework, so that in the future when more datasets become publicly available, our framework can hopefully offer "ground-truth" guidance for reference.
18
INTEGRATING RNA EXPRESSION AND VISUAL FEATURES FOR IMMUNE INFILTRATE PREDICTION
Derek Reiman1, Lingdao Sha1, Irvin Ho1, Timothy Tan2, Denise Lau1, Aly A. Khan3
1Tempus Labs, 2Northwestern University, 3Toyota Technological Institute at Chicago
Khan, Aly
Patient responses to cancer immunotherapy are shaped by their unique genomic landscape and tumor microenvironment. Clinical advances in immunotherapy are changing the treatment landscape by enhancing a patient's immune response to eliminate cancer cells. While this provides potentially beneficial treatment options for many patients, only a minority of these patients respond to immunotherapy. In this work, we examined RNA-seq data and digital pathology images from individual patient tumors to more accurately characterize the tumor-immune microenvironment. Several studies implicate an inflamed microenvironment and increased percentage of tumor infiltrating immune cells with better response to specific immunotherapies in certain cancer types. We developed NEXT (Neural-based models for integrating gene EXpression and visual Texture features) to more accurately model immune infiltration in solid tumors. To demonstrate the utility of the NEXT framework, we predicted immune infiltrates across four different cancer types and evaluated our predictions against expert pathology review. Our analyses demonstrate that integration of imaging features improves prediction of the immune infiltrate. Of note, this effect was preferentially observed for B cells and CD8 T cells. In sum, our work effectively integrates both RNA-seq and imaging data in a clinical setting and provides a more reliable and accurate prediction of the immune composition in individual patient tumors.
19
OUTGROUP MACHINE LEARNING APPROACH IDENTIFIES SINGLE NUCLEOTIDE VARIANTS IN NONCODING DNA ASSOCIATED WITH AUTISM SPECTRUM DISORDER
Maya Varma, Kelley Marie Paskov, Jae-Yoon Jung, Brianna Sierra Chrisman, Nate Tyler Stockham, Peter Yigitcan Washington, Dennis Paul Wall
Stanford University
Varma, Maya
Autism spectrum disorder (ASD) is a heritable neurodevelopmental disorder affecting 1 in 59 children. While noncoding genetic variation has been shown to play a major role in many complex disorders, the contribution of these regions to ASD susceptibility remains unclear. Genetic analyses of ASD typically use unaffected family members as controls; however, we hypothesize that this method does not effectively elevate variant signal in the noncoding region due to family members having subclinical phenotypes arising from common genetic mechanisms. In this study, we use a separate, unrelated outgroup of individuals with progressive supranuclear palsy (PSP), a neurodegenerative condition with no known etiological overlap with ASD, as a control population. We use whole genome sequencing data from a large cohort of 2182 children with ASD and 379 controls with PSP, sequenced at the same facility with the same machines and variant calling pipeline, in order to investigate the role of noncoding variation in the ASD phenotype. We analyze seven major types of noncoding variants: microRNAs, human accelerated regions, hypersensitive sites, transcription factor binding sites, DNA repeat sequences, simple repeat sequences, and CpG islands. After identifying and removing batch effects between the two groups, we trained an l1-regularized logistic regression classifier to predict ASD status from each set of variants. The classifier trained on simple repeat sequences performed well on a held-out test set (AUC-ROC = 0.960); this classifier was also able to differentiate ASD cases from controls when applied to a completely independent dataset (AUC-ROC = 0.960). This suggests that variation in simple repeat regions is predictive of the ASD phenotype and may contribute to ASD risk. Our results show the importance of the noncoding region and the utility of independent control groups in effectively linking genetic variation to disease phenotype for complex disorders.
20
PRECISION DRUG REPURPOSING VIA CONVERGENT EQTL-BASED MOLECULES AND PATHWAY TARGETING INDEPENDENT DISEASE-ASSOCIATED POLYMORPHISMS
Francesca Vitali1,2, Joanne Berghout1,2,3, Jungwei Fan1,2, Jianrong Li1, Qike Li1, Haiquan Li1,2,4, Yves A. Lussier1,2,3,5
1Center for Biomedical Informatics and Biostatistics (CB2) of The University of Arizona, 2Department of Medicine COM-T of The University of Arizona, 3The Center for Applied Genetics and Genomics in Medicine of The University of Arizona, 4Department of Biosystems Engineering of The University of Arizona, 5UA Cancer Center UA Health Science (UAHS) of The University of Arizona
Vitali, Francesca
Repurposing existing drugs for new therapeutic indications can improve success rates and streamline development. Use of large-scale biomedical data repositories, including eQTL regulatory relationships and genome-wide disease risk associations, offers opportunities to propose novel indications for drugs targeting common or convergent molecular candidates associated to two or more diseases. This proposed novel computational approach scales across 262 complex diseases, building a multi-partite hierarchical network integrating (i) GWAS-derived SNP-to-disease associations, (ii) eQTL-derived SNP-to-eGene associations incorporating both cis- and trans-relationships from 19 tissues, (iii) protein target-to-drug, and (iv) drug-to-disease indications with (v) Gene Ontology-based information theoretic semantic (ITS) similarity calculated between protein target functions. Our hypothesis is that if two diseases are associated to a common or functionally similar eGene, and a drug targeting that eGene/protein in one disease exists, the second disease becomes a potential repurposing indication. To explore this, all possible pairs of independently segregating GWAS-derived SNPs were generated, and a statistical network of similarity within each SNP-SNP pair was calculated according to scale-free overrepresentation of convergent biological processes activity in regulated eGenes (ITS_eGENE-eGENE) and scale-free overrepresentation of common eGene targets between the two SNPs (ITS_SNP-SNP). Significance of ITS_SNP-SNP was conservatively estimated using empirical scale-free permutation resampling keeping the node-degree constant for each molecule in each permutation. We identified 26 new drug repurposing indication candidates spanning 89 GWAS diseases, including a potential repurposing of the calcium-channel blocker Verapamil from coronary disease to gout. Predictions from our approach are compared to known drug indications using DrugBank as a gold standard (odds ratio = 13.1, p-value = 2.49 x 10^-8). Because of specific disease-SNP associations to candidate drug targets, the proposed method provides evidence for future precision drug repositioning to a patient’s specific polymorphisms.
21
DETECTING POTENTIAL PLEIOTROPY ACROSS CARDIOVASCULAR AND NEUROLOGICAL DISEASES USING UNIVARIATE, BIVARIATE, AND MULTIVARIATE METHODS ON 43,870 INDIVIDUALS FROM THE EMERGE NETWORK
Xinyuan Zhang1, Yogasudha Veturi1, Shefali S. Verma1, William Bone1, Anurag Verma1, Anastasia M. Lucas1, Scott Hebbring2, Joshua C. Denny3, Ian Stanaway4, Gail P. Jarvik4, David Crosslin4, Eric B. Larson5, Laura Rasmussen-Torvik6, Sarah A. Pendergrass7, Jordan W. Smoller8, Hakon Hakonarson9, Patrick Sleiman9, Chunhua Weng10, David Fasel10, Wei-Qi Wei3, Iftikhar Kullo11, Daniel Schaid11, Wendy K. Chung10, Marylyn D. Ritchie1
1University of Pennsylvania, 2Marshfield Clinic, 3Vanderbilt University, 4University of Washington, 5Kaiser Permanente Washington Health Research Institute, 6Northwestern University, 7Geisinger Health System, 8Massachusetts General Hospital, 9Children's Hospital of Philadelphia, 10Columbia University, 11Mayo Clinic
Zhang, Xinyuan
The link between cardiovascular diseases and neurological disorders has been widely observed in the aging population. Disease prevention and treatment rely on understanding the potential genetic nexus of multiple diseases in these categories. In this study, we were interested in detecting pleiotropy, or the phenomenon in which a genetic variant influences more than one phenotype. Marker-phenotype association approaches can be grouped into univariate, bivariate, and multivariate categories based on the number of phenotypes considered at one time. Here we applied one statistical method per category followed by an eQTL colocalization analysis to identify potential pleiotropic variants that contribute to the link between cardiovascular and neurological diseases. We performed our analyses on ~530,000 common SNPs coupled with 65 electronic health record (EHR)-based phenotypes in 43,870 unrelated European adults from the Electronic Medical Records and Genomics (eMERGE) network. There were 31 variants identified by all three methods that showed significant associations across late onset cardiac- and neurologic-diseases. We further investigated functional implications of gene expression on the detected “lead SNPs” via colocalization analysis, providing a deeper understanding of the discovered associations. In summary, we present the framework and landscape for detecting potential pleiotropy using univariate, bivariate, multivariate, and colocalization methods. Further exploration of these potentially pleiotropic genetic variants will work toward understanding disease causing mechanisms across cardiovascular and neurological diseases and may assist in considering disease prevention as well as drug repositioning in future research.
22
SINGLE CELL ANALYSIS – WHAT IS THE FUTURE?
PROCEEDINGS PAPERS WITH ORAL PRESENTATIONS
23
LISA: ACCURATE RECONSTRUCTION OF CELL TRAJECTORY AND PSEUDO-TIME FOR MASSIVE SINGLE CELL RNA-SEQ DATA
Yang Chen1, Yuping Zhang2, Zhengqing Ouyang1
1The Jackson Laboratory for Genomic Medicine, 2University of Connecticut
Ouyang, Zhengqing
Cell trajectory reconstruction based on single cell RNA sequencing is important for obtaining the landscape of different cell types and discovering cell fate transitions. Despite intense effort, analyzing massive single cell RNA-seq datasets is still challenging. We propose a new method named Landmark Isomap for Single-cell Analysis (LISA). LISA is an unsupervised approach to build cell trajectory and compute pseudo-time in the isometric embedding based on geodesic distances. The advantages of LISA include: (1) It utilizes k-nearest-neighbor graph and hierarchical clustering to identify cell clusters, peaks and valleys in low-dimension representation of the data; (2) Based on Landmark Isomap, it constructs the main geometric structure of cell lineages; (3) It projects cells to the edges of the main cell trajectory to generate the global pseudo-time. Assessments on simulated and real datasets demonstrate the advantages of LISA on cell trajectory and pseudo-time reconstruction compared to Monocle2 and TSCAN. LISA is accurate, fast, and requires less memory, allowing its application to massive single cell datasets generated from current experimental platforms.
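The geodesic-distance idea behind Isomap-style pseudo-time can be sketched as below. This is a generic illustration and not LISA itself: it omits landmark selection, clustering, and projection onto trajectory edges. The points, the choice of k, the root cell, and the function names are assumptions made for the example, and `math.dist` requires Python 3.8 or later.

```python
import heapq
import math

def knn_graph(points, k):
    """Build a symmetric k-nearest-neighbor graph weighted by Euclidean distance."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        nearest = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        for d, j in nearest:
            graph[i][j] = d
            graph[j][i] = d  # keep the graph undirected
    return graph

def geodesic_distances(graph, root):
    """Shortest-path (geodesic) distance from root to each reachable node, via Dijkstra."""
    dist = {root: 0.0}
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical cells in a 2-D embedding, lying along a bent path.
cells = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1)]
graph = knn_graph(cells, k=2)
dist = geodesic_distances(graph, root=0)
ordering = sorted(dist, key=dist.get)  # cells ordered by pseudo-time from the root
```

Distances measured along the graph follow the shape of the data rather than cutting across it, which is why geodesic distance is a natural basis for ordering cells along a trajectory.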
24
PARAMETER TUNING IS A KEY PART OF DIMENSIONALITY REDUCTION VIA DEEP VARIATIONAL AUTOENCODERS FOR SINGLE CELL RNA TRANSCRIPTOMICS
Qiwen Hu, Casey S. Greene
University of Pennsylvania
Hu, Qiwen
Single-cell RNA sequencing (scRNA-seq) is a powerful tool to profile the transcriptomes of a large number of individual cells at a high resolution. These data usually contain measurements of gene expression for many genes in thousands or tens of thousands of cells, though some datasets now reach the million-cell mark. Projecting high-dimensional scRNA-seq data into a low dimensional space aids downstream analysis and data visualization. Many recent preprints accomplish this using variational autoencoders (VAE), generative models that learn underlying structure of data by compressing it into a constrained, low dimensional space. The low dimensional spaces generated by VAEs have revealed complex patterns and novel biological signals from large-scale gene expression data and drug response predictions. Here, we evaluate a simple VAE approach for gene expression data, Tybalt, by training and measuring its performance on sets of simulated scRNA-seq data. We find a number of counter-intuitive performance features: for example, deeper neural networks can struggle when datasets contain more observations under some parameter configurations. We show that these methods are highly sensitive to parameter tuning: when tuned, the Tybalt model, which was not optimized for scRNA-seq data, outperforms other popular dimension reduction approaches: PCA, ZIFA, UMAP and t-SNE. On the other hand, without tuning, performance can also be remarkably poor on the same data. Our results should discourage authors and reviewers from relying on self-reported performance comparisons to evaluate the relative value of contributions in this area at this time. Instead, we recommend that attempts to compare or benchmark autoencoder methods for scRNA-seq data be performed by disinterested third parties or by methods developers only on unseen benchmark data that are provided to all participants simultaneously because the potential for performance differences due to unequal parameter tuning is so high.
25
TOPOLOGICAL METHODS FOR VISUALIZATION AND ANALYSIS OF HIGH DIMENSIONAL SINGLE-CELL RNA SEQUENCING DATA
Tongxin Wang1, Travis Johnson2, Jie Zhang3, Kun Huang4,5
1Department of Computer Science, Indiana University Bloomington; 2Department of Biomedical Informatics, Ohio State University; 3Department of Medical and Molecular Genetics, Indiana University School of Medicine; 4Department of Medicine, Indiana University School of Medicine; 5Regenstrief Institute
Wang, Tongxin
Single-cell RNA sequencing (scRNA-seq) techniques have been very powerful in analyzing heterogeneous cell populations and identifying cell types. Visualizing scRNA-seq data can help researchers effectively extract meaningful biological information and make new discoveries. While commonly used scRNA-seq visualization methods, such as t-SNE, are useful in detecting cell clusters, they often tear apart the intrinsic continuous structure in gene expression profiles. Topological Data Analysis (TDA) approaches like Mapper capture the shape of data by representing data as topological networks. TDA approaches are robust to noise and different platforms, while preserving the locality and data continuity. Moreover, instead of analyzing the whole dataset, Mapper allows researchers to explore biological meanings of specific pathways and genes by using different filter functions. In this paper, we applied Mapper to visualize scRNA-seq data. Our method can not only capture the clustering structure of cells, but also preserve the continuous gene expression topologies of cells. We demonstrated that by combining with gene co-expression network analysis, our method can reveal differential expression patterns of gene co-expression modules along the Mapper visualization.
26
WHEN BIOLOGY GETS PERSONAL: HIDDEN CHALLENGES OF PRIVACY AND ETHICS IN BIOLOGICAL BIG DATA
PROCEEDINGS PAPERS WITH ORAL PRESENTATIONS
27
LEVERAGING SUMMARY STATISTICS TO MAKE INFERENCES ABOUT COMPLEX PHENOTYPES IN LARGE BIOBANKS
Angela Gasdaska1, Derek Friend2, Rachel Chen3, Jason Westra4, Matthew Zawistowski5, William Lindsey4, Nathan Tintle4
1Emory University, 2University of Nevada Reno, 3North Carolina State University, 4Dordt College, 5University of Michigan Ann Arbor
Tintle, Nathan
As genetic sequencing becomes less expensive and datasets linking genetic data and medical records (e.g., Biobanks) become larger and more common, issues of data privacy and computational challenges become more necessary to address in order to realize the benefits of these datasets. One possibility for alleviating these issues is through the use of already-computed summary statistics (e.g., slopes and standard errors from a regression model of a phenotype on a genotype). If groups share summary statistics from their analyses of biobanks, many of the privacy issues and computational challenges concerning the access of these data could be bypassed. In this paper we explore the possibility of using summary statistics from simple linear models of phenotype on genotype in order to make inferences about more complex phenotypes (those that are derived from two or more simple phenotypes). We provide exact formulas for the slope, intercept, and standard error of the slope for linear regressions when combining phenotypes. Derived equations are validated via simulation and tested on a real dataset exploring the genetics of fatty acids.
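One part of this idea can be checked with a few lines of code. The sketch below is not the authors' derivation; it only uses the linearity of least squares, under the assumption that both simple phenotypes were regressed on the same genotypes in the same individuals. In that case the slope and intercept for a weighted sum of two phenotypes equal the same weighted sum of the per-phenotype slopes and intercepts. The standard error of the combined slope needs additional information and is not attempted here. The helper `ols`, the weights, and all data values are hypothetical.

```python
def ols(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical genotype dosages and two simple phenotypes for eight people.
genotype = [0, 1, 2, 0, 1, 2, 1, 0]
pheno_a = [1.0, 1.4, 2.1, 0.8, 1.6, 1.9, 1.5, 1.1]
pheno_b = [3.0, 2.7, 2.2, 3.1, 2.6, 2.4, 2.9, 3.2]

# Summary statistics that each group could share without individual-level data.
int_a, slope_a = ols(genotype, pheno_a)
int_b, slope_b = ols(genotype, pheno_b)

# Complex phenotype defined as a weighted combination of the simple ones.
w_a, w_b = 1.0, -0.5
slope_from_summary = w_a * slope_a + w_b * slope_b
int_from_summary = w_a * int_a + w_b * int_b

# Direct fit on individual-level data, used here only to check the shortcut.
combined = [w_a * a + w_b * b for a, b in zip(pheno_a, pheno_b)]
int_direct, slope_direct = ols(genotype, combined)
```

The two routes agree up to floating-point error, which is why sharing per-phenotype slopes and intercepts is enough to recover these two quantities for a linear combination.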
28
EVALUATION OF PATIENT RE-IDENTIFICATION USING LABORATORY TEST ORDERS AND MITIGATION VIA LATENT SPACE VARIABLES
Kipp W. Johnson1, Jessica K. De Freitas1, Benjamin S. Glicksberg1, Jason R. Bobe1, Joel T. Dudley2
1Institute for Next Generation Healthcare - Department of Genetics and Genomics Sciences - Icahn School of Medicine at Mount Sinai, 2Bakar Computational Health Sciences Institute, The University of California San Francisco
De Freitas, Jessica
A variety of clinical data abstracted and anonymized from electronic health records (EHR) are often used for research purposes. One consistent concern with this type of research is the risk for re-identification of patients from their anonymized data. Here, we use the EHR of 731,850 patients to demonstrate that the average patient is unique from all others 98.4% of the time simply by examining what laboratory tests have been ordered for them. By the time a patient has visited the hospital on two separate days, they are unique in 74.2% of cases. We further present a computational study to identify how accurately the records from a single day of care can be used to re-identify patients from a set of 99 other patients. We show that, given a single visit’s laboratory orders for a patient, we can re-identify the patient at least 25% of the time. Furthermore, we can place this patient among the top 10 most similar patients 47% of the time. Finally, we present a proof-of-concept technique using a variational autoencoder to encode laboratory results into a lower-dimensional latent space. We demonstrate that releasing latent-space encoded laboratory orders significantly improves privacy compared to releasing raw laboratory orders (<5% re-identification), while preserving information contained within the laboratory orders (AUC of >0.9 for recreating encoded values). Our findings potentially have consequences for the public release of anonymized laboratory tests to the biomedical research community. We wish to note that our findings do not imply that laboratory tests alone are personally identifiable; re-identification would require a threat actor to have an external source of laboratory values which are linked to personal identifiers to begin with.
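The notion of a patient being "unique" by their laboratory orders can be made concrete with a small sketch. It is an illustration under simple assumptions, not the study's pipeline: each patient is reduced to the set of tests ever ordered for them, and a patient counts as unique when no other patient has exactly the same set. The function name `fraction_unique`, the patient identifiers, and the test names are hypothetical.

```python
from collections import Counter


def fraction_unique(orders_by_patient):
    """Fraction of patients whose set of ordered tests is shared with no one else."""
    # Order of tests and repeated orders are ignored: only the set matters here.
    signatures = {
        pid: frozenset(tests) for pid, tests in orders_by_patient.items()
    }
    counts = Counter(signatures.values())
    unique = sum(1 for sig in signatures.values() if counts[sig] == 1)
    return unique / len(signatures)


# Hypothetical toy records: p1 and p2 share the same set of orders.
orders = {
    "p1": ["CBC", "BMP"],
    "p2": ["BMP", "CBC"],
    "p3": ["CBC", "lipid panel"],
    "p4": ["HbA1c"],
}
share_unique = fraction_unique(orders)
```

In this toy cohort two of the four patients have an order set that nobody else shares. With hundreds of distinct tests the number of possible sets grows very quickly, which is one intuition for why order sets can single out individuals.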
29
PROTECTING GENOMIC DATA PRIVACY WITH PROBABILISTIC MODELING
Sean Simmons1, Bonnie Berger2, Cenk Sahinalp3
1Broad Institute, 2MIT, 3Indiana University
Simmons, Sean
The proliferation of sequencing technologies in biomedical research has raised many new privacy concerns. These include concerns over the publication of aggregate data at a genomic scale (e.g. minor allele frequencies, regression coefficients). Methods such as differential privacy can overcome these concerns by providing strong privacy guarantees, but come at the cost of greatly perturbing the results of the analysis of interest. Here we investigate an alternative approach for achieving privacy-preserving aggregate genomic data sharing without the high cost to accuracy of differentially private methods. In particular, we demonstrate how other ideas from the statistical disclosure control literature (in particular, the idea of disclosure risk) can be applied to aggregate data to help ensure privacy. This is achieved by combining minimal amounts of perturbation with Bayesian statistics and Markov Chain Monte Carlo techniques. We test our technique on a GWAS dataset to demonstrate its utility in practice. An implementation is available at https://github.com/seanken/PrivMCMC.
30
PATTERN RECOGNITION IN BIOMEDICAL DATA: CHALLENGES IN PUTTING BIG DATA TO WORK
PROCEEDINGS PAPERS WITH POSTER PRESENTATIONS
31
SNPS2CHIP: LATENT FACTORS OF CHIP-SEQ TO INFER FUNCTIONS OF NON-CODING SNPS
Shankara Anand, Laurynas Kalesinskas, Craig Smail, Yosuke Tanigawa
Stanford University
Tanigawa, Yosuke
Genetic variations of the human genome are linked to many disease phenotypes. While whole-genome sequencing and genome-wide association studies (GWAS) have uncovered a number of genotype-phenotype associations, their functional interpretation remains challenging given most single nucleotide polymorphisms (SNPs) fall into the non-coding region of the genome. Advances in chromatin immunoprecipitation sequencing (ChIP-seq) have made large-scale repositories of epigenetic data available, allowing investigation of coordinated mechanisms of epigenetic markers and transcriptional regulation and their influence on biological function. To address this, we propose SNPs2ChIP, a method to infer biological functions of non-coding variants through unsupervised statistical learning methods applied to publicly-available epigenetic datasets. We systematically characterized latent factors by applying singular value decomposition to ChIP-seq tracks of lymphoblastoid cell lines, and annotated the biological function of each latent factor using the genomic region enrichment analysis tool. Using these annotated latent factors as reference, we developed SNPs2ChIP, a pipeline that takes genomic region(s) as an input, identifies the relevant latent factors with quantitative scores, and returns them along with their inferred functions. As a case study, we focused on systemic lupus erythematosus and demonstrated our method's ability to infer relevant biological function. We systematically applied SNPs2ChIP on publicly available datasets, including known GWAS associations from the GWAS catalogue and ChIP-seq peaks from a previously published study. Our approach to leverage latent patterns across genome-wide epigenetic datasets to infer the biological function will advance understanding of the genetics of human diseases by accelerating the interpretation of non-coding genomes.
32
DNA STEGANALYSIS USING DEEP RECURRENT NEURAL NETWORKS
Ho Bae1, Byunghan Lee2,3, Sunyoung Kwon2,4, Sungroh Yoon1,2,5
1Interdisciplinary Program in Bioinformatics, Seoul National University; 2Electrical and Computer Engineering, Seoul National University; 3Electronic and IT Media Engineering, Seoul National University of Science and Technology; 4Clova AI Research, NAVER Corp; 5ASRI and INMC, Seoul National University
Bae, Ho
Recent advances in next-generation sequencing technologies have facilitated the use of deoxyribonucleic acid (DNA) as a novel covert channel in steganography. There are various methods that exist in other domains to detect hidden messages in conventional covert channels. However, they have not been applied to DNA steganography. The current most common detection approaches, namely frequency analysis-based methods, often overlook important signals when directly applied to DNA steganography because those methods depend on the distribution of the number of sequence characters. To address this limitation, we propose a general sequence learning-based DNA steganalysis framework. The proposed approach learns the intrinsic distribution of coding and non-coding sequences and detects hidden messages by exploiting distribution variations after hiding these messages. Using deep recurrent neural networks (RNNs), our framework identifies the distribution variations by using the classification score to predict whether a sequence is a coding or non-coding sequence. We compare our proposed method to various existing methods and biological sequence analysis methods implemented on top of our framework. According to our experimental results, our approach delivers a robust detection performance compared to other tools.
33
LEARNING CONTEXTUAL HIERARCHICAL STRUCTURE OF MEDICAL CONCEPTS WITH POINCARÉ EMBEDDINGS TO CLARIFY PHENOTYPES
Brett K. Beaulieu-Jones, Isaac S. Kohane, Andrew L. Beam
Harvard Medical School
Beaulieu-Jones, Brett
Biomedical association studies are increasingly done using clinical concepts, and in particular diagnostic codes from clinical data repositories as phenotypes. Clinical concepts can be represented in a meaningful vector space using word embedding models. These embeddings allow for comparison between clinical concepts or for straightforward input to machine learning models. Using traditional approaches, good representations require high dimensionality, making downstream tasks such as visualization more difficult. We applied Poincaré embeddings in a 2-dimensional hyperbolic space to a large-scale administrative claims database and show performance comparable to 100-dimensional embeddings in a Euclidean space. We then examine disease relationships under different disease contexts to better understand potential phenotypes.
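For readers unfamiliar with hyperbolic embeddings, the distance used in the Poincaré ball model can be written in a few lines. The sketch below implements the standard formula d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))) for points strictly inside the unit ball. The function name `poincare_distance` and the example coordinates are assumptions for illustration; no claim is made about how the authors trained their embeddings.

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    denom = (1.0 - sq_norm_u) * (1.0 - sq_norm_v)
    return math.acosh(1.0 + 2.0 * sq_diff / denom)


# The same Euclidean gap (0.1) measured near the origin and near the boundary.
near_origin = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_edge = poincare_distance((0.8, 0.0), (0.9, 0.0))
```

Distances grow rapidly toward the boundary of the disk, so there is much more room near the edge than near the center. That is the property that lets tree-like hierarchies, such as diagnostic code systems, fit in very few dimensions.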
34
EXPLORING MICRORNA REGULATION OF CANCER WITH CONTEXT-AWARE DEEP CANCER CLASSIFIER
Blake Pyman, Alireza Sedghi, Shekoofeh Azizi, Kathrin Tyryshkin, Neil Renwick, Parvin Mousavi
Queen's University
Pyman, Blake
Background: MicroRNAs (miRNAs) are small, non-coding RNAs that regulate gene expression through post-transcriptional silencing. Differential expression observed in miRNAs, combined with advancements in deep learning (DL), has the potential to improve cancer classification by modelling non-linear miRNA-phenotype associations. We propose a novel miRNA-based deep cancer classifier (DCC) incorporating genomic and hierarchical tissue annotation, capable of accurately predicting the presence of cancer in a wide range of human tissues.
Methods: miRNA expression profiles were analyzed for 1746 neoplastic and 3871 normal samples, across 26 types of cancer involving six organ sub-structures and 68 cell types. miRNAs were ranked and filtered using a specificity score representing their information content in relation to neoplasticity, incorporating 3 levels of hierarchical biological annotation. A DL architecture composed of stacked autoencoders (AE) and a multi-layer perceptron (MLP) was trained to predict neoplasticity using 497 abundant and informative miRNAs. Additional DCCs were trained using expression of miRNA cistrons and sequence families, and combined as a diagnostic ensemble. Important miRNAs were identified using backpropagation, and analyzed in Cytoscape using iCTNet and BiNGO.
Results: Nested four-fold cross-validation was used to assess the performance of the DL model. The model achieved an accuracy, AUC/ROC, sensitivity, and specificity of 94.73%, 98.6%, 95.1%, and 94.3%, respectively.
Conclusion: Deep autoencoder networks are a powerful tool for modelling complex miRNA-phenotype associations in cancer. The proposed DCC improves classification accuracy by learning from the biological context of both samples and miRNAs, using anatomical and genomic annotation. Analyzing the deep structure of DCCs with backpropagation can also facilitate biological discovery, by performing gene ontology searches on the most highly significant features.
35
ESTIMATING CLASSIFICATION ACCURACY IN POSITIVE-UNLABELED LEARNING: CHARACTERIZATION AND CORRECTION STRATEGIES
Rashika Ramola, Shantanu Jain, Predrag Radivojac
Northeastern University
Ramola, Rashika
Accurately estimating performance accuracy of machine learning classifiers is of fundamental importance in biomedical research with potentially societal consequences upon the deployment of best-performing tools in everyday life. Although classification has been extensively studied over the past decades, there remain understudied problems when the training data violate the main statistical assumptions relied upon for accurate learning and model characterization. This particularly holds true in the open world setting where observations of a phenomenon generally guarantee its presence but the absence of such evidence cannot be interpreted as the evidence of its absence. Learning from such data is often referred to as positive-unlabeled learning, a form of semi-supervised learning where all labeled data belong to one (say, positive) class. To improve the best practices in the field, we here study the quality of estimated performance in positive-unlabeled learning in the biomedical domain. We provide evidence that such estimates can be wildly inaccurate, depending on the fraction of positive examples in the unlabeled data and the fraction of negative examples mislabeled as positives in the labeled data. We then present correction methods for four such measures and demonstrate that the knowledge or accurate estimates of class priors in the unlabeled data and noise in the labeled data are sufficient for the recovery of true classification performance. We provide theoretical support as well as empirical evidence for the efficacy of the new performance estimation methods.
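Why a class prior is enough to repair an estimate can be seen in the simplest case. The sketch below is not one of the paper's estimators. It assumes the labeled positives are free of noise, so the true positive rate measured on them is trustworthy, and that the fraction `alpha` of positives hidden in the unlabeled data is known. Treating unlabeled examples as negatives then yields an observed false positive rate equal to `alpha * tpr + (1 - alpha) * fpr`, which can be solved for the true `fpr`. The function name and all numbers are hypothetical.

```python
def corrected_fpr(observed_fpr, tpr, alpha):
    """Recover the true false positive rate from a positive-unlabeled estimate.

    observed_fpr: fraction of unlabeled examples predicted positive
    tpr: true positive rate measured on (assumed noise-free) labeled positives
    alpha: assumed known fraction of positives among the unlabeled examples
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    # observed_fpr = alpha * tpr + (1 - alpha) * fpr, solved for fpr.
    return (observed_fpr - alpha * tpr) / (1.0 - alpha)


# Hypothetical classifier with tpr = 0.9 and true fpr = 0.1, evaluated against
# unlabeled data in which 30% of examples are actually positive.
alpha = 0.3
observed = alpha * 0.9 + (1.0 - alpha) * 0.1
recovered = corrected_fpr(observed, tpr=0.9, alpha=alpha)
```

Here the uncorrected estimate (0.34) overstates the false positive rate more than threefold, and the correction returns the assumed true value of 0.1. Handling noise in the labeled set, which the abstract also addresses, would need a further term that this sketch leaves out.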
36
EXTRACTING ALLELIC READ COUNTS FROM 250,000 HUMAN SEQUENCING RUNS IN SEQUENCE READ ARCHIVE
Brian Tsui, Michelle Dow, Dylan Skola, Hannah Carter
Department of Medicine, University of California San Diego, 9500 Gilman Drive, San Diego, California 92093, USA
Tsui, Brian Y
The Sequence Read Archive (SRA) contains over one million publicly available sequencing runs from various studies using a variety of sequencing library strategies. These data inherently contain information about underlying genomic sequence variants which we exploit to extract allelic read counts on an unprecedented scale. We reprocessed over 250,000 human sequencing runs (>1000 TB worth of raw sequence data) into a single unified dataset of allelic read counts for nearly 300,000 variants of biomedical relevance curated by NCBI dbSNP, where germline variants were detected in a median of 912 sequencing runs, and somatic variants were detected in a median of 4,876 sequencing runs, suggesting that this dataset facilitates identification of sequencing runs that harbor variants of interest. Allelic read counts obtained using a targeted alignment were very similar to read counts obtained from whole-genome alignment. Analyzing allelic read count data for matched DNA and RNA samples from tumors, we find that RNA-seq can also recover variants identified by Whole Exome Sequencing (WXS), suggesting that reprocessed allelic read counts can support variant detection across different library strategies in SRA. This study provides a rich database of known human variants across SRA samples that can support future meta-analyses of human sequence variation.
37
AUTOMATIC HUMAN-LIKE MINING AND CONSTRUCTING RELIABLE GENETIC ASSOCIATION DATABASE WITH DEEP REINFORCEMENT LEARNING
Haohan Wang1, Xiang Liu2, Yifeng Tao1, Wenting Ye1, Qiao Jin3, William W. Cohen4, Eric P. Xing5
1Carnegie Mellon University, 2Chinese University of Hong Kong, 3Tsinghua University, 4Google AI, 5Petuum Inc
Wang, Haohan
The increasing amount of scientific literature in biological and biomedical science research has created a challenge in the continuous and reliable curation of the latest knowledge discovered, and automatic biomedical text-mining has been one of the answers to this challenge. In this paper, we aim to further improve the reliability of biomedical text-mining by training the system to directly simulate human behaviors such as querying PubMed, selecting articles from queried results, and reading selected articles for knowledge. We combine the efficiency of biomedical text-mining, the flexibility of deep reinforcement learning, and the massive amount of knowledge collected in UMLS into an integrative artificial intelligent reader that can automatically identify the authentic articles and effectively acquire the knowledge conveyed in the articles. We construct a system whose current primary task is to build the genetic association database between genes and human complex traits. Our contributions in this paper are three-fold: 1) We propose to improve the reliability of text-mining by building a system that can directly simulate the behavior of a researcher, and we develop corresponding methods, such as Bi-directional LSTM for text mining and Deep Q-Network for organizing behaviors. 2) We demonstrate the effectiveness of our system with an example in constructing a genetic association database. 3) We release our implementation as a generic framework for researchers in the community to conveniently construct other databases.
38
PRECISION MEDICINE: IMPROVING HEALTH THROUGH HIGH-RESOLUTION ANALYSIS OF PERSONAL DATA
PROCEEDINGS PAPER WITH POSTER PRESENTATION
39
INFLUENCE OF TISSUE CONTEXT ON GENE PRIORITIZATION FOR PREDICTED TRANSCRIPTOME-WIDE ASSOCIATION STUDIES
Binglan Li1, Yogasudha Veturi1, Yuki Bradford1, Shefali S. Verma1, Anurag Verma1, Anastasia M. Lucas1, David W. Haas2, Marylyn D. Ritchie1
1University of Pennsylvania, 2Vanderbilt University
Ritchie, Marylyn
Transcriptome-wide association studies (TWAS) have recently gained great attention due to their ability to prioritize complex trait-associated genes and promote potential therapeutics development for complex human diseases. TWAS integrates genotypic data with expression quantitative trait loci (eQTLs) to predict genetically regulated gene expression components and associates predictions with a trait of interest. As such, TWAS can prioritize genes whose differential expressions contribute to the trait of interest and provide mechanistic explanation of complex trait(s). Tissue-specific eQTL information grants TWAS the ability to perform association analysis on tissues whose gene expression profiles are otherwise hard to obtain, such as liver and heart. However, as eQTLs are tissue context-dependent, whether and how the tissue-specificity of eQTLs influences TWAS gene prioritization has not been fully investigated. In this study, we addressed this question by adopting two distinct TWAS methods, PrediXcan and UTMOST, which assume single tissue and integrative tissue effects of eQTLs, respectively. Thirty-eight baseline laboratory traits in 4,360 antiretroviral treatment-naïve individuals from the AIDS Clinical Trials Group (ACTG) studies comprised the input dataset for TWAS. We performed TWAS in a tissue-specific manner and obtained a total of 430 significant gene-trait associations (q-value<0.05) across multiple tissues. Single tissue-based analysis by PrediXcan contributed 116 of the 430 associations including 64 unique gene-trait pairs in 28 tissues. Integrative tissue-based analysis by UTMOST found the other 314 significant associations that include 50 unique gene-trait pairs across all 44 tissues. Both analyses were able to replicate some associations identified in past variant-based genome-wide association studies (GWAS), such as high-density lipoprotein (HDL) and CETP (PrediXcan, q-value=3.2e-16). Both analyses also identified novel associations. Moreover, single tissue-based and integrative tissue-based analysis shared 11 of 103 unique gene-trait pairs, for example, PSRC1-low-density lipoprotein (PrediXcan’s lowest q-value=8.5e-06; UTMOST’s lowest q-value=1.8e-05). This study suggests that single tissue-based analysis may have performed better at discovering gene-trait associations when combining results from all tissues. Integrative tissue-based analysis was better at prioritizing genes in multiple tissues and in trait-related tissue. Additional exploration is needed to confirm this conclusion. Finally, although single tissue-based and integrative tissue-based analysis shared significant novel discoveries, tissue context-dependency of eQTLs impacted TWAS gene prioritization. This study provides preliminary data to support continued work on tissue context-dependency of eQTL studies and TWAS.
40
SINGLE CELL ANALYSIS – WHAT IS THE FUTURE?
PROCEEDINGS PAPER WITH POSTER PRESENTATION
41
SHALLOW SPARSELY-CONNECTED AUTOENCODERS FOR GENE SET PROJECTION
Maxwell P. Gold, Alexander LeNail, Ernest Fraenkel
Massachusetts Institute of Technology
Gold, Maxwell
When analyzing biological data, it can be helpful to consider gene sets, or predefined groups of biologically related genes. Methods exist for identifying gene sets that are differential between conditions, but large public datasets from consortium projects and single-cell RNA-Sequencing have opened the door for gene set analysis using more sophisticated machine learning techniques, such as autoencoders and variational autoencoders. We present shallow sparsely-connected autoencoders (SSCAs) and variational autoencoders (SSCVAs) as tools for projecting gene-level data onto gene sets. We tested these approaches on single-cell RNA-Sequencing data from blood cells and on RNA-Sequencing data from breast cancer patients. Both SSCA and SSCVA can recover known biological features from these datasets and the SSCVA method often outperforms SSCA (and six existing gene set scoring algorithms) on classification and prediction tasks.
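The phrase "sparsely-connected" can be pictured as a membership mask on the encoder weights. The sketch below is an assumption about the general idea, not the authors' architecture or code: each hidden node stands for one gene set and may only receive input from genes in that set, so a weight outside the mask contributes nothing. The names `build_mask` and `project`, the gene labels, and the constant placeholder weights (standing in for trained values) are hypothetical.

```python
def build_mask(genes, gene_sets):
    """Return mask[s][g] = 1.0 if gene g belongs to gene set s, else 0.0."""
    return [
        [1.0 if gene in members else 0.0 for gene in genes]
        for members in gene_sets.values()
    ]


def project(expression, weights, mask):
    """One linear encoder layer in which masked-out weights cannot contribute."""
    return [
        sum(w * m * x for w, m, x in zip(weight_row, mask_row, expression))
        for weight_row, mask_row in zip(weights, mask)
    ]


genes = ["A", "B", "C", "D"]
gene_sets = {"set1": {"A", "B"}, "set2": {"C", "D"}}
mask = build_mask(genes, gene_sets)

# Placeholder weights; in a trained model these would be learned.
weights = [[1.0] * len(genes) for _ in gene_sets]

expression = [1.0, 2.0, 3.0, 4.0]  # one sample's gene-level values
scores = project(expression, weights, mask)
```

Because every hidden node is tied to a named gene set, the projected values can be read directly as per-gene-set scores, which is what makes this kind of layer interpretable.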
42
WHEN BIOLOGY GETS PERSONAL: HIDDEN CHALLENGES OF PRIVACY AND ETHICS IN BIOLOGICAL BIG DATA
PROCEEDINGS PAPER WITH POSTER PRESENTATION
43
IMPLEMENTING A UNIVERSAL INFORMED CONSENT PROCESS FOR THE ALL OF US RESEARCH PROGRAM
Megan Doerr1, Shira Grayson1, Sarah Moore1, Christine Suver1, John Wilbanks1, Jennifer Wagner2
1Sage Bionetworks, 2Center for Translational Bioethics & Health Care Policy, Geisinger
Doerr, Megan
The United States’ All of Us Research Program is a longitudinal research initiative with ambitious national recruitment goals, including of populations traditionally underrepresented in biomedical research, many of whom have high geographic mobility. The program has a distributed infrastructure, with key programmatic resources spread across the US. Given its planned duration and geographic reach both in terms of recruitment and programmatic resources, a diversity of state and territory laws might apply to the program over time as well as to the determination of participants’ rights. Here we present a listing and discussion of state and territory guidance and regulation of specific relevance to the program, and our approach to their incorporation within the program’s informed consent processes.
44
GENERAL
POSTER PRESENTATIONS
45
A CONVOLUTIONAL NEURAL NET PREDICTS BINDING PROPERTIES OF AN ANTIBODY LIBRARY
Rishi Bedi, Rachel Hovde, Jacob Glanville
Distributed Bio
Hovde, Rachel
Research by Glanville et al. described a method that enabled TCRs of the adaptive immune system to be clustered into specificity groups and allowed de novo design of TCRs with a particular specificity. In this study, we apply deep learning methods to perform characterization and engineering of antibodies. To generate enough data to address this question with machine learning methods, we created a computationally-optimized antibody library capable of generating thousands of high affinity hits against any antigen. By robotically panning 11 antigens in replicate against the library, we generated, sequenced, and validated a dataset of over 55,000 unique high affinity binders. To characterize the functional properties of this library, we train a convolutional neural network to predict the binding specificity of each clone. Our model outperforms alternative approaches and successfully predicts binding specificity in held-out, increasingly dissimilar test sets. Using the trained model to perform optimization on the input sequence, we generate characteristic class examples, as well as "fooling sequences" that represent the boundaries between pairs of binding specificities. We use the real-valued output of the convolutional and linear layers of the network as an embedding and demonstrate physically-meaningful clustering. These techniques let us assess the contribution of particular motifs to the lock-and-key interaction with the target antigen, and enable virtual "epitope binning" to distinguish antibodies in our library that bind similar epitopes. This enables future work in virtual mutagenesis, where we leverage these insights to generate antibodies that exhibit desirable binding properties.
46
CNVAR: A SOFTWARE TOOL FOR GENOTYPING CYP2D6 USING SHORT READ NEXT GENERATION SEQUENCING TECHNOLOGY
John Logan Black III MD1, Hugues Sicotte PhD1, Sandra E. Peterson1, Kimberley J. Harris1, Liewei Wang MD PhD1, Steven Scherer PhD2, Eric Boerwinkle PhD2, Richard A. Gibbs PhD2, Suzette J. Bielinski PhD1, Richard Weinshilboum MD1
1Mayo Clinic, 2Baylor College of Medicine
Black, John
Introduction: CYP2D6 is an important pharmacogene involved in the metabolism of many medications. CYP2D6 is known to have numerous copy number variations (CNV) including gene duplications/multiplications, gene deletion, and hybrid genes involving the pseudogene, CYP2D7. Software that enables the genotyping of CYP2D6 from short read next generation sequencing (NGS) is urgently needed to cost-effectively and accurately determine clinical CYP2D6 phenotypes.
Methods: Modelling of expected ratios for specific gene regions with and without CNV was done based upon the known configurations of the CYP2D locus. This data was used to generate the CNVAR software which analyzes vcf and bam files to determine variant allelic ratios and read depth for all exons and the promoters of the CYP2D6 and CYP2D7 genes after NGS. The software uses statistical methods to detect the CNVs and employs multiple quality metrics to determine the best fit for possible genotype solutions. It also detects named haplotypes plus any novel variants. CNVAR was previously validated against 500 samples with known genotypes determined by targeted genotyping and Sanger sequencing. Samples sequenced as part of the Mayo Clinic Center for Individualized Medicine's RIGHT 10K Study for Pharmacogenomics are now being analyzed. Sequencing was done at Baylor College of Medicine's Human Genome Sequencing Center using the reagent called PGx-seq and analysis of the CYP2D6 sequence results is being performed in the Personalized Genomics Laboratory at Mayo Clinic.
Results: 6921 samples have been analyzed using the CNVAR software to derive CYP2D6 diplotypes. 968 (14%) samples had quality flags indicating unexpected allele frequencies or CNV ratios, detection of a novel variant, or several diplotype solutions that fit the findings equally well. 102 (1.5%) samples were determined to have novel variants or novel hybrid genes. All of the remaining samples, except 55 (0.79%), could be resolved by visual inspection of CNVAR outputs. These 55 remaining samples were referred for additional Sanger sequencing to determine the actual diplotype and quantitative rtPCR to determine actual copy number.
Conclusions: CNVAR is a software tool which can detect CYP2D6 diplotypes, CNVs and hybrid genes from NGS short read technology. The software identifies samples that cannot be genotyped with certainty so that additional evaluation can be performed to derive the actual genotype. Novel variants and hybrid alleles were also identified so that variant curation and classification could be done.
This work was supported by Mayo Clinic Center for Individualized Medicine and the Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, National Institutes of Health grants U19GM61388 (The Pharmacogenomics Research Network), R01GM28157, U01HG005137, R01GM125633, R01AG034676 (The Rochester Epidemiology Project), and U01HG06379 and U01HG06379 Supplement (The Electronic Medical Record and Genomics (eMERGE) Network).
47
NETWORK ANALYSIS OF DISTINCT COHORTS ALLOWS FOR THE COMPARISON OF KEY BIOLOGICAL FUNCTIONS RELATED TO TB PATHOGENESIS
Carly Bobak, Meghan E. Muse, Alexander J. Titus, Brock C. Christensen, A. James O'Malley, Jane E. Hill
Dartmouth College
Bobak, Carly
Challenges with reproducibility of microarray datasets can limit the ability to analyze and interpret integrated gene expression datasets. One approach to tackle reproducibility across microarray datasets builds a multi-cohort framework using publicly available data to better mirror diverse populations seen in clinics. An alternative way of increasing the reproducibility of results is emphasizing underlying pathway or network level analyses. While differential expression of genes may vary between datasets and data analysis techniques, the biological processes underlying gene expression are more robust. The results from these analyses can drive hypotheses regarding the biological mechanisms behind diseases. We propose using a multi-cohort design and a pathway-level gene expression analysis to identify key biological processes in active Tuberculosis (TB) disease. A multi-cohort approach is particularly important when analyzing TB because phenotypic presentation of the disease differs among patients, especially those who are co-infected with human immunodeficiency virus (HIV), or among children. As such, these subgroups are often excluded from studies examining human gene expression array data. However, in 2016, 10% of incident TB cases were people living with HIV, and 10% were children, and despite the difficulty of studying these populations alongside adults, they make up a substantial proportion of both the currently TB infected and the overall TB susceptible population. To visualize differences across cohorts containing these TB subgroups, we use an approach called an Enrichment Map which allows us to represent each distinct dataset in one network. We selected three representative publicly available datasets (n=1148) and used Differential Expression and Gene Set Enrichment Analysis. Gene sets which were significantly enriched became the nodes of the network, with edges representative of the overlap between these gene sets. The results of these combined analyses were used as an input to Enrichment Map, to cluster and annotate important biological functions. The Enrichment Map network identified many processes expected based on current TB knowledge, such as interferon-gamma activity (6 gene sets). In addition, some other processes which represent potentially novel insights into the disease are identified. We examine one cluster of nodes related to DNA methylation (6 gene sets) in depth. The DNA methylation gene set within this cluster was strongly enriched in the dataset with no HIV+ patients (FDR=0.004) and appears to be enriched, although not significantly so, in the two datasets including HIV+ patients (FDR=0.518, 0.879). Further unsupervised analysis of DNA methylation genes within these sets reveals clear clustering of active TB patients from those with latent TB infection, irrespective of HIV status. Thus, we theorize that while conventional methods would not implicate DNA methylation as playing a role in active TB infection, by comparing enrichments across datasets at the network level we can observe patterns in gene expression with a finer degree of granularity.
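The network construction described above, with enriched gene sets as nodes and overlap as edges, can be sketched in a few lines. This is an illustration only and does not reproduce the Enrichment Map tool: the Jaccard index is used here as one common overlap measure, and the function name `overlap_edges`, the `min_jaccard` threshold, and the toy gene set memberships are all assumptions.

```python
from itertools import combinations


def overlap_edges(gene_sets, min_jaccard=0.25):
    """Connect pairs of non-empty gene sets whose Jaccard overlap meets a threshold."""
    edges = {}
    for (name_a, genes_a), (name_b, genes_b) in combinations(gene_sets.items(), 2):
        jaccard = len(genes_a & genes_b) / len(genes_a | genes_b)
        if jaccard >= min_jaccard:
            edges[(name_a, name_b)] = jaccard
    return edges


# Hypothetical enriched gene sets (memberships invented for illustration).
enriched = {
    "interferon_gamma_response": {"STAT1", "IRF1", "GBP1", "GBP5"},
    "interferon_signaling": {"STAT1", "IRF1", "GBP1", "OAS1"},
    "dna_methylation": {"DNMT1", "DNMT3A", "TET2"},
}
edges = overlap_edges(enriched)
```

In this toy input the two interferon-related sets share three of five genes and are linked, while the methylation set stays a separate node. Groups of linked nodes are what a network view then clusters and annotates as one biological theme.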
48
VARIATION IN OPIOID PRESCRIBING PATTERNS IN SURGICAL POPULATIONS
Soline M. Boussard1, Marylyn D. Ritchie2, Michelle Whirl-Carrillo3, Tina Hernandez-Boussard3, Teri E. Klein3
1Castilleja School, 2University of Pennsylvania, 3Stanford University
Boussard, Soline
Introduction: In clinical settings, patients' response to opioids can vary by as much as 40-fold. Common opioids require metabolism by liver enzyme CYP-2D6 and considerable variation exists in the amount of CYP-2D6 produced by individuals. Therefore, pharmacogenomics may shed light as to how to address different responses by finding the most effective medication and dosage for each patient. As a first step at identifying opportunities for personalized pain management, we analyzed postoperative pain and opioid prescribing patterns across four common surgeries known for high postoperative pain.
Methods: We used EHRs to identify patients undergoing 4 surgeries (total knee replacement (TKA), thoracotomy, distal radius fracture, and mastectomy). The main outcomes were discharge pain medications and postoperative pain scores. This research was possible through the use of structured EHR data and the mapping of medications to ontologies. Patients were identified using procedural codes; pain scores (pain scores range from 0 to 10 with 10 being the most severe) were identified from flowsheets within the EHR, and discharge medications were mapped to RxNorm. Data were aggregated to the patient level. Pain scores were averaged across different time points. RStudio was used for statistical computing and graphics. Chi-square, t-tests and analysis of variance were used for statistical testing.
Results: A total of 63,500 patients were included. The mean age was 61.31 (SD: 14.3), 65.3% were female, 62.1% were white and 13.3% were of Hispanic/Latino ethnicity. On average, pain scores were lower at 30-day follow-up than pre-operatively, and patients received 4.1 different types of opioids during their inpatient stay, with a majority of patients switching between hydrocodone and oxycodone. Total knee replacement represented 61.6% followed by 20.0% thoracotomy, 16.4% mastectomy and 2.0% distal radius fracture. At discharge, the majority of patients received oxycodone (69.15%) and hydrocodone (15.29%). In mastectomy, 47.89% received hydrocodone and 44.05% received oxycodone. For TKA, 78.20% received oxycodone, followed by 8.90% receiving tramadol. Follow-up pain was similar across the 4 surgeries; however, the follow-up pain differed by opioids received, with patients on oxymorphone having the highest follow-up pain (6.24) and patients on propoxyphene having the lowest pain (1.29, p<.0001).
Discussion: In this study that examines post-operative outcomes and prescriptions in a real-world setting, opioid prescribing patterns varied significantly across surgery type. Our data suggest codeine was associated with lower follow-up pain in TKA compared to other opioids. This data from real-world evidence suggests that we can use such methodology to identify a cohort of patients that may be targeted for genotyping for personalized medicine. Targeting patients with poor pain relief from opioids that require CYP-2D6 for activation could identify patients with gene variations that affect opioid metabolism. Future studies could look at which variants could affect patients' metabolism of codeine.
49
REGIONAL HETEROGENEITY IN GENE EXPRESSION, REGULATION AND COHERENCE IN HIPPOCAMPUS AND DORSOLATERAL PREFRONTAL CORTEX ACROSS DEVELOPMENT AND SCHIZOPHRENIA
Leonardo Collado-Torres1, Emily E. Burke1, Amy Peterson1, Joo Heon Shin1, Richard E. Straub1, Anandita Rajpurohit1, Stephen A. Semick1, William S. Ulrich1, BrainSeq Consortium, Cristian Valencia1, Ran Tao1, Amy Deep-Soboslay1, Thomas M. Hyde1, Joel E. Kleinman1, Daniel R. Weinberger1,+, Andrew E. Jaffe1,+
1Lieber Institute for Brain Development, Baltimore, MD, USA
Background: We previously identified widespread genetic, developmental, and schizophrenia-associated (SCZD) changes in polyadenylated RNAs in the dorsolateral prefrontal cortex (DLPFC), but the landscape of hippocampal (HIPPO) expression using RNA sequencing is less well-explored.
Methods: We performed RNA-seq using RiboZero on 900 samples across 551 individuals (SCZD N=286) in DLPFC (N=453) and HIPPO (N=447). We quantified expression of multiple feature summarizations of the Gencode v25 reference transcriptome, including genes, exons and splice junctions. Within and across brain regions, we modeled age-related changes in controls using linear splines, integrated genetic data to perform expression quantitative trait loci (eQTL) analyses, and performed differential expression analyses controlling for observed and latent confounders.
Results: We identified widespread developmental regulation between the DLPFC and HIPPO over aging, with 10,839 genes differentially expressed (Bonferroni < 0.01) and replicating in BrainSpan (n=79 tissue samples, DLPFC=40, HIPPO=39). Of these genes, 5,982 (55%) contain differentially expressed exons and splice junctions that replicated in BrainSpan. By extending quality surrogate variable analysis (qSVA) to multiple brain regions, we identified 48 and 245 differentially expressed genes (DEG) by SCZD diagnosis (FDR < 5%) in HIPPO and DLPFC, respectively, with surprisingly minimal overlap in DEG between the two brain regions. We further identified 205,618 brain region-dependent eQTLs (FDR < 1%) and found that 124 GWAS risk loci contain eQTLs in at least one of the regions. We also identify potential molecular correlates of in vivo evidence of altered prefrontal-hippocampal functional coherence in schizophrenia. Through our eQTL browser resource http://eqtl.brainseq.org/ we have made all eQTL sets available for further exploration.
Discussion: We show extensive regional specificity of developmental and genetic regulation, and SCZD-associated expression differences between HIPPO and DLPFC. These results underscore the complexity and regional heterogeneity of the transcriptional correlates of schizophrenia, and suggest future schizophrenia therapeutics may need to target molecular pathologies localized to specific brain regions.
50
FULL-LENGTH SEQUENCE ASSEMBLY AND CHARACTERIZATION OF HIGHLY PURIFIED CIRCRNA ISOFORMS
Supriyo De, Amaresh C. Panda, Myriam Gorospe
Laboratory of Genetics and Genomics, National Institute on Aging IRP, NIH
De, Supriyo
Circular RNAs are a large, heterogeneous class of highly stable noncoding RNAs, but they are poorly characterized. Many software tools exist for identifying circular RNAs by finding their circularizing junctions, but very little is known about the sequence of their full length or their isoforms/alternately spliced forms. The assembly and characterization of isoforms is also limited by the lack of methodologies to extract highly pure circRNAs. While exoribonuclease (RNase R) treatment is widely used to degrade linear RNAs and enrich circRNAs from total RNA, it does not efficiently eliminate all linear RNAs. This limitation complicates the assembly of full-length circRNAs. Here we describe a novel method for isolating highly pure circRNA populations involving RNase R treatment followed by Polyadenylation and poly(A)+ RNA Depletion (RPAD), which removes linear RNA to near completion. Once the RNA population is highly enriched, sequence assembly algorithms such as Cufflinks can be used to identify the body of the circRNA, while the circularizing/back-spliced junctions can be found using many different software tools such as CIRCexplorer, CIRI, etc. High-throughput sequencing of RNA prepared using RPAD from human cervical carcinoma HeLa cells and mouse C2C12 myoblasts, followed by this novel analysis pipeline, led to identification of many circRNA isoforms with an identical back-splice sequence (circularizing junction) but with different body sizes and sequences. As one of the main functions of circRNAs is sponging regulatory RNAs and proteins, full-length characterization of circRNA isoforms will be critical for enabling the functional characterization of circRNAs.
Acknowledgement: This research was supported by the Intramural Research Program of the National Institute on Aging, NIH.
51
A COMPREHENSIVE REVIEW AND ASSESSMENT OF EXISTING PATHWAY ANALYSIS APPROACHES
Tuan-Minh Nguyen1, Adib Shafi1, Tin Nguyen2, Sorin Draghici1
1Dept of Computer Science, Wayne State University; 2Dept of Computer Science, University of Nevada
Draghici, Sorin
In many high-throughput experiments, it is crucial to understand the biological mechanisms of genes and their products from expression data. Pathway analysis is a crucial step in any phenotype comparison because it allows us to gain insights into the underlying biological phenomena. Because of the importance of this type of analysis, more than 35 pathway analysis methods have been proposed so far. These can be divided into two main categories: non-topology-based (non-TB) and topology-based (TB) approaches. Non-TB methods consider pathways as simple gene sets and ignore the position and role of the genes, as well as the direction and type of signals described by the pathway, while TB methods include this additional information in the analysis. Although there are some review papers discussing this topic, no study has systematically assessed the performance of the methods using a large, unbiased collection of datasets. Furthermore, the majority of pathway analysis approaches rely on the assumption that p-values are uniformly distributed under the null hypothesis, which is not always true. None of the existing reviews takes the performance of the studied methods under the null into account in its comparisons. In order to provide an accurate and objective assessment so that researchers and biologists can choose a method suitable for their purpose, we provide an extensive analysis of 11 widely used pathway analysis methods from both non-TB and TB groups using 2601 samples from 75 human disease datasets, and of 8 methods using 121 samples from 11 knock-out mouse datasets. In addition, we investigate the extent to which each method is biased under the null hypothesis. Overall, the results show that TB methods perform better than non-TB methods since they take into consideration the topology information and signal propagation. Via permutation and bootstrap, we reach another critical conclusion: most if not all of the listed approaches are biased and produce very skewed results under the null.
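The null-calibration point made in this abstract can be illustrated with a small sketch. The Python below is not the authors' benchmark; it assumes, purely for illustration, that a method's p-values on null data are available as a plain list, and it measures how far their empirical distribution departs from Uniform(0, 1) using the Kolmogorov-Smirnov distance. The function name `ks_uniform_statistic` and the simulated inputs are hypothetical.

```python
import random


def ks_uniform_statistic(p_values):
    """Kolmogorov-Smirnov distance between the empirical CDF and Uniform(0, 1)."""
    xs = sorted(p_values)
    n = len(xs)
    distance = 0.0
    for i, x in enumerate(xs):
        # The uniform CDF at x is x itself; compare it with the empirical
        # CDF just after x ((i + 1) / n) and just before x (i / n).
        distance = max(distance, (i + 1) / n - x, x - i / n)
    return distance


rng = random.Random(0)
# Hypothetical null p-values from a well-calibrated method.
calibrated = [rng.random() for _ in range(2000)]
# Hypothetical null p-values skewed toward 0 (cubing a uniform draw).
skewed = [rng.random() ** 3 for _ in range(2000)]

d_calibrated = ks_uniform_statistic(calibrated)
d_skewed = ks_uniform_statistic(skewed)
```

A distance near 0 is what a well-calibrated method would be expected to produce, while a large distance indicates p-values that are skewed under the null.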
52
A NEW PHYLOGENETIC SAMPLING METHOD USING GENERALIZED-ENSEMBLE ALGORITHM
Tetsu Furukawa, Hiroyuki Toh
Department of Biomedical Chemistry, School of Science and Technology, Kwansei-Gakuin University, Sanda, Hyogo, Japan 669-1337
Furukawa, Tetsu
Bayesian inference has been widely utilized for evolutionary analysis, including phylogenetic tree reconstruction, where Monte Carlo sampling such as Markov chain Monte Carlo (MCMC) or Metropolis-coupled MCMC (MC3) generates a posterior distribution. Monte Carlo sampling is also utilized for molecular simulation of biopolymers like proteins and DNA. One of the representative methods is the replica exchange algorithm, which is equivalent to MC3 in molecular phylogeny. Besides the replica exchange algorithm, several different sampling methods have been developed for molecular simulation, which are collectively termed generalized-ensemble algorithms. In this study, we examined the possibility of applying other generalized-ensemble algorithms to tree reconstruction, in order to develop a more efficient sampling method for molecular phylogeny. A program implementing another generalized-ensemble algorithm was developed based on the source code of BEAST version 2.5.1. To evaluate its performance, artificial alignments were generated so that the posterior distributions of the corresponding trees are difficult to reproduce by sampling, i.e., distributions with multiple peaks. We applied our program and existing tools to the artificial data. Then, we compared results such as the time required for convergence and the degree to which the posterior distributions were reproduced. The benefits and pitfalls of our program will be discussed based on the comparison.
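As a rough illustration of the replica exchange (MC3) step mentioned in this abstract, the sketch below computes the standard Metropolis acceptance probability for swapping the states of two chains. It assumes chain i samples the posterior raised to the power beta_i (one common heating convention); the function name and arguments are illustrative and are not taken from BEAST.

```python
import math


def swap_acceptance(beta_i, beta_j, log_post_i, log_post_j):
    """Probability of accepting a state swap between two heated chains.

    beta_i, beta_j: inverse temperatures (assumed heating exponents).
    log_post_i, log_post_j: unheated log posterior of each chain's current state.
    """
    # Ratio of heated joint densities after and before the swap:
    # exp((beta_i - beta_j) * (log_post_j - log_post_i))
    log_ratio = (beta_i - beta_j) * (log_post_j - log_post_i)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


# The cold chain (beta = 1.0) holds a worse state than the heated chain
# (beta = 0.5), so the swap is always accepted.
always = swap_acceptance(1.0, 0.5, log_post_i=-10.0, log_post_j=-8.0)
# The cold chain already holds the better state, so the swap is accepted
# only with probability exp(-1).
sometimes = swap_acceptance(1.0, 0.5, log_post_i=-8.0, log_post_j=-10.0)
```

Swaps of this kind let states found by heated chains, which cross valleys between posterior peaks more easily, migrate to the cold chain.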
53
CONVERGENT MECHANISMS PERTURBED BY SCATTERED SNPS SUSCEPTIBLE TO ALZHEIMER'S DISEASE
Jiali Han1,2, Edwin Baldwin1, Jin Zhou3, Fei Yin4,5, Haiquan Li1,6
1University of Arizona, Department of Biosystems Engineering; 2University of Arizona, Department of Systems and Industrial Engineering; 3University of Arizona, Department of Public Health; 4University of Arizona, Department of Pharmacology; 5University of Arizona Center for Innovation in Brain Science; 6University of Arizona Center for Biomedical Informatics and Biostatistics
Han, Jiali
Alzheimer's Disease (AD) is the most prevalent neurodegenerative disorder, affecting approximately 50 million people worldwide. Genome-wide association studies (GWAS) have identified hundreds of single nucleotide polymorphisms (SNPs) associated with AD, while the effect size of each individual SNP is largely modest. The molecular mechanisms underlying these associations are yet to be understood. Our recent genomic analysis focused on unveiling common downstream biological effectors of intergenic SNPs associated with AD, aiming to understand the interactive and synergetic effects that genetic variants across non-coding and intergenic regions play in the pathogenesis of AD. In this study, data from GWAS and expression quantitative trait locus (eQTL) studies by the GTEx project are integrated, and downstream functional similarity between two SNPs is imputed using an enhanced multiscale information theoretic distance model [1]. The significance levels are determined through extensive permutations of the eQTL-derived multiscale network for mRNA overlap, functional similarity and shared biological processes [2]. Convergent molecular mechanisms based on gene ontology are prioritized at FDR < 0.05.
The prioritized mechanistic network for AD renders several functional modules perturbed by either cis-eQTL or trans-eQTL elements, corresponding to multiple common mechanisms downstream of distinct eQTLs, some of them cross-chromosome. For instance, SNPs on chromosomes six and one are both associated with antigen processing and presentation via regulating multiple human leukocyte antigen genes (e.g., HLA-DRB1 and HLA-DQA1) and cytokine genes, suggesting the genetic involvement of the immune system and neuroinflammation in the pathogenesis of AD. SNPs on chromosome 17 and chromosome 19 co-regulate genes involved in synaptic transmission, which is essential for neuronal communication and whose dysfunction in AD is known to lead to memory loss. Other than cross-chromosome SNPs, independent intergenic SNPs on the same chromosome also provide insights into AD genetic risks. A pair of SNPs on chromosome 17 is prioritized by our method through their convergent association with the MAPT gene, which encodes tau protein, regulates axon extension, and is known as a risk factor for a variety of neurodegenerative disorders including not only AD but also frontotemporal dementia and Parkinson disease. Another pair of SNPs on chromosome 19 is prioritized by their common association with the ABCA7 gene, which regulates lipid metabolism across cellular membranes and is suggested to be a susceptibility locus for late-onset AD.
This study suggests a new strategy connecting scattered AD-susceptible genetic variants with risk genes and convergent downstream mechanisms implicated in AD pathogenesis. The results will help to understand how genetic variants and underlying functional modules work interactively and systematically toward AD onset and could thus identify genetics-specific molecular targets and inspire new personalized therapeutic strategies.
[1] Li, H., et al. npj Genomic Medicine 1:16006, 2016.
[2] Han, J., et al. PSB, 2018, pp. 524-535.
54
IDENTIFICATION AND EVALUATION OF CO-EXPRESSION GENE NETWORKS FOR PACLITAXEL-INDUCED PERIPHERAL NEUROPATHY IN BREAST CANCER SURVIVORS
Kord M. Kober1, Jon D. Levine2, Judy Mastick1, Bruce Cooper1, Steven Paul1, Christine Miaskowski1
1UCSF School of Nursing, 2UCSF School of Medicine
Kober, Kord
Chronic chemotherapy-induced peripheral neuropathy (CIPN) is the most common and severe adverse drug reaction associated with neurotoxic chemotherapy (CTX), with prevalence rates that range from 30% to 70% in cancer survivors. No pharmacologic interventions are available to prevent CIPN. Lack of knowledge of the fundamental mechanisms that underlie CIPN thwarts our efforts to develop interventions to prevent or treat it. Increased knowledge of CIPN's molecular mechanisms could identify therapeutic targets for this condition. Paclitaxel (Taxol) is a common CTX drug that is associated with the development of CIPN, and paclitaxel-induced peripheral neuropathy (PIPN) is the dose-limiting toxicity of this drug. Findings from animal studies suggest that a number of diverse mechanisms are involved in the development of chronic PIPN, including damage to DRG cell bodies; microtubule-associated toxicity; inflammation; distal axonal injury; damage to the peripheral vasculature; modulation of ion channels; and mitochondrial dysfunction. The purpose of this pilot study was to evaluate coordinated expression variations of genes in RNA extracted from peripheral blood of breast cancer survivors and, from the resulting modules, to identify co-expressed genes that are associated with chronic PIPN. Gene expression in peripheral blood was assayed using RNA-seq in a sample of breast cancer (BC) survivors who did (n=25) and did not (n=25) develop PIPN. BC survivors with PIPN were significantly older; more likely to be unemployed; reported lower alcohol use; had a higher BMI and a poorer functional status; and had a higher number of lower extremity sites with loss of light touch, cold, and pain sensations, and higher vibration thresholds. No between-group differences were found in the cumulative dose of paclitaxel received or in the percentage of patients who had a dose reduction or delay due to PIPN. Co-expression network analysis was performed to identify modules of genes with highly correlated expression using the top 5000 most variant genes. Thirteen color-coded modules were detected, ranging in size from 36 to 1653 genes. The eigengene of the "black" module (n=1653 genes) was significantly correlated with the CIPN phenotype (Pearson R2=0.224, p=0.02). GO enrichment was found in inflammation-related terms (e.g., C-C chemokine receptor activity, Chemokine-mediated signaling pathway, T cell co-stimulation). Functional protein association network analysis identified an enrichment of protein-protein interactions (p < 0.0002), including highly connected genes that have previously been identified to be related to CIPN (i.e., G protein-coupled receptor 55, GPR55, and C-X-C Motif Chemokine Receptor 5, CXCR5). To our knowledge, this is the first study to apply systems biology approaches using circulating blood RNA-seq data in a sample of breast cancer survivors with and without chronic PIPN. We revealed networks and candidate genes associated with chronic PIPN related to inflammation, and suggest genes for validation and as potential therapeutic targets.
55
VARIFI - WEB-BASED AUTOMATIC VARIANT IDENTIFICATION, FILTERING AND ANNOTATION OF AMPLICON SEQUENCING DATA
Milica Krunic1, Peter Venhuizen2, Leonhard Müllauer3, Bettina Kaserer3, Arndt von Haeseler1,4
1Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna, Medical University of Vienna, Dr. Bohrgasse 9, 1030 Vienna, Austria; 2Department of Applied Genetics and Cell Biology, University of Natural Resources and Life Sciences, Muthgasse 18, 1190 Vienna, Austria; 3Institute of Pathology, Medical University Vienna, Währinger Gürtel 18-20, 1090 Vienna, Austria; 4Bioinformatics and Computational Biology, Faculty of Computer Science, University of Vienna, Vienna, Austria
Krunic, Milica
Fast and affordable benchtop sequencers are becoming more important in improving personalized medical treatment. Still, distinguishing genetic variants between healthy and diseased individuals from sequencing errors remains a challenge. Here we present VARIFI, a pipeline for finding reliable genetic variants (SNPs and INDELs). We optimized parameters in VARIFI by analyzing more than 170 amplicon sequenced cancer samples produced on the Personal Genome Machine (PGM). In contrast to existing pipelines, VARIFI combines different analysis methods and, based on their concordance, assigns a confidence score to every identified variant. Furthermore, VARIFI applies variant filters for biases associated with the sequencing technologies (e.g. incorrectly called homopolymer-associated indels with Ion Torrent). VARIFI automatically extracts variant information from publicly available databases and incorporates methods for variant effect prediction. VARIFI requires only little computational experience and no in-house compute power since the analyses are done on our server. VARIFI is a web-based tool available at varifi.cibiv.univie.ac.at.
56
STATISTICAL INFERENCE RELIEF (STIR) FEATURE SELECTION
Trang T. Le1, Ryan J. Urbanowicz1, Jason H. Moore1, Brett A. McKinney2
1Institute of Biomedical Informatics, Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA; 2Tandy School of Computer Science, Department of Mathematics, University of Tulsa, Tulsa, OK
Le, Trang
Motivation: Identifying relevant features in high-dimensional data can be challenging when their effect on an outcome may be obscured by a complex interaction architecture. Using nearest neighbors, Relief-based algorithms account for statistical interactions when selecting features. However, Relief-based estimators are non-parametric in the statistical sense that they do not have a parameterized model with an underlying probability distribution for the estimator, making it difficult to determine the statistical significance of Relief-based attribute estimates. Thus, a statistical inferential formalism is needed to avoid imposing arbitrary thresholds to select the most important features.
Method: We reconceptualize the Relief-based feature selection algorithm to create a new family of STatistical Inference Relief (STIR) estimators that retains the ability to identify interactions while incorporating sample variance of the nearest neighbor distances into the attribute importance estimation. This variance permits the calculation of statistical significance of features and adjustment for multiple testing of Relief-based scores. Specifically, we develop a pseudo t-test version of Relief-based algorithms for case-control data.
Results: We demonstrate the statistical power and control of type I error of the STIR family of feature selection methods on a panel of simulated data that exhibits properties reflected in real gene expression data, including main effects and network interaction effects. We show that the statistical performance using STIR p-values is the same as using permutation p-values, but STIR is much more computationally efficient. We compare the performance of STIR when the adaptive radius method is used as the nearest neighbor constructor with STIR when the fixed-k nearest neighbor constructor is used. Applying STIR to real RNA-Seq data from a study of major depressive disorder, we found that the 32 significant STIR genes include all 8 significant genes from a standard t-test. STIR genes outside of the intersection with the t-test may be good candidates for interaction effects.
Conclusion: STIR is the first method to use a theoretical distribution to calculate the statistical significance of Relief attribute scores without the computational expense of permutation. This validates the STIR pseudo t-test and means one can use it instead of costly permutation testing. The STIR formalism generalizes to all Relief-based neighbor finding algorithms, including MultiSURF. k=m/6 offers a better default than the pervasive use of k=10, which is an arbitrary choice in the early literature. Extensions of STIR will involve multi-class data, quantitative trait data (regression) and correction for covariates. Similarly, we envision regression-STIR to follow a linear model formalism. Future studies will apply STIR to GWAS as well as eQTL and other high-dimensional data to identify interaction effects.
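The pseudo t-test idea described in this abstract can be sketched as a two-sample t-statistic that compares, for one attribute, the attribute differences to nearest misses against those to nearest hits. This is a minimal illustration under the assumption of a pooled-variance statistic; the neighbor search is omitted, and the function name and toy inputs are hypothetical rather than the authors' implementation.

```python
import math
import statistics


def pseudo_t_statistic(miss_diffs, hit_diffs):
    """Pooled-variance t-statistic for one attribute.

    miss_diffs: attribute differences between instances and their nearest
                neighbors of the opposite class (misses).
    hit_diffs:  attribute differences between instances and their nearest
                neighbors of the same class (hits).
    Each list needs at least two values, and the pooled variance must be
    non-zero, for the statistic to be defined.
    """
    n_m, n_h = len(miss_diffs), len(hit_diffs)
    mean_m = statistics.mean(miss_diffs)
    mean_h = statistics.mean(hit_diffs)
    var_m = statistics.variance(miss_diffs)  # sample variance
    var_h = statistics.variance(hit_diffs)
    pooled_var = ((n_m - 1) * var_m + (n_h - 1) * var_h) / (n_m + n_h - 2)
    standard_error = math.sqrt(pooled_var * (1.0 / n_m + 1.0 / n_h))
    return (mean_m - mean_h) / standard_error


# Toy differences for a single attribute (made-up numbers).
t_stat = pseudo_t_statistic([1.0, 2.0, 3.0], [0.0, 1.0, 2.0])
```

A large positive value suggests the attribute separates classes more than it varies within a class, which is the intuition behind Relief-style scores; the variance term is what makes a significance calculation possible.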
57
DEEP LEARNING-BASED LONGITUDINAL HETEROGENEOUS DATA INTEGRATION FRAMEWORK FOR AD-RELEVANT FEATURE EXTRACTION
Garam Lee1, Kwangsik Nho2, Byungkon Kang1, Kyung-Ah Sohn1, Dokyoon Kim3
1Ajou University, 2Indiana University School of Medicine, 3Geisinger
Kim, Dokyoon
Alzheimer's disease (AD) is a progressive neurodegenerative condition marked by a decline in cognitive functions with no validated disease-modifying treatment. It is critical for timely treatment to detect AD in its earlier stage before clinical manifestation. Mild cognitive impairment (MCI) is an intermediate stage between cognitively normal older adults and AD. To predict conversion from MCI to probable AD, we applied a deep learning approach, a multimodal recurrent neural network. We developed an integrative framework that combines not only cross-sectional neuroimaging biomarkers at baseline but also longitudinal cerebrospinal fluid (CSF) and cognitive performance biomarkers obtained from the Alzheimer's Disease Neuroimaging Initiative cohort (ADNI). The proposed framework integrated longitudinal multi-domain data with missing values. The Python package LIFAD (Deep learning-based Longitudinal heterogeneous data Integration Framework for AD-relevant feature extraction) provides a pre-constructed deep learning architecture for a classification task. Our results showed that 1) our prediction model for MCI conversion to AD yielded up to 75% accuracy (area under the curve (AUC)=0.83) when using only a single modality of data separately; and 2) our prediction model achieved the best performance with 80% accuracy (AUC=0.86) when incorporating longitudinal multi-domain data. A multi-modal deep learning approach has potential to identify persons at risk of developing AD who might benefit most from a clinical trial or as a stratification approach within clinical trials.
58
MICROBIOME ANALYSIS OF UNEXPLAINED CASES OF PNEUMONIA IN SOUTH KOREA
Sooyeon Lim, Jae Kyung Lee, Ji Yun Noh, Woo Joo Kim
Department of Internal Medicine, Guro Hospital, Korea University
Lim, Sooyeon
Nasal swab samples were obtained from patients with symptoms of pneumonia through the tertiary hospital-based influenza surveillance system in South Korea during 2011-2017. Although the symptoms were suspected to be of viral cause, the collected samples were confirmed negative, using the respiratory virus panel, for 16 common respiratory pathogens, in addition to the following five viruses: Enterovirus D68, WU polyomavirus, KI polyomavirus, Parechovirus types 1, 3, and 6, and Pteropine orthoreovirus. Therefore, 16S rRNA screening was performed to study the microbiome community of the patients. V3 and V4 sequences of 16S rRNA were obtained using the Nextera XT DNA library preparation kit and MiSeq Reagent Kit v3 (Illumina). Microbiome profiles of 92 patient samples were obtained through Illumina MiSeq. The total taxonomic composition of the samples consisted of 99 bacterial genera whose sequences were detected in more than 1% of the samples. Common bacterial pathogens were present as either a single pathogen or in combination with other organisms in the patient samples. Although the samples differed in conditions such as age, gender, location, and season, common dominant genera of bacteria known as pathogens were revealed. The most dominant genera of bacteria were the following: Streptococcus, Corynebacterium, Haemophilus, Rhizobium. Comparative analysis showed that genus compositions were broadly similar but that microbial composition differed between age groups. We attempted to isolate dominant colonies through media culture for whole genome sequencing; single colonies were isolated and 8 species were identified using Sanger sequencing. After isolating more single colonies, we will focus on whole genome sequencing to investigate the cause of the pneumonia symptoms in detail.
59
POTRA: PATHWAY ANALYSIS OF CANCER GENOMICS DATA IN THE CLOUD
Margaret Linan1,2, Junwen Wang1,2, Valentin Dinu1,2
1Department of Biomedical Informatics, Arizona State University, Scottsdale, Arizona, USA; 2Department of Health Sciences, Mayo Clinic, Scottsdale, Arizona, USA
Dinu, Valentin
We have recently developed PoTRA (Pathways of Topological Rank Analysis), a novel algorithm that uses the Google Search PageRank algorithm to identify biological pathways involved in cancer. The analytical approach is motivated by the observation that loss of connectivity is a common topological trait of gene regulatory networks in cancer. We leveraged the Cancer Genomics Cloud (CGC) environment and applied PoTRA to analyze The Cancer Genome Atlas (TCGA) genomic data, a high-quality publicly available dataset of tumor and matched normal samples. The most influential pathways and the most dysregulated pathways in 17 TCGA projects were found using the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway database. Overall, "Pathways in cancer" is the most commonly dysregulated pathway, and the MAPK signaling pathway is the most influential, while the purine metabolism pathway is the most significantly dysregulated metabolic pathway. Additionally, genomic analysis workflows were created using Docker and Rabix for the detection of mRNA-mediated dysregulated pathways in the open-access TCGA repository with the PoTRA tool in the CGC platform. Our approach illustrates the advantages of employing powerful computational methods to analyze large genomic datasets with the aim of improving our understanding of cancer and identifying better diagnoses and treatments.
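The use of PageRank on pathway networks can be illustrated with a generic power-iteration sketch. The code below is not the PoTRA implementation; it assumes a pathway is given as an adjacency dictionary and compares the rank of a hypothetical hub gene in a "normal" network with its rank in a "tumor" network that has lost edges. All gene names and networks are invented for illustration.

```python
def pagerank(adjacency, damping=0.85, tol=1e-10, max_iter=1000):
    """PageRank by power iteration.

    adjacency maps every node to the list of nodes it links to; each node
    that appears as a target must also appear as a key.
    """
    nodes = sorted(adjacency)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not adjacency[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in adjacency[v]:
                new_rank[w] += damping * rank[v] / len(adjacency[v])
        if sum(abs(new_rank[v] - rank[v]) for v in nodes) < tol:
            return new_rank
        rank = new_rank
    return rank


# Hypothetical "normal" network: a hub gene connected to three other genes.
normal = {"HUB": ["g1", "g2", "g3"], "g1": ["HUB"], "g2": ["HUB"], "g3": ["HUB"]}
# Hypothetical "tumor" network: the hub has lost two of its connections.
tumor = {"HUB": ["g1"], "g1": ["HUB"], "g2": ["g3"], "g3": ["g2"]}

normal_rank = pagerank(normal)
tumor_rank = pagerank(tumor)
hub_rank_drop = normal_rank["HUB"] - tumor_rank["HUB"]
```

In this toy example the hub's rank falls when it loses edges, which is the kind of connectivity change the abstract describes as a topological trait of cancer networks.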
60
EVALUATING CELL LINES AS MODELS FOR METASTATIC CANCER THROUGH INTEGRATIVE ANALYSIS OF OPEN GENOMIC DATA
Ke Liu1, Patrick A. Newbury1, Benjamin S. Glicksberg2, William Zeng2, Eran R. Andrechek3, Bin Chen1
1Department of Pediatrics and Human Development, College of Human Medicine, Michigan State University, Grand Rapids, MI, USA; 2Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA, USA; 3Department of Physiology, Michigan State University, East Lansing, MI, USA
Chen, Bin
Metastasis is the most common cause of cancer-related death and, as such, there is an urgent need to discover new therapies to treat metastasized cancers. Cancer cell lines are widely used models to study cancer biology and test drug candidates. However, it is still unknown to what extent they adequately resemble the disease in patients. The recent accumulation of large-scale genomic data in cell lines, mouse models, and patient tissue samples provides an unprecedented opportunity to evaluate the suitability of cell lines for metastatic cancer research. In this work, we used breast cancer as a case study. The comprehensive comparison of the genetic profiles of 57 breast cancer cell lines with those of metastatic breast cancer samples revealed substantial genetic differences. In addition, we identified cell lines that more closely resemble different subtypes of metastatic breast cancer. Surprisingly, a combined analysis of mutation, copy number variation and gene expression data suggested that MDA-MB-231, the most commonly used triple negative cell line for metastatic breast cancer research, had little genomic similarity with Basal-like metastatic breast cancer samples. We further compared cell lines with organoids, a new type of preclinical model that has become more popular in recent years. We found that organoids outperformed cell lines in resembling the transcriptome of metastatic breast cancer samples. However, additional differential expression analysis suggested that neither type of model could mimic the effects of the tumor microenvironment, and each had its own bias towards modeling specific biological processes. Our work provides a guide for cell line selection in metastasis-related studies and sheds light on the potential of organoids in translational research.
61
PATHWAY ANALYSIS OF EHR AND NON-EHR-BASED GWAS CONNECTS LIPID METABOLISM TO THE IMMUNE RESPONSE
Jason E. Miller1, Thomas J. Hoffmann2,3, Elizabeth Theusch4, Carlos Iribarren5, Marisa W. Medina4, Neil Risch2,3,5, Ronald M. Krauss4, Marylyn D. Ritchie1
1Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA; 2Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA; 3Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, USA; 4Children’s Hospital Oakland Research Institute, Oakland, CA, USA; 5Division of Research, Kaiser Permanente, Northern California, Oakland, CA, USA
Miller, Jason
Pathway analysis is a commonly used method to interpret genome-wide association study (GWAS) results. Recently it has been illustrated that electronic health record (EHR) data from a single cohort can be used to perform GWAS. However, it is unclear how this new study design might affect replication of pathway-level results when compared to a non-EHR-based GWAS. It is also unclear how an EHR-based study will affect downstream analyses such as the identification of genes that are associated with said pathways. We propose evaluating the pathway-level similarities from analyses of two separate GWAS that used different methodologies to investigate the same traits. Here, we employ the software PARIS (Pathway Analysis by Randomization Incorporating Structure) to compare summary-level results across studies, thus making the approach more generalizable. PARIS generates randomized collections of features which mimic pathways to calculate empirical p-values. This process reduces type I error and the multiple testing burden. We compared EHR- to non-EHR-based GWAS results using four different lipid traits: low-density lipoprotein (LDL), high-density lipoprotein (HDL), triglycerides (TG), and total cholesterol (TC). The data came from two GWAS: the Genetic Epidemiology Resource on Adult Health and Aging (GERA), a single-cohort EHR-based GWAS, and the Global Lipids Genetics Consortium (GLGC), which used a meta-analysis study design. KEGG pathways expected to explain variation in lipid values such as "cholesterol metabolism" and "PPAR signaling pathway" were identified from both studies. Moreover, there was a significant overlap between the pathways identified between studies for the same traits (p < 1x10^-14). Thus, specific pathways can be replicated across distinct cohorts and study designs. Several pathways made up of genes whose proteins are important for an immune response were identified in both datasets and across multiple lipid traits. To see if lipid-modifying therapy affects the same pathways of interest, we performed pathway analysis of CAP RNA-seq expression from Theusch, E., et al., 2016, which measured expression in immortalized cells pre and post statin exposure. Among the pathways represented in both the PARIS results (p < 0.01) and LCL RNA-seq gene set enrichment results (FDR < 25%) are the "cholesterol metabolism" (CM) and "Hepatitis C" (HC) pathways. Hepatitis C virus (HCV) infection can cause chronic liver disease and is associated with a host of lipid and lipoprotein metabolic disorders. PARIS can also identify genes that are statistically significant within each pathway. Interestingly, in both the LDL and TC GWAS, the gene that was significantly associated with both the CM and HC pathways was low density lipoprotein receptor, or LDLR, a gene that affects both lipid metabolism and HC viral activity. Statins, in combination with other therapies, can increase efficacy of antiviral therapy by blocking viral replication. Our results highlight the need for further investigation into how genetic variation affects outcomes from the treatment of HCV with statins, particularly with respect to loci associated with lipid traits. In conclusion, pathway-level analysis of GWAS summary-level results can be used to characterize similarities across EHR and non-EHR-based studies and improve biological interpretation.
62
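The randomization idea behind the PARIS analysis in the abstract above can be sketched in a few lines. This is not the PARIS software; it assumes, for illustration only, that a pathway's score is the number of significant features it contains and that random feature sets of the same size form the null distribution. Feature names, set sizes and the `+1` correction are assumptions made for the example.

```python
import random


def empirical_p_value(observed, null_scores):
    """Fraction of randomized scores at least as large as the observed score.

    The +1 terms are a common convention that avoids reporting p = 0;
    whether PARIS uses them is not stated in the abstract.
    """
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)


def randomized_scores(significant, all_features, set_size, n_random, seed=0):
    """Count significant features in random feature sets of the pathway's size."""
    rng = random.Random(seed)
    pool = sorted(all_features)
    return [
        sum(1 for f in rng.sample(pool, set_size) if f in significant)
        for _ in range(n_random)
    ]


# Hypothetical data: 100 features, 10 of them significant in the GWAS.
all_features = {f"feature{i}" for i in range(100)}
significant = {f"feature{i}" for i in range(10)}
# Hypothetical pathway of 10 features, 8 of which are significant.
pathway = {f"feature{i}" for i in range(8)} | {"feature50", "feature51"}

observed = len(pathway & significant)
null = randomized_scores(significant, all_features, len(pathway), n_random=999)
p_value = empirical_p_value(observed, null)
```

Because the null sets are matched to the pathway's size, a large pathway is not favored simply for containing more features.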
META-ANALYSIS OF HETEROGENEITY AND BATCH EFFECTS IN THE A549 CELL LINE
Abigail Moore, John Castorino
School of Natural Sciences, Hampshire College, Amherst, MA
Moore, Abigail
Meta-analysis of RNA-seq data offers the opportunity to increase reproducibility by integrating data from multiple studies. Such analyses are challenged by heterogeneous cell culture and RNA-seq techniques, which may confound or hide true biological findings. Thus, we sought to identify batch characteristics that most significantly affect gene expression in a cell line common to lung cancer and viral studies. We queried the NCBI GEO for RNA-seq data from the A549 cell line and filtered the results for paired-end data obtained via total RNA extraction. Across eight studies, we downloaded raw RNA-seq data for 23 untreated samples and collected corresponding metadata. Differential expression analysis with Salmon, tximport and edgeR identified 3,802 differentially expressed genes (at least two-fold change, FDR < 0.05). Principal variance component analysis revealed that media choice alone explains 54% of expression variation within 139 differentially expressed lung cancer prognostic genes. Our findings highlight the impact of specific batch effects on biologically significant genes. In future work, we hope to extend this analysis to consider single nucleotide variants.
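The filtering criteria quoted in this abstract (at least a two-fold change and FDR < 0.05) can be illustrated in a few lines. The sketch below is not the Salmon, tximport and edgeR workflow itself; it assumes per-gene log2 fold changes and raw p-values are already available, applies the Benjamini-Hochberg adjustment, and keeps genes passing both thresholds. Gene names and numbers are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_genes(results, min_abs_log2fc=1.0, max_fdr=0.05):
    """Keep genes with |log2 fold change| >= threshold and adjusted p < max_fdr.

    results maps gene name -> (log2 fold change, raw p-value).
    """
    genes = sorted(results)
    adjusted = benjamini_hochberg([results[g][1] for g in genes])
    return [
        g for g, q in zip(genes, adjusted)
        if abs(results[g][0]) >= min_abs_log2fc and q < max_fdr
    ]


# Hypothetical per-gene results: (log2 fold change, raw p-value).
toy_results = {
    "geneA": (2.3, 0.001),
    "geneB": (0.4, 0.002),
    "geneC": (-1.5, 0.04),
    "geneD": (1.2, 0.5),
}
selected = select_genes(toy_results)
```

In this toy input, geneC has a raw p-value below 0.05 but does not pass after adjustment, and geneB is significant but changes by less than two-fold, so only geneA is kept.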
63
HYPERPARAMETER TUNING FOR CHIP-SEQ PEAK CALLING SOFTWARE TOOLS USING PARALLELIZED BAYESIAN OPTIMIZATION
Dongpin Oh, Jinhee Lee, Seonghyeon Kim, Dohyeon Lee, Dongwon Choo, Giltae Song
School of Computer Science and Engineering, Pusan National University
Lee, Jinhee
ChIP-seq is widely used to understand protein-DNA interaction and gene regulation. In ChIP-seq data analysis, identifying peak signals is one of the core computational steps, but most existing software tools still suffer from a large portion of false positive calls owing to sequencing errors and bias, in part caused by copy number variations. ChIP-seq analysis tools require hyperparameters set by users depending on sequencing quality and copy number variation rate. However, it is hard for users to know valid values of the hyperparameters before running the software tools. In addition, more false positive peak calls are produced for given ChIP-seq data if the hyperparameters of peak calling tools are less than optimal. In this study, we develop a software pipeline for identifying the optimal values of the hyperparameters in major ChIP-seq peak calling tools. First, we collect ChIP-seq data whose peak signals are labeled manually by experts. These data are used as training data in our hyperparameter tuning. Second, we define an objective function to measure the accuracy of peak calling results. Then we learn optimal hyperparameters using these training data and objective function based on Bayesian optimization. We use the Matern 5/2 kernel function for the optimization and Markov chain Monte Carlo for parallel processing. We validate our approach using our collection of ChIP-seq data labeled for around 2,000 genomic segments including peaks or no peaks. We apply our software pipeline to major ChIP-seq peak calling tools such as MACS, SICER, HOMER, and PeakSeq.
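For readers unfamiliar with the Matern 5/2 kernel named in this abstract, the sketch below evaluates its standard closed form, k(r) = variance * (1 + s + s^2/3) * exp(-s) with s = sqrt(5) * r / length_scale, where r is the Euclidean distance between two hyperparameter settings. The function name, default values and example points are illustrative assumptions and are not taken from the authors' pipeline.

```python
import math


def matern52(x, y, length_scale=1.0, variance=1.0):
    """Matern 5/2 covariance between two points given as equal-length sequences."""
    r = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    s = math.sqrt(5.0) * r / length_scale
    return variance * (1.0 + s + s * s / 3.0) * math.exp(-s)


# Two hypothetical hyperparameter settings, e.g. scaled to the unit square.
setting_a = [0.2, 0.5]
setting_b = [0.3, 0.9]

k_same = matern52(setting_a, setting_a)  # covariance of a point with itself
k_near = matern52(setting_a, setting_b)  # decays as the settings move apart
```

In Bayesian optimization this covariance lets a Gaussian process surrogate share information between nearby hyperparameter settings, so the objective does not need to be evaluated everywhere.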
64
CROSS-STUDY META-ANALYSIS IDENTIFIES ALTERED BACTERIAL STRAINS SEPARATING RESPONDER AND NON-RESPONDER POPULATIONS ACROSS MULTIPLE CHECKPOINT-INHIBITOR THERAPY DATASETS
Jayamary Divya Ravichandar, Erica Rutherford, Yonggan Wu, Thomas Weinmaier, Cheryl-Emiliane Chow, Shoko Iwai, Helena Kiefel, Kareem Graham, Karim Dabbagh, Todd DeSantis
SecondGenomeRavichandar,JayamaryDivyaThegutmicrobiotahasemergedasanimportantmodulatorincancerprogressionandagrowingbodyofevidencesupportstheinfluenceofgutmicrobiotaonresponsetocancertherapy,especiallyinthecontextofcheckpointinhibitortherapy.Whileseveralstudiespresentinsightintothelandscapeofmicrobialshiftsmodulatingresponsetocheckpointinhibitors,theymaybeundulyinfluencedbycohort,sequencing-technology,anddataanalysismethods.Further,individualstudiesareoftenunder-poweredtodetectmicrobesdifferentiallyabundantinresponderandnon-responderpopulations,whichcanlimittherapeuticdevelopment.Keytomicrobiome-baseddrugdiscoveryistheidentificationofproteinswiththerapeuticpotentialthatareefficaciousacrosscohorts.Herein,existingpublisheddatasetsinthecheckpoint-inhibitorspacewereminedandintegratedviaacross-studymeta-analysistoidentifybacterialstrainsseparatingresponderandnon-responderpopulations.Wecomparedthebaselinegutmicrobiotaassociatedwithstoolsamplescollectedfromfivediscretecancerpatientcohortsundergoingcheckpoint-inhibitortherapy.Samplesweresequencedononeormoretechnologies(Illumina16SNGS,45416SNGS,andIlluminashotgunmetagenomics)andatotalofsevenpublicly-availabledatasetswereanalyzedherein.Leveragingourmulti-facetedbioinformaticsplatform,whichenablesappropriatemethod-specificqualityfilteringandstatisticaltestingtoidentifydifferentiallyabundantbacteriaatthestrain-level,wewereabletosuccessfullyintegrateanalysisresultsacrossmultiplemicrobiome-profilingtechnologies.Weperformedarandomeffectsmodelbasedmeta-analysisandidentifiedstrainsthatwereconcordantlyenrichedinresponderpopulationsacrossdatasets.Inaseparateanalysiswealsoappliednaturallanguageprocessingtothetextofcancercheckpointinhibitorstudies(availableinPubmed)inordertoobtainadditionalinsightsaboutthemicrobiomeandstrainsofinterestfrompublicationswithnorawdataavailable.Thestrainsidentifiedhereinpresentopportunitiesforminingproteinswithpotentialtoimproveresponsetocheckpointinhibitors.Thiscrossstudymeta-analysisdemonstratesthepowerofSe
condGenome'sbioinformaticspipelinetoleveragepubliclyavailabledatasetsandsystematicallyintegratemicrobialshiftsnotonlyacrosssamplesfrommultiplecohortsbutalsoacrosssamplessequencedondifferenttechnologies.Ourin-housestraindatabasethatenablestaxonomicannotationdowntothestrain-levelallowedforcomparisonoffine-grainedbacterialidentitiesacrossdatasets,resolvingakeychallengewithmicrobiomemeta-analysis.Thissystematicandstatistically-drivenintegrationofdatasetsenabledidentificationofstrainsassociatedwithresponseacrossmultipleresponderpopulationsthatwerenotpreviouslyreportedintheindependentanalysisofthesedatasets.
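The random effects meta-analysis mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which estimator was used; the DerSimonian-Laird estimator below is an assumption chosen because it is a common default, and the function name and the per-dataset inputs are hypothetical.

```python
import math


def random_effects_meta(effects, variances):
    """Pool per-dataset effect sizes with a DerSimonian-Laird random effects model.

    effects   -- one effect size per dataset (e.g. a log fold change for a strain)
    variances -- the sampling variance of each effect size
    Returns (pooled_effect, standard_error, tau2), where tau2 is the estimated
    between-dataset variance.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]          # fixed-effect (inverse-variance) weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random effects weights add tau2 to every dataset's variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2


# Hypothetical effect sizes for one strain in three datasets.
pooled, se, tau2 = random_effects_meta([0.8, 0.5, 1.1], [0.04, 0.09, 0.06])
```

A strain would count as concordantly enriched when the pooled effect is positive and its confidence interval (roughly `pooled ± 1.96 * se`) excludes zero, though the thresholds the authors applied are not given in the abstract.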
65
A HYPOTHESIS OF THE STABILIZING ROLE OF ALU EXPANSION VIA HOMOLOGY DIRECTED REPAIR OF SPONTANEOUS DNA DOUBLE STRANDED BREAKS
Tanmoy Roychowdhury, Alexej Abyzov
Mayo Clinic
Abyzov, Alexej
Structural variations (SVs) in the human genome originate from different mechanisms related to DNA repair, replication errors, and retrotransposition. Our analyses of 26,927 SVs from the 1000 Genomes Project revealed differential distributions and consequences of SVs of different origin, e.g., deletions from non-allelic homologous recombination (NAHR) are more prone to disrupt chromatin organization while processed pseudogenes can create accessible chromatin. Spontaneous double stranded breaks (DSBs) are the best predictor of enrichment of NAHR deletions in open chromatin. This evidence, along with strong physical interaction of NAHR breakpoints belonging to the same deletion, suggests that the majority of NAHR deletions are non-meiotic, i.e., originate from errors during homology directed repair (HDR) of spontaneous DSBs. In turn, the origin of the spontaneous DSBs is associated with transcription factor binding in accessible chromatin, revealing the vulnerability of functional, open chromatin. The chromatin itself is enriched with repeats, particularly Alu elements that provide the homology required to maintain stability via HDR. Additionally, we observed a striking difference between distributions of fixed and variable Alus across genome compartments. Through co-localization of fixed Alus and NAHR deletions in open chromatin we hypothesize that old Alu expansion in the hominid lineage had a stabilizing role on the human genome.
66
STATISTICAL LEARNING WITH HIGH-DIMENSIONAL MASS CYTOMETRY DATA
Pratyaydipta Rudra1, Elena Hsieh2, Debashis Ghosh2
1Oklahoma State University, 2University of Colorado Denver
Rudra, Pratyaydipta
Recent developments in single-cell based technologies, such as mass cytometry (CyTOF), have led to the need for computational and analytic approaches that can accommodate the high dimensionality and single-cell granularity. The analysis of CyTOF data can elucidate novel disease biomarkers and mechanisms of the underlying immunopathology, leading to improved treatments and prognostic measures. The use of single-cell technologies allows for consideration of expression from both a spatial and temporal framework. In spite of the promising nature of these platforms, much work remains in order to be able to meaningfully interpret the data in the context of biological questions. While end-to-end reproducible methods exist for fluorescence flow cytometry data analysis, they do not scale well for CyTOF data, which have much higher dimensionality. The data are often clustered into cell sub-populations first, which can then be used to answer scientific questions regarding the abundance of cell types and expressions of specific parameters (e.g. surface markers, signaling proteins, cytokines) across groups, such as disease and control groups, or stimulation regimes. The statistical questions about the tree-structured cell population data can be visualized in two layers. First, it is clinically interesting to know if the abundance of the cell subpopulations is different across two or more groups and/or conditions. Given the proportion of cell types for each sample, the next question is whether there is any differential expression of signaling proteins or cytokines (functional measurements of the cell populations studied). Modeling data with multiple layers of correlation using a classical parametric model often becomes a challenging task. The classical parametric models also have limiting distributional assumptions such as normality, which may not be true for cytometry data. In order to tackle this, we developed a new statistical learning methodology based on the kernel distance covariance framework to compare the cell type composition across different disease groups and stimulation conditions. High-dimensional statistical learning using a kernel machine regression is also developed to test the difference in cytokine expression levels across different cell types and different conditions. The methods are applied to a high-dimensional dataset we collected containing different subgroups of populations including Systemic Lupus Erythematosus patients and healthy control subjects. The samples from the peripheral blood of the subjects were treated using three different stimulation methods. Preliminary analysis of the data revealed clinically relevant patterns such as differential cell type abundance between the disease and the control group, and also differential expression of several cytokines. For example, the expression of the cytokines MCP1, Mip1b and IL-1RA was found to be different among CD14 high monocytes across the two groups. An extensive simulation study to compare our statistical method with the existing approaches is currently being conducted.
67
HARDWARE ACCELERATION OF APPROXIMATE STRING MATCHING FOR BOTH SHORT AND LONG READ MAPPING
Damla Senol Cali1, Lavanya Subramanian2, Zülal Bingöl3, Jeremie S. Kim1,4, Rachata Ausavarungnirun1, Anant V. Nori2, Gurpreet S. Kalsi2, Sreenivas Subramoney2, Saugata Ghose1, Can Alkan3, Onur Mutlu1,4
1Carnegie Mellon University, 2Intel Labs, 3Bilkent University, 4ETH Zurich
Kim, Jeremie
High throughput sequencing (HTS) technology enables fast and inexpensive generation of billions of DNA sequences (i.e., reads) from a genome [1,2]. To quickly and accurately process the plethora of reads, we need new computational techniques. Analyzing HTS data requires finding the original locations of each read via an approximate string matching process against a long reference genome. Approximate string matching is typically performed with an expensive dynamic programming algorithm, which consumes over 90% of the first step's execution time. Many prior studies [3,4] have identified this bottleneck in mapping and have proposed numerous methods for accelerating this expensive step on a wide array of computational platforms. Our goal in this work is to provide a fast and efficient implementation of approximate string matching towards enabling faster read mapping. We choose to accelerate Bitap [5,6] due to its ability to perform approximate string matching with fast and simple bitwise operations that can be highly parallelized for high throughput. We modified the algorithm to enable searching longer patterns and to remove the data dependency between the iterations and provide parallelism for the large number of iterations. Unfortunately, in our study of Bitap on existing systems, we find that CPUs and GPUs alone are both limited by their respective architectures and thus cannot fully utilize the available hardware for maximal efficiency. Specifically, we find that the CPU implementation of Bitap is bottlenecked by computation, since the working set fits within the L1 cache and the limited number of cores prevents further parallel speedup. The GPU implementation of Bitap is bottlenecked by the limited amount of private memory and destructive interference of threads while accessing the shared memory. In order to overcome the imbalance in each of the above systems, we propose a custom accelerator for Bitap with characteristics that fall between the CPU and GPU. This achieves a finer balance in compute resources and memory for higher performance in approximate string matching. We also explore the design space of various accelerators, including processing in memory.
REFERENCES
[1] Alkan, Can, et al. "Limitations of Next-generation Genome Sequence Assembly," Nature Methods, 2011.
[2] Van Dijk, Erwin L., et al. "Ten Years of Next-generation Sequencing Technology," Trends in Genetics, 2014.
[3] Alser, Mohammed, et al. "GateKeeper: A New Hardware Architecture for Accelerating Pre-alignment in DNA Short Read Mapping," Bioinformatics, 2017.
[4] Kim, Jeremie S., et al. "GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping using Processing-in-Memory Technologies," BMC Genomics, 2018.
[5] Baeza-Yates, Ricardo, et al. "A New Approach to Text Searching," Communications of the ACM, 1992.
[6] Wu, Sun, et al. "Fast Text Search Allowing Errors," Communications of the ACM, 1992.
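For readers unfamiliar with Bitap, the following is a minimal software sketch of the textbook bit-parallel algorithm with up to `k` edits (substitutions, insertions, deletions). It is not the authors' modified algorithm or their accelerator design; the function name `bitap_search` is illustrative, and Python's arbitrary-width integers stand in for the machine words a hardware or C implementation would use.

```python
def bitap_search(text, pattern, k):
    """Return the 0-based end positions in `text` where `pattern` matches
    with at most `k` edits (substitution, insertion, or deletion)."""
    m = len(pattern)
    if m == 0:
        return []
    full = (1 << m) - 1
    # masks[c] has bit i set when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    # state[d] has bit i set when pattern[:i + 1] matches a suffix of the
    # text read so far with at most d edits.
    state = [(1 << d) - 1 for d in range(k + 1)]
    ends = []
    for pos, ch in enumerate(text):
        mask = masks.get(ch, 0)
        prev_old = state[0]                      # state[d - 1] before this character
        state[0] = ((state[0] << 1) | 1) & mask
        for d in range(1, k + 1):
            cur_old = state[d]
            state[d] = (
                (((cur_old << 1) | 1) & mask)    # characters match
                | prev_old                       # extra character in the text
                | (prev_old << 1)                # substituted character
                | (state[d - 1] << 1)            # character missing from the text
                | 1
            ) & full
            prev_old = cur_old
        if state[k] & (1 << (m - 1)):
            ends.append(pos)
    return ends


# A read fragment searched against a short reference with one allowed edit.
hits = bitap_search("XXGCAXX", "GTA", 1)
```

Each text character updates every error level with a handful of shifts, ANDs, and ORs. The update of level `d` depends on level `d - 1` for the same character, which is the kind of data dependency between iterations that the abstract says the authors remove.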
68
TRANSITION OF REGULATORY FORCE TOWARD THE GENE EXPRESSIONS DURING OSTEOBLAST CELL DIFFERENTIATION
Yoichi Takenaka
Kansai University
Takenaka, Yoichi
Understanding the dynamics of the cell differentiation system is one of the big issues in biology and medicine. It helps to acquire cells of diseased organs from pluripotent stem cells such as ES cells or iPS cells. To analyze the dynamics, the time-series gene expression profiles of cell lines from various organisms have been measured, and the movement of the gene expression has been made clear. However, the dynamics of the system, such as the gene regulation mechanism, are not yet well understood.
In the poster, the author shows the dynamics of gene regulations during the osteoblast cell differentiation process from mesenchymal stem cells. There are many genes and gene regulations that are known to be active during the process. However, the activity time of each regulation and the strength of the activated regulations have not been reported. The author proposes a method to elucidate the transitions between the activation and inactivation of gene regulations at the temporal resolution of single time points. The method measures the strength of the gene regulations at each time point in a leave-one-time-point-out manner. Then it decomposes the time series of the gene expression data into partial series using an information criterion. Finally, it determines whether each gene regulation in each partial time series is activated or inactivated.
The gene expression profile of the osteoblast cell differentiation process includes 65 time points ranging from minus 6 hours to 192 hours, where 0 hours is the time the cell differentiation process starts. The profile was downloaded from the Genome Network Platform of the National Institute of Genetics, Japan. The gene regulatory network that is activated at at least one time point during the differentiation was composed from three reviewed papers. It includes 19 genes and 22 regulations, where Runx2, the key transcription factor associated with osteoblast differentiation, is located at the center of the network. Osx (transcription factor Sp7), which serves as a marker for osteoblast differentiation, is downstream of Runx2.
The result shows there are four distinct periods during the osteoblast cell differentiation, and each period indicates when the expressions of genes are strongly controlled.
Before the cell differentiation process starts, Osx, BMP2, DLX5 and HDAC3 are the most strongly controlled among all the 65 time points. Next, EP300 is controlled strongly in the first period. Then, Creb, HDAC3, HDAC4, HDAC5 and Osx are. And in the final period, Runx2, Bglap, DLX5, HDAC7 and SMAD6 are. The analysis gives a hint on how to control the cell differentiation process.
69
METHYLATION PROFILES OF MELANOMA TO PREDICT TILS
Yihsuan Tsai1, Nana Nikolaishvili Feinberg1, Kathleen Conway2, Sharon N. Edmiston1, Nancy E. Thomas3, Joel S. Parker4
1Lineberger Comprehensive Cancer Center (LCCC), University of North Carolina at Chapel Hill; 2Department of Epidemiology, School of Public Health, Department of Dermatology, School of Medicine, Lineberger Comprehensive Cancer Center (LCCC), University of North Carolina at Chapel Hill; 3Department of Dermatology, School of Medicine, Lineberger Comprehensive Cancer Center (LCCC), University of North Carolina at Chapel Hill; 4Lineberger Comprehensive Cancer Center (LCCC), Department of Genetics, School of Medicine
Tsai, Yihsuan
Correlations between tumor infiltrating lymphocytes (TILs) and prolonged survival have been reported in many cancers including melanoma. However, current TIL assessment by pathologists reviewing the slide sections is not always ideal. Inter-observer agreement between pathologists may be low if the assessment is quantitative. To achieve a higher agreement, the estimates may be translated to categories. Here we proposed to train an epigenomics model to estimate the T-cell populations in melanoma samples using immunofluorescence (IF) images of CD3 and CD8 T-cells, which provides a more objective estimation of TILs.
In previous work, we generated methylation profiles for 89 melanoma and 78 nevi samples. To have a gold standard of TIL estimate, 80 out of the 89 melanoma samples were stained with IF to image CD3, CD8, S100 (melanoma marker) and a nuclear counterstain. We defined the fraction of CD3 and/or CD8 positive cells as the T-cell fraction and found that its estimate from the IF image has the most significant association with patient survival. Therefore, an elastic net model was built using features from the methylation dataset with T-cell fraction estimates from the IF image as response. Monte-Carlo cross validation was performed on 2/3 of the samples to tune the parameters. We identified 121 CpGs in the final model to estimate T-cell fraction, which gave us the highest correlation, with Pearson r = 0.87 in validation and r = 0.91 in all samples. We also compared this method with two other methods. In a naïve method, we identified CpGs with high methylation level in external lymphocyte samples and low in our nevi samples. These probes represent a lymphocyte methylation signature on an unmethylated nevi background. Therefore, we calculated the mode of the kernel-smoothed DNA methylation distribution at these sites for each sample as a surrogate for lymphocyte fraction for that sample. This method gave a correlation relative to the gold standard of R = 0.64. Another method uses reference-based cell deconvolution algorithms, where a pre-built methylation reference was used to compute the fractions of each cell type via three different algorithms. While all three algorithms gave similar results, Robust Partial Correlations (RPC) provides the highest correlation with the gold standard (R = 0.58).
We then applied our final model (121 methylation markers) to an external dataset, TCGA-SKCM, to estimate the T-cell fractions. Since there is no gold standard for the TCGA-SKCM dataset, we used survival as a surrogate. We found the T-cell fraction estimate from our model had a strong survival association (Cox p-value = 3.85e-05). We will look at the correlation of our estimation with expression of T-cell gene modules next.
In summary, the predicted T-cell fraction from our methylation markers has very high correlation with the estimates from IF images and it is also highly correlated with patient survival.
70
HIGH-THROUGHPUT GENE TO KNOWLEDGE MAPPING THROUGH MASSIVE INTEGRATION OF PUBLIC SEQUENCING DATA
Brian Tsui, Hannah Carter
Department of Medicine, University of California, San Diego
Tsui, Brian Y.
The Sequence Read Archive contains more than one million runs of publicly available sequencing data. However, the lack of consistently preprocessed summary and molecular quantification data (for example, gene expression quantification for RNA-seq) for each sequencing run hinders efficient Big Data interpolation. Here, we introduce Skymap, a standalone database that offers a single, multi-species data matrix incorporating all public sequencing studies. The data matrix contains several omic layers, including expression quantification, allelic read counts, microbe read counts, and ChIP-seq. We reprocessed petabytes of sequencing data to generate the data matrix for each data type. We also offer a reprocessed biological metadata file that describes the relationships between the sequencing runs and the associated keywords, extracted from over 3 million free text annotations using natural language processing. The processed data can fit into a single hard drive (<500 GB). In https://github.com/brianyiktaktsui/Skymap, we showcase how one can (1) retrieve and analyze the SNPs and expression of a genetic variant across >250k runs in less than a minute and (2) increase the temporal resolution for tracking gene expression in the mouse developmental hierarchy.
71
MANTA-RAE, PREDICTING THE IMPACT OF GENOME VARIANTS ON THE TRANSCRIPTION FACTOR BINDING POTENTIAL OF REGULATORY ELEMENTS
Robin van der Lee, Phillip A. Richmond, Oriol Fornes, Wyeth W. Wasserman
Centre for Molecular Medicine and Therapeutics - Department of Medical Genetics - BC Children’s Hospital Research Institute - University of British Columbia - Vancouver, Canada
van der Lee, Robin
Interpreting the functional impact and pathogenicity of noncoding variants remains challenging. Increasing evidence suggests an important role for alterations that impact cis-regulatory elements and transcription factor (TF) binding sites (TFBSs). We are developing MANTA-RAE, a tool for Mutational ANalysis of Tfbs Alterations by Reconstruction of Altered regulatory Elements. MANTA-RAE will predict the effects of variants on TFBSs in regulatory elements in a three-step approach: (i) reconstructing reference and alternative genotypes based on user-supplied sets of genomic variants and regulatory elements, (ii) predicting TFBSs through sequence scanning with curated TF binding models from JASPAR, and (iii) delta regulatory capacity analysis by comparing the TFBS potential of the reference and alternative sequences. MANTA-RAE will have the capacity to evaluate (i) both losses and gains of TFBSs and (ii) changes beyond single nucleotide variants, including small insertions, deletions, and larger copy number changes. Envisioned applications include prioritization of variants from rare disease and cancer genomes. These features should contribute to richer detection of regulation-altering noncoding variants that may contribute to disease.
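Steps (ii) and (iii) of the approach described above can be sketched with a toy position weight matrix. Everything in this example is an illustrative assumption: the matrix `TOY_PWM` is made up rather than taken from JASPAR, only the forward strand is scanned, and the "best score" summary is one of several plausible ways to define TFBS potential. The abstract does not specify MANTA-RAE's actual scoring.

```python
# Hypothetical 3-column log-odds matrix favouring the motif "ACG".
TOY_PWM = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]


def best_pwm_score(seq, pwm):
    """Scan every window of `seq` (forward strand only) and return the
    highest summed PWM score, or -inf when `seq` is shorter than the matrix."""
    width = len(pwm)
    best = float("-inf")
    for start in range(len(seq) - width + 1):
        score = sum(pwm[i][seq[start + i]] for i in range(width))
        best = max(best, score)
    return best


def delta_binding(ref_seq, alt_seq, pwm):
    """Difference in best binding score between the alternative and the
    reference sequence; negative values suggest a weakened or lost site."""
    return best_pwm_score(alt_seq, pwm) - best_pwm_score(ref_seq, pwm)


# A single nucleotide variant (C -> T) inside the motif of a regulatory element.
reference = "TTACGTT"
alternative = "TTATGTT"
delta = delta_binding(reference, alternative, TOY_PWM)
```

Because whole sequences are compared rather than single positions, the same comparison applies unchanged to insertions and deletions, where the reference and alternative sequences differ in length.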
72
USING QUANTITATIVE PHOSPHOPROTEOMICS TO UNDERSTAND FUNCTIONAL SELECTIVITY OF RECEPTOR TYROSINE KINASES
J. Watson, C. Francavilla, J. M. Schwartz
Faculty of Biology, Medicine and Health, University of Manchester
Watson, Joanne
Cell signalling is the process of translating extracellular messages, or signals, to the inside of the cell in order to coordinate cellular activity. Cells receive signals from the external environment in a myriad of ways, including by the binding of extracellular proteins, called ligands, to receptors on the surface of the cell. Upon ligand binding, the signal is transmitted across the cell surface by the receptor and the signal propagates through the cell, primarily by the post-translational modification of proteins. For the receptor tyrosine kinase (RTK) family, this process is mediated by phosphorylation, a modification which is added to serine, threonine or tyrosine residues of proteins by the activity of kinases and removed by phosphatases. The addition of phosphoryl groups is associated with activation of protein function. Ligand binding induces RTK dimerization and activation of kinase activity, allowing full activation of the receptor. This initiates a sequential cascade of protein phosphorylation, ultimately regulating transcription factor activity to modulate cellular behavior. An unanswered question in the field is how different ligands binding to the same receptor induce distinct signaling cascades, defined by changes in phosphorylation dynamics and consequent cellular behavior, a concept known as functional selectivity. This is demonstrated by fibroblast growth factor (FGF) receptor 2b; when stimulated by either FGF7 or FGF10, an increase in proliferation or migration, respectively, is observed. Quantitative phosphoproteomics is a powerful method for comparing on a global scale the signaling cascades inducing these different behaviors. This comparison will allow us to define patterns of phosphorylation associated with signalling by different ligands, and use this to identify key phosphorylation sites associated with particular cell behaviors. We have developed a workflow to interrogate temporal phosphoproteomics datasets to directly compare the phosphorylation dynamics of cells stimulated by different ligands. As proteins may have multiple phosphorylation sites which can have independent effects and regulation, our approach considers data on the level of both the phosphorylation site and associated protein. Initial clustering of phosphorylated sites with similar dynamics over time is followed by protein-level analysis of functional similarity, using connectivity in graph databases, enrichment for ontological terms, and roles in well-studied signalling pathways (extracted from KEGG). Subsequent steps in the workflow aim to move the analysis from the protein to the phosphorylated sites. By integrating network-based analyses with phosphoproteomics data, we will develop novel methods for understanding and visualizing the role of phosphorylation in functional selectivity.
73
ANERIS APPLIED: SPARK-ENABLED ANALYTICS FOR FULL-SCALE AND REPRODUCIBLE ANNOTATION-BASED GENOMIC STUDIES
Nicholas Wheeler, Jeremy Fondran, Penny Benchek, Jonathan Haines, William S. Bush
Case Western Reserve University
Wheeler, Nicholas
Modern genomic studies are rapidly growing in scale, and the analytical approaches used to analyze genomic data are increasing in complexity. Genomic data management poses logistic and computational challenges, and analyses are increasingly reliant on genomic annotation resources that create their own data management and versioning issues. As a result, genomic datasets are increasingly handled in ways that limit the rigor and reproducibility of many analyses. In this work, we describe an analysis framework based on Spark infrastructure that provides management, rapid access, and flexible analysis of genomic data. By storing large-scale genomic and variant annotation resources alongside genomic data in a distributed system, we provide efficient methods for testing a variety of biologically-driven hypotheses for rare variants. Using the well-established Spark framework and analyses designed using Jupyter notebooks, we provide tools that improve processing speed, reduce user-driven data partitioning, and enhance the reproducibility of large-scale genomic studies.
74
PUTTING RELICANTHUS IN ITS PLACE: IMPACT OF MIXTURE MODEL CHOICE ON PHYLOGENETIC RECONSTRUCTION
Madelyne Xiao1, Mercer R. Brugler2, Estefania Rodriguez1
1Department of Invertebrate Zoology, American Museum of Natural History, Central Park West at 79th Street, New York, NY 10024; 2Biological Sciences Department, NYC College of Technology (CUNY), 285 Jay Street, Brooklyn, NY 11201
Xiao, Madelyne
First described in 2006, Relicanthus daphneae is a deep-sea anthozoan that lives on the ocean floor near hydrothermal vents in the East Pacific. It was originally classified as an anemone until a phylogenetic analysis in 2014 called this classification into question. The tree resulting from a maximum likelihood analysis for the Order Actiniaria (anemones) placed Relicanthus outside of Actiniaria; a recent analysis of Relicanthus' mitochondrial gene order, however, suggests its membership among the anemones. An ongoing study seeks to relate the choice of mixture model (e.g., maximum likelihood, maximum parsimony, Bayesian inference) to the resulting phylogenetic tree, taking into account the robustness of the dataset in question (number of genes, specimens, etc). In particular, we are interested in the impact of mixture model choice on the placement of Relicanthus with respect to the actiniarians.
75
RATIONAL DESIGN OF NOVEL SKP2 INHIBITORS USING DEEP NEURAL NETWORKS
Shuxing Zhang, Beibei Huang, Lon W. Fong
Intelligent Molecular Discovery Laboratory, Department of Experimental Therapeutics, MD Anderson Cancer Center, Houston, TX 77054
Zhang, Shuxing
Deep learning techniques have recently gained more and more attention, as they show significant promise in generating predictive models for pharmaceutical research. In the present study, we attempt to develop a deep neural network method to design novel therapeutic agents for triple-negative breast cancer (TNBC) by targeting a crucial E3 ligase, Skp2. TNBC represents about 20% of breast cancer cases. It is highly aggressive with poor clinical outcome, and no targeted agents have been shown to be clinically effective in treating TNBC. Skp2 is an F-box protein, constituting one of the four subunits of the Skp1-Cullin-1 (Cul-1)-F-Box (SCF) ubiquitin E3 ligase complex. Earlier studies showed that Skp2 regulates cell cycle progression and proliferation by targeting ubiquitination and degradation of its substrates such as the cell cycle inhibitor p27. Our in-house data also revealed that Skp2 was overexpressed in TNBC and correlated with poor prognosis. In addition, we revealed that genetic Skp2 inactivation also triggered a massive cellular senescence and/or apoptosis response in a p19Arf/p53-independent, but p27-dependent manner. Taken together, our results suggest that targeting Skp2 may represent a general "pro-senescence/apoptosis" and "anti-glycolysis" approach and is a promising therapeutic strategy for TNBC development and metastasis. Herein we developed a novel deep neural network (DNN) method to predict TNBC cell responses to drugs based solely on their chemical features. In particular, a cost function was employed to suppress overfitting. We also adopted an "early stopping" strategy to further reduce overfitting and improve the accuracy of our models. Currently the software has been integrated with a genetic algorithm-based variable selection approach and implemented as part of our DL4DR package. We observed that DL4DR could handle big datasets efficiently, significantly outperforming other methods in model-building and prediction and obtaining better results in big data analysis. When employed to predict drug responses of several highly aggressive TNBC cell lines, DL4DR produced robust and accurate predictions. Therefore, we applied these TNBC models to rationally design new small molecule inhibitors by targeting Skp2. After screening of millions of chemical compounds and designing novel structures based on our lead compound ZL25, we conducted a series of biochemical and cellular studies. These experimental examinations demonstrate that the top ranked molecules indeed inhibit Skp2 E3 ubiquitination functions significantly and kill TNBC cells effectively. Hence DL4DR has been used for our lead optimization of Skp2 inhibitors, and we anticipate that it can be employed as a general tool for hit identification and rational lead design for cancer therapeutics development.
76
PATTERN RECOGNITION IN BIOMEDICAL DATA: CHALLENGES IN PUTTING BIG DATA TO WORK
POSTER PRESENTATIONS
77
ODAL: A ONE-SHOT DISTRIBUTED ALGORITHM TO PERFORM LOGISTIC REGRESSIONS ON ELECTRONIC HEALTH RECORDS DATA FROM MULTIPLE CLINICAL SITES
Rui Duan, Mary Regina Boland, Jason H. Moore, Yong Chen
Department of Biostatistics, Epidemiology & Informatics, University of Pennsylvania
Chen, Yong
Electronic Health Records (EHR) contain extensive information on various health outcomes and risk factors, and therefore have been broadly used in healthcare research. Integrating EHR data from multiple clinical sites can accelerate knowledge discovery and risk prediction by providing a larger sample size in a more general population, which potentially reduces clinical bias and improves estimation and prediction accuracy. To overcome the barrier of patient-level data sharing, distributed algorithms are developed to conduct statistical analyses across multiple sites through sharing only aggregated information. Current distributed algorithms often require iterative information evaluation and transfer across sites, which can potentially lead to a high communication cost in practical settings. In this study, we propose a privacy-preserving and communication-efficient distributed algorithm for logistic regression without requiring iterative communications across sites. Our simulation study showed our algorithm reached accuracy comparable to that of the oracle estimator, where data are pooled together. We applied our algorithm to EHR data from the University of Pennsylvania health system to evaluate the risks of fetal loss due to various medication exposures.
78
PLATYPUS: A MULTIPLE-VIEW LEARNING PREDICTIVE FRAMEWORK FOR CANCER DRUG SENSITIVITY PREDICTION
Kiley Graim1, Verena Friedl2, Kathleen E. Houlahan3, Joshua M. Stuart3
1Flatiron Institute & Princeton University, 2University of California Santa Cruz, 3Ontario Institute of Cancer Research
Friedl, Verena
Cancer is a complex collection of diseases that are to some degree unique to each patient. Precision oncology aims to identify the best drug treatment regime using molecular data on tumor samples. While omics-level data is becoming more widely available for tumor specimens, the datasets upon which computational learning methods can be trained vary in coverage from sample to sample and from data type to data type. Methods that can "connect the dots" to leverage more of the information provided by these studies could offer major advantages for maximizing predictive potential. We introduce a multi-view machine-learning strategy called PLATYPUS that builds "views" from multiple data sources that are all used as features for predicting patient outcomes. We show that a learning strategy that finds agreement across the views on unlabeled data increases the performance of the learning methods over any single view. We illustrate the power of the approach by deriving signatures for drug sensitivity in a large cancer cell line database. Code and additional information are available from the PLATYPUS website https://sysbiowiki.soe.ucsc.edu/platypus.
79
A SOFTWARE PIPELINE FOR DETERMINING FINE-SCALE TEMPORAL GENOME VARIATION PATTERNS IN EVOLVING POPULATIONS USING A NON-PARAMETRIC STATISTICAL TEST
Minjung Kwak1, Seokwoo Kang2, Dongwon Choo2, Dohyeon Lee2, Jinhee Lee2, Seonghyeon Kim2, Giltae Song2
1Yeungnam University, 2Pusan National University
Song, Giltae
Abnormal variations are frequent in clonal genome evolution of cancers. Such aberrant variations often function as a driver in cancer cell growth. The fundamental evolutionary dynamics underlying these variations in tumor metastasis are still understudied owing to their genetic complexity. Recently, whole genome sequencing has made it possible to determine genome variations in short-term evolution of cell populations. This approach has been applied to evolving populations of model organisms including yeast. It is substantial progress in evolutionary genomics to examine sequence changes at such fine-scale resolution. However, existing statistical tests for analyzing temporal changes in variation across multiple time points are limited in their ability to identify the full spectrum of intermediate changes. We designed a new statistical approach based on the Kolmogorov-Smirnov test and integrated it into a software tool for determining the variation patterns at fine-scale temporal resolution in experimental evolution studies. We validated our method using simulation data that mimic the evolution of fruit fly populations. We compared the results of our method with those of existing methods such as the Cochran-Mantel-Haenszel (CMH) test and the beta-binomial Gaussian process (BBGP) method. We analyzed yeast (Saccharomyces cerevisiae) W303 strain genomes from 40 populations at 12 time points using our software pipeline. Our toolset can also be applied to identify abnormal variation changes in other evolving populations.
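As background for the Kolmogorov-Smirnov test on which the approach is based, the sketch below computes the standard two-sample KS statistic, the largest gap between two empirical distribution functions. How the authors adapt the test to multiple time points is not described in the abstract, so the allele-frequency use case and the function name `ks_statistic` are assumptions for illustration only.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the empirical CDFs of the two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance past every observation equal to x in both samples so that
        # tied values are handled correctly.
        while i < n and a[i] == x:
            i += 1
        while j < m and b[j] == x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


# Hypothetical allele frequencies of one variant across replicate
# populations at an early and a late time point.
early = [0.02, 0.05, 0.04, 0.03, 0.06]
late = [0.40, 0.55, 0.35, 0.60, 0.50]
d_stat = ks_statistic(early, late)
```

Being non-parametric, the statistic makes no assumption about the shape of the frequency distributions; a value near 1 indicates that the two time points barely overlap, while a value near 0 indicates little change.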
80
A DEEP LEARNING APPROACH TO IDENTIFYING THE CELLULAR COMPOSITION OF SOLID TISSUE WITH DNA METHYLATION DATA
Meghan E. Muse1, Curtis L. Petersen1, Carmen J. Marsit2, Diane Gilbert-Diamond1, Brock C. Christensen1
1Dartmouth College, 2Emory University
Muse, Meghan
DNA methylation is involved in the establishment of cellular identity, and measured profiles of DNA methylation can be leveraged to deconvolute the underlying cellular composition of a tissue sample. Currently, both reference-based and reference-free methods exist to estimate the relative proportion of inferred cell types in solid tissue using DNA methylation data. However, establishing DNA methylation libraries for reference-based deconvolution in solid tissues is challenging, and use of reference-free approaches to estimate putative cell type proportions is computationally intensive, particularly as sample size increases. As observed patterns in DNA methylation can be most strongly explained by the relative proportion of cell types in a tissue sample, we investigated the utility of implementing an unsupervised variational autoencoder (VAE) approach to learn a defined number of latent dimensions in DNA methylation data and tested their relationship with inferred cell type proportions from a reference-free approach. We implement the Tybalt model developed by Way et al. to learn latent representations of DNA methylation data measured on the Illumina 450K array in 334 placental samples. We compare the results of this method to those from a well-established reference-free method for inferring the relative proportions of putative cell types. We considered models that learned 10 to 100 latent dimensions and selected the model in which the greatest number of putative cell types identified by the reference-free method had moderate correlation (r² > 0.5) with at least one latent dimension. This resulted in the selection of a model learning 10 latent dimensions. In this model, learned latent dimensions had moderate correlation with 5 of the 9 putative placental cell types identified by the reference-free method and strong correlation (r² > 0.7) with 2 putative placental cell types. To better understand the underlying biology represented by these latent dimensions, we assess the CpG loci most strongly correlated with the activations of these 5 latent dimensions as a means of identifying genes that are representative of cellular identity.
81
DIRECTLY MEASURING THE RATE AND DYNAMICS OF HUMAN MUTATION BY SEQUENCING LARGE, MULTI-GENERATIONAL PEDIGREES
Thomas A. Sasani, Brent S. Pedersen, Mark Leppert, Ray White, Lisa Baird, Aaron R. Quinlan, Lynn B. Jorde
Department of Human Genetics, University of Utah
Quinlan, Aaron
Developing an accurate estimate of the human germline mutation rate is critical to our understanding of evolution, demography, and genetic disease. Early phylogenetic analyses inferred mutation rates from the observed sequence divergence between humans and related primate species at particular genes and pseudogenes. However, as whole genome sequencing has become ubiquitous, these estimates have been refined using pedigree-based approaches. By identifying mutations present in offspring that are absent from their parents (de novo mutations), it is possible to more accurately approximate the human germline mutation rate. To obtain a precise, unbiased estimate of the mutation rate in humans, we performed deep whole-genome sequencing on blood-derived DNA from 34 of the original three-generation CEPH families from Utah, comprising a total of 604 individuals. These families, which each contain grandparents (P0 generation), parents (F1), and their children (F2), are considerably larger than any used in prior estimates of the human mutation rate, and offer unique power to detect and validate de novo mutation. With a median of 8 F2 individuals per pedigree, we were able to biologically validate putative de novo mutations in the F1 generation by assessing their transmission to a third generation. Using this dataset, we have generated a high-confidence estimate of the human mutation rate (1.31 × 10^-8 per bp per generation), observed a significant parental age effect on the rate of de novo mutation, and identified wide variability in family-specific age effects across CEPH pedigrees. To our knowledge, this study represents the first example of a longitudinal analysis of the effect of parental age within individual families. Additionally, we have identified recurrent de novo variants present in multiple F2 offspring, which are likely the result of mosaicism in the parental germline. Finally, we have trained a classification model on the high-quality, transmitted de novo variants in our dataset, and used this model to identify de novo mutations in a large cohort of children from the Simons Foundation Autism Research Initiative (SFARI). Combining the de novo mutations observed in 34 Utah families with the SFARI callset, we have generated a dense genomic map of spontaneous human mutation. We observe regional enrichment of de novo variation in the human genome, and explore the role of sequence context, as well as molecular processes like recombination and gene conversion, on the rate of human mutation.
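For readers unfamiliar with how a pedigree-based rate is expressed, the usual arithmetic divides the number of de novo mutations by the number of diploid base pairs surveyed. The sketch below assumes that convention; the counts and the callable genome size are hypothetical round numbers chosen only to show the scale of the calculation, not values from the study.

```python
def mutation_rate(n_de_novo, callable_bp_haploid, n_offspring):
    """Per-base-pair, per-generation rate from pedigree de novo counts.

    Each offspring carries two haploid genome copies, hence the factor of 2.
    """
    return n_de_novo / (2 * callable_bp_haploid * n_offspring)


# Hypothetical inputs: 8 offspring averaging 70 de novo mutations each,
# with 2.67 Gb of the haploid genome considered callable.
rate = mutation_rate(n_de_novo=70 * 8, callable_bp_haploid=2.67e9, n_offspring=8)
print(f"{rate:.2e} per bp per generation")
```

With these illustrative inputs the result falls close to 1.3 × 10^-8, the same order of magnitude as the estimate reported in the abstract.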
82
AVAILABLE PROTEIN 3D STRUCTURES DO NOT REFLECT HUMAN GENETIC AND FUNCTIONAL DIVERSITY
Gregory Sliwoski, Neel Patel, R. Michael Sivley, Charles R. Sanders, Jens Meiler, William S. Bush, John A. Capra
Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA; Center for Structural Biology, Vanderbilt University, Nashville, TN, USA; Institute for Computational Biology, Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, USA; Department of Biochemistry, Vanderbilt University, Nashville, TN, USA; Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA; Department of Chemistry, Vanderbilt University, Nashville, TN, USA; Institute for Computational Biology, Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, USA; Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA; Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, USA
Bush, William
Genomic databases and clinical trials are substantially biased towards European ancestry populations, and this bias significantly contributes to health disparities. Structural biology has an essential role in investigating protein function and clinical variant interpretation, providing powerful tools for investigating the impact of genetic variants on protein structure and function. However, studies that analyze the 3D structure of proteins typically consider a single canonical amino acid sequence as representative of the protein. Here, we evaluate the potential for this simplification to bias results toward different populations by evaluating how well 66,971 experimentally characterized human protein 3D structures represent the sequence diversity of the proteins they model. Thousands of protein structures have unrepresented alternative sequences commonly found in human populations, and African ancestry individuals' sequences are the least likely to be represented by available structures. Because sequence variability is often limited to a few positions within a protein, we evaluate the likelihood of these small changes to influence protein function. Combining existing annotations and computational modeling, we identify thousands of proteins for which use of a single structure as representative of "wild type" may bias results against certain populations or individuals. Variants segregating in human populations, but unrepresented in structures, are observed across functional sites involved in stability (134 disulfide bond cysteines), regulation (94 phosphorylation sites), DNA binding (322 residues), small molecule binding (1,463 residues with 362 within drug binding sites), and protein-protein interfaces (6,144 residues). We computationally model more than 700 unrepresented variants' effects on protein stability and protein-protein interaction. Changes in predicted protein stability are found for 28% (156) of the 556 variants, with stabilizing (41) and destabilizing (115) effects predicted. Of 161 protein-interface variants modeled, 25% (41) are predicted to impact protein-protein binding. These variants in human populations have potential to impact the study of their protein's structure and function. With the widespread use of protein structures in basic science and clinical variant interpretation, human protein sequence and structural diversity must be considered to enable accurate and reproducible conclusions from structural analyses.
83
SEMANTIC WORKFLOWS FOR BENCHMARK CHALLENGES: ENHANCING COMPARABILITY, REUSABILITY AND REPRODUCIBILITY
Arunima Srivastava1, Ravali Adusumilli2, Hunter Boyce2, Daniel Garijo3, Varun Ratnakar3, Rajiv Mayani3, Thomas Yu4, Raghu Machiraju1, Yolanda Gil3, Parag Mallick2
1The Ohio State University, 2Stanford University, 3University of Southern California, 4Sage Bionetworks
Srivastava, Arunima
Benchmark challenges, such as the Critical Assessment of Structure Prediction (CASP) and Dialogue for Reverse Engineering Assessments and Methods (DREAM), have been instrumental in driving the development of bioinformatics methods. Typically, challenges are posted, and then competitors perform a prediction based upon blinded test data. Challengers then submit their answers to a central server where they are scored. Recent efforts to automate these challenges have been enabled by systems in which challengers submit Docker containers, a unit of software that packages up code and all of its dependencies, to be run on the cloud. Despite their incredible value for providing an unbiased test-bed for the bioinformatics community, there remain opportunities to further enhance the potential impact of benchmark challenges. Specifically, current approaches only evaluate end-to-end performance; it is nearly impossible to directly compare methodologies or parameters. Furthermore, the scientific community cannot easily reuse challengers' approaches, due to lack of specifics, ambiguity in tools and parameters as well as problems in sharing and maintenance. Lastly, the intuition behind why particular steps are used is not captured, as the proposed workflows are not explicitly defined, making it cumbersome to understand the flow and utilization of data. Here we introduce an approach to overcome these limitations based upon the WINGS semantic workflow system. Specifically, WINGS enables researchers to submit complete semantic workflows as challenge submissions. By submitting entries as workflows, it then becomes possible to compare not just the results and performance of a challenger, but also the methodology employed. This is particularly important when dozens of challenge entries may use nearly identical tools, but with only subtle changes in parameters (and radical differences in results). WINGS uses a component driven workflow design and offers intelligent parameter and data selection by reasoning about data characteristics. This proves to be especially critical in bioinformatics workflows where using default or incorrect parameter values is prone to drastically altering results. Different challenge entries may be readily compared through the use of abstract workflows, which also facilitate reuse. WINGS is housed on a cloud based setup, which stores data, dependencies and workflows for easy sharing and utility. It also has the ability to scale workflow executions using distributed computing through the Pegasus workflow execution system. We demonstrate the application of this architecture to the DREAM proteogenomic challenge.
84
PRECISION MEDICINE: IMPROVING HEALTH THROUGH HIGH-RESOLUTION ANALYSIS OF PERSONAL DATA
POSTER PRESENTATIONS
85
CLASS PRIOR ESTIMATION AND QUANTIFICATION OF THE LOSS AND GAIN OF RESIDUE FUNCTION UPON MUTATION
Shantanu Jain1, Jose Lugo-Martinez2, Martha White3, Michael W. Trosset4, Predrag Radivojac1
1Northeastern University, 2Carnegie-Mellon University, 3University of Alberta, 4Indiana University
Jain, Shantanu
Standard algorithms for binary classification assume access to labeled data from both the positive and the negative class. However, in many biological problems, labeled examples from one of the classes (say, negatives) are not available. In this scenario, a positive-unlabeled learner, which relies on positive and unlabeled examples only, is used. Surprisingly, this strategy leads to an optimal score function. However, picking an optimal threshold to construct the final classifier requires knowledge of the class priors, the proportions of positives and negatives in the unlabeled data. I will 1) present a nonparametric algorithm for estimation of the class priors based on a mixture model formulation, 2) elucidate the assumptions necessary for the algorithm, and 3) derive a class prior preserving univariate transform for dimensionality reduction and thereby obtain a practical algorithm for multivariate data. Moreover, I will also demonstrate how the posterior can be estimated using the estimate of the class priors. I will further extend these results to a more general setting where some of the examples labeled as positive are in fact negative. I will present experimental results demonstrating the efficacy of our algorithm, comparing it with state-of-the-art methods and other baseline methods on many real and synthetic data sets. Lastly, I will present a biological application of this work to establish the loss and gain of residue function as a common mechanism for inherited diseases.
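To make the link between class prior and posterior concrete, here is a small sketch of one standard density-ratio identity, offered as an illustration rather than as the algorithm presented in the talk. It assumes the labeled examples are an unbiased sample of the positive class, that `tau` is the output of a classifier trained to separate labeled from unlabeled examples, and that `alpha` (the proportion of positives among the unlabeled data) has already been estimated. All names are hypothetical.

```python
def positive_posterior(tau, alpha, n_labeled, n_unlabeled):
    """Estimate P(positive | x) for an unlabeled point x.

    tau:   P(labeled | x) from a labeled-vs-unlabeled classifier
    alpha: class prior, the fraction of positives in the unlabeled set

    Derivation: tau / (1 - tau) = (n_labeled * f1(x)) / (n_unlabeled * f(x)),
    where f1 is the positive density and f the unlabeled density, and the
    posterior of interest is alpha * f1(x) / f(x).
    """
    if not 0.0 <= tau < 1.0:
        raise ValueError("tau must lie in [0, 1)")
    density_ratio = (tau / (1.0 - tau)) * (n_unlabeled / n_labeled)
    # Estimation noise can push the product outside [0, 1], so clip it.
    return min(1.0, max(0.0, alpha * density_ratio))


# Equal sample sizes and an uninformative score recover the prior itself.
p = positive_posterior(tau=0.5, alpha=0.3, n_labeled=100, n_unlabeled=100)
```

Thresholding this estimate at 0.5 is one way the class prior enters the final classifier; the score function alone only ranks examples.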
86
PREDICTION OF TIME TO INSULIN USING CLINICAL AND GENETIC BIOMARKERS IN TYPE 2 DIABETES PATIENTS
Rikke Linnemann Nielsen1, Louise Donnelly2, Agnes Martine Nielsen3, Konstantinos Tsirigos1, Kaixin Zhou2, Bjarne Ersboell3, Line Clemmensen3, Ewan Pearson2, Ramneek Gupta1
1Department of Bio and Health Informatics, Technical University of Denmark; 2Medical Research Institute, University of Dundee, United Kingdom; 3Department of Applied Mathematics and Computer Science, Technical University of Denmark
Nielsen, Rikke Linnemann
Type II diabetes (T2D) is a complex metabolic disorder where the risk of a fast or slow disease progression is highly dependent on each individual. Therefore, it is useful to identify predictive biomarkers for diabetes progression and relevant patient subgroup characteristics that may assist clinical decisions in T2D treatment management. In this study, we obtained electronic medical records from a cohort-based population in Tayside, UK registered from December 1994 to September 2015. Using life-style data, anthropometry, biochemical data, drug-prescription data and genetic features from electronic medical records on 6871 T2D patients, artificial neural network models (ANN) were trained with two-layer cross-validation to classify T2D patients’ progression given as patients’ time to insulin (TTI). TTI was defined as the first day of insulin treatment or as the clinical need for insulin (HbA1c > 8.5% treated with two or more non-insulin diabetes therapies). Prediction targets were TTI within year 1, 3 or 5 from the time of diagnosis. Genetic variants were selected by prior knowledge on T2D and glycemic trait predisposition SNPs from ~80M imputed SNPs. Prediction models were investigated for understanding which biomarkers were most predictive of progression. ANNs with all data except genetic variants predicted TTI for year 1 (0.92±0.02, 0.83±0.04, 0.86±0.04 for AUC, sensitivity and specificity, respectively), year 3 (0.82±0.03, 0.71±0.05, 0.78±0.04) and year 5 (0.78±0.02, 0.66±0.02, 0.76±0.02). The most important features included HbA1c, GAD antibody concentration and the type of diabetes therapy patients were receiving at the time of confirmed diagnosis. Integration of genetic variants, using a forward selection strategy, resulted in a slightly improved performance in all three models: year 1 (0.94±0.01, 0.83±0.03, 0.90±0.01), year 3 (0.85±0.02, 0.72±0.05, 0.80±0.02), and year 5 (0.80±0.03, 0.68±0.04, 0.78±0.02). We are currently examining the robustness of the selected SNPs by building an ensemble of multiple models with different features and investigating whether the genetic features are relevant to specific patient subgroups, as well as carrying out further longitudinal work with the phenotype to include more information about a given patient using longitudinal patient information across irregularly sampled time points.
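The three figures reported per model (AUC, sensitivity, specificity) can be computed from predicted scores and true labels as sketched below. This is a generic illustration with made-up scores; the 0.5 decision threshold and the function names are assumptions, and the abstract does not state how its own threshold was chosen.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(scores, labels, threshold=0.5):
    """Sensitivity (true positive rate) and specificity (true negative rate)."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predictions: label 1 means insulin was needed within the window.
scores = [0.9, 0.8, 0.4, 0.35, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
model_auc = auc(scores, labels)
sens, spec = sensitivity_specificity(scores, labels)
```

In a two-layer cross-validation these quantities would be computed on each outer test fold and then summarized as mean ± standard deviation, which is the form the abstract reports.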
87
PATHOGENICITY AND FUNCTIONAL IMPACT OF INSERTION/DELETION AND STOP GAIN VARIATION IN THE HUMAN GENOME
Kymberleigh A. Pagel1, Danny Antaki2, Matthew Mort3, David N. Cooper3, Jonathan Sebat2, Lilia M. Iakoucheva2, Sean D. Mooney4, Predrag Radivojac5
1Indiana University, 2University of California San Diego, 3Cardiff University, 4University of Washington, 5Northeastern University
Radivojac, Predrag
An individual human exome may contain hundreds of protein-coding insertion/deletions (indels) and dozens of protein truncating variants. Accurate differentiation between phenotypically neutral and disease-causing genetic variation remains an open problem, particularly among the excess of indel variants brought about by recent developments in sequencing technologies. Indel and protein truncating variants exhibit diverse impact on protein sequence, from a single residue to the deletion of entire functional domains. We present machine learning methods to predict the pathogenicity and the types of functional residues impacted by loss-of-function and indel variation. The models show good predictive performance and the potential to identify effects upon residues predicted to affect structural and functional features, including secondary structure, intrinsic disorder, metal and macromolecular binding, post-translational modifications, and catalytic residues. We identify structural and functional mechanisms that are impacted preferentially by germline variation from the Human Gene Mutation Database, recurrent somatic variation in COSMIC, and de novo variation from individuals with neurodevelopmental disorders. Collectively, the pathogenicity prediction and predicted functional effects provide a framework to facilitate the interrogation of indel and protein truncating variants.
88
DETECTING POTENTIAL PLEIOTROPY ACROSS CARDIOVASCULAR AND NEUROLOGICAL DISEASES USING UNIVARIATE, BIVARIATE, AND MULTIVARIATE METHODS ON 43,870 INDIVIDUALS FROM THE EMERGE NETWORK
Xinyuan Zhang1, Yogasudha Veturi1, Shefali S. Verma1, William Bone1, Anurag Verma1, Anastasia M. Lucas1, Scott Hebbring2, Joshua C. Denny3, Ian Stanaway4, Gail P. Jarvik4, David Crosslin4, Eric B. Larson5, Laura Rasmussen-Torvik6, Sarah A. Pendergrass7, Jordan W. Smoller8, Hakon Hakonarson9, Patrick Sleiman9, Chunhua Weng10, David Fasel10, Wei-Qi Wei3, Iftikhar Kullo11, Daniel Schaid11, Wendy K. Chung10, Marylyn D. Ritchie1
1University of Pennsylvania, 2Marshfield Clinic, 3Vanderbilt University, 4University of Washington, 5Kaiser Permanente Washington Health Research Institute, 6Northwestern University, 7Geisinger Health System, 8Massachusetts General Hospital, 9Children's Hospital of Philadelphia, 10Columbia University, 11Mayo Clinic
Zhang, Xinyuan
The link between cardiovascular diseases and neurological disorders has been widely observed in the aging population. Disease prevention and treatment rely on understanding the potential genetic nexus of multiple diseases in these categories. In this study, we were interested in detecting pleiotropy, the phenomenon in which a genetic variant influences more than one phenotype. Marker-phenotype association approaches can be grouped into univariate, bivariate, and multivariate categories based on the number of phenotypes considered at one time. Here we applied one statistical method per category followed by an eQTL colocalization analysis to identify potential pleiotropic variants that contribute to the link between cardiovascular and neurological diseases. We performed our analyses on ~530,000 common SNPs coupled with 65 electronic health record (EHR)-based phenotypes in 43,870 unrelated European adults from the Electronic Medical Records and Genomics (eMERGE) network. There were 31 variants identified by all three methods that showed significant associations across late-onset cardiac and neurologic diseases. We further investigated functional implications of gene expression on the detected "lead SNPs" via colocalization analysis, providing a deeper understanding of the discovered associations. In summary, we present the framework and landscape for detecting potential pleiotropy using univariate, bivariate, multivariate, and colocalization methods. Further exploration of these potentially pleiotropic genetic variants will work toward understanding disease-causing mechanisms across cardiovascular and neurological diseases and may assist in considering disease prevention as well as drug repositioning in future research.
89
PHARMGKB: THE API AND INFOBUTTONS
Michelle Whirl-Carrillo1, Ryan M. Whaley1, Mark Woon1, Russ B. Altman2, Teri E. Klein3
1Department of Biomedical Data Science, Stanford University; 2Department of Bioengineering, Medicine and Genetics, Stanford University; 3Department of Biomedical Data Science and Medicine, Stanford University
Alena, Orlenko
PharmGKB is the largest publicly available resource for pharmacogenomics (PGx) discovery and implementation. Its mission is to collect, curate, integrate and disseminate knowledge about how human genetic variation influences drug response. PharmGKB knowledge is defined by a data model, stored in a database, and accessed through the Application Programming Interface (API). The API supplies data to the www.pharmgkb.org website, which is the most common way for people to query and view the knowledge content of PharmGKB. Additionally, the PharmGKB API supports the InfoButton specification, which is used on the ClinGen website as well as by others in their EHR systems. The Infobutton Implementation Guide provides a standard mechanism for EHR systems to submit knowledge requests to knowledge resources over the HTTP protocol for point-of-care decision support. PharmGKB provides this as part of its standard API, using RXCUIs (RxNorm concept unique identifiers) and normalization of drug names, and returns HTML, with plans to support JSON and XML (https://api.pharmgkb.org/infobutton.html). For InfoButtons, the EHR displays a button for the user to click that will query PharmGKB and display information directly in the EHR application. Inside the application, a list of drug identifiers (RXCUIs) is created and then submitted to the InfoButton service’s URL. The service then returns a report in HTML that is displayed to the EHR user directly in the interface. The PharmGKB InfoButton implementation displays dosing guideline annotations, drug label annotations, and top-level clinical annotations that are relevant to the drug identifiers provided by the user. We monitor the API request logs to assess usage.
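As a rough illustration of what an Infobutton-style knowledge request looks like on the wire, the sketch below builds a URL carrying one RxNorm identifier. The parameter names follow the HL7 URL-based Infobutton convention for the main search criterion; the base URL is a hypothetical placeholder, not the PharmGKB endpoint, and the way a service expects multiple identifiers to be encoded is left to its own documentation.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

RXNORM_OID = "2.16.840.1.113883.6.88"  # code system identifier for RxNorm
BASE_URL = "https://example.org/infobutton"  # hypothetical placeholder endpoint


def build_infobutton_url(base_url, rxcui):
    """Build a URL-based knowledge request for a single RxNorm concept."""
    params = {
        "mainSearchCriteria.v.c": str(rxcui),    # the drug code itself
        "mainSearchCriteria.v.cs": RXNORM_OID,   # which code system it is from
    }
    return base_url + "?" + urlencode(params)


# 11289 is used purely as an illustrative RXCUI.
url = build_infobutton_url(BASE_URL, 11289)
query = parse_qs(urlsplit(url).query)
```

An EHR would issue an HTTP GET against such a URL and render the returned HTML report inside its own interface, which is the interaction the abstract describes.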
90
SINGLE CELL ANALYSIS – WHAT IS IN THE FUTURE?
POSTER PRESENTATIONS
91
INTRATUMOR HETEROGENEITY (ITH) METRIC OF CIRCULATING TUMOR CELL (CTC)-DERIVED XENOGRAFT MODELS IN SMALL CELL LUNG CANCER.
Yuanxin Xi1, C. Allison Stewart2, Carl M. Gay2, Hai Tran2, Bonnie Glisson2, John V. Heymach2, Paul Robson3, Lauren A. Byers2, Jing Wang1
1Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA; 2Department of Thoracic/Head & Neck Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA; 3The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
Xi, Yuanxin
Small cell lung cancer (SCLC) is an aggressive malignancy characterized by rapid onset of platinum-resistance. Once considered a homogeneous disease, recent analyses of SCLC have shown intra-tumoral heterogeneity (ITH) associated with treatment-resistance. To further investigate the contribution of ITH to clinical outcomes in SCLC, we profiled single-cell RNAseq expression of circulating tumor cell (CTC)-derived xenograft (CDX) models from SCLC patients that recapitulate patient tumor genomics and response to platinum chemotherapy. Characterizing the heterogeneity of tumor cell subpopulations remains a bioinformatics challenge in analyzing single-cell RNAseq data for CTC-derived CDX models, mostly due to the lack of an accurate method to quantify the complexity of tumor cell expression patterns at single cell resolution and discover the correlations with different tumor development or treatment response mechanisms. In this study, we developed a variance-based metric to measure the overall heterogeneity of tumor cell populations based on single cell RNAseq expression profiles. We applied this metric to the Chromium 10x single cell RNAseq data of 4 SCLC CDX models that have different platinum treatment responses, identified a global increase of intra-tumor heterogeneity in platinum-resistant models compared with platinum-sensitive models, and defined variable gene expression as a reliable hallmark of increasing therapeutic resistance in SCLC. Further gene set enrichment analysis (GSEA) of the treatment naïve and relapsed samples revealed that the increased ITH metric was associated with multiple concurrent resistance mechanisms, suggesting that resistance to molecularly targeted therapies does not follow a predictable, reproducible pathway within the same CDX model. These results showed that the variance-based ITH metric successfully characterized the resistance-associated heterogeneity increases in SCLC tumor cells, and more broadly, it provides a general purpose quantitative measurement of tumor cell subpopulation heterogeneity in single cell analysis.
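The abstract does not give the formula behind its variance-based metric, so the following is only one plausible reading, stated as an assumption: average the per-gene variance of (already normalized) expression across the cells of a sample. The function name and toy matrices are hypothetical.

```python
from statistics import mean, pvariance


def ith_score(expression):
    """Mean per-gene variance across cells (an assumed form of the metric).

    expression: list of cells, each a list of normalized expression values,
                with genes in the same order for every cell.
    """
    per_gene = zip(*expression)  # transpose: one tuple of values per gene
    return mean(pvariance(values) for values in per_gene)


# Three cells by two genes (toy values).
homogeneous = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
heterogeneous = [[0.0, 2.0], [2.0, 2.0], [1.0, 5.0]]

low = ith_score(homogeneous)
high = ith_score(heterogeneous)
```

Under this reading, a platinum-resistant sample would be expected to score higher than a platinum-sensitive one. A real analysis would also need to control for sequencing depth and the number of cells, which this sketch ignores.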
92
WHEN BIOLOGY GETS PERSONAL: HIDDEN CHALLENGES OF PRIVACY AND ETHICS IN BIOLOGICAL BIG DATA
POSTER PRESENTATIONS
93
QUANTIFYING THE IDENTIFIABILITY OF INDIVIDUALS USING A SPARSE SET OF SNPS
Prashant S. Emani, Gamze Gursoy, Mark B. Gerstein
Department of Molecular Biophysics and Biochemistry, Yale University
Emani, Prashant
The recent revolution in high-throughput genomics has led to the proliferation of publicly available datasets and databases enabling queries on individual genotypes, whether in the form of reference genotypes, single nucleotide polymorphism (SNP) "beacons" or functional genomics data with significant identifying-information leakage. It is therefore of interest to quantify the power of a sparse set of SNPs to reveal the identity of an individual, as this would help determine the privacy risks of making particular datasets accessible to the research community. Such an evaluation would enable a principled cost-benefit analysis to determine the right balance of public and private data accessibility. We present a tool for such quantification based on well-established Hidden Markov Models (HMMs) of chromosomal recombination (Li and Stephens, 2003): the central idea is to explore the state space of reference haplotypes from a database, and find the trajectory through this space that best describes observed genotypes. The tool enables simple SNP-based kinship analysis by the identification of queried individuals as a "mosaic", or piecewise combination, of the input reference haplotypes, while allowing for genotyping error and de novo mutation. The output includes the best-fit reference haplotype trajectories, which for a small set of input SNPs, could result in several equal-probability possibilities. However, even in this case, inferences could be made on the membership of an individual in certain haplotype communities based on their enrichment within the best-fit trajectories. This approach parallels linkage disequilibrium- (LD-) based methods, but avoids any assumptions of population homogeneity as it does not require explicit calculation of allele frequencies or SNP correlations. It is, of course, dependent on the availability of a sufficiently rich database to ensure that the queried individual is at least related. This limitation is fast becoming a non-issue, however, with the constant expansion of population-level genotype databases. The results of representative simulations using the 1000 Genomes reference dataset with randomly chosen common SNPs (allele frequency > 0.05) from a single chromosome are: searching for a genotyped individual among 100 phased genotypes (= 200 reference haplotypes) yielded accurate discovery with as few as 12 SNPs; including a mutation rate of 0.1–0.2 increased the number of SNPs required for reliable identification to ~25; simulations of mosaic samples composed of two reference individuals, each contributing half of the SNPs, suggested that ~30 SNPs could be sufficient to identify the two constituent individuals. These numbers would likely be improved upon when all chromosomes are combined. In summary, we provide a tool that can serve to identify observed genotypes either known to be members of a database, or related to individuals within the database, under varying conditions of mutation and recombination rates with no assumptions about the population-specific allele frequencies of SNPs.
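The "best trajectory through reference haplotypes" idea can be illustrated with a small Viterbi decoder in the spirit of the Li and Stephens copying model. This is a simplified sketch, not the tool described above: it decodes a single haploid query rather than an unphased genotype, and the switch and error probabilities are arbitrary assumed values.

```python
from math import log


def viterbi_mosaic(query, haplotypes, switch=0.05, error=0.01):
    """Most likely sequence of reference haplotype indices copied by `query`.

    query:      list of 0/1 alleles at each SNP
    haplotypes: list of reference haplotypes (lists of 0/1 alleles)
    switch:     per-site probability of a recombination event (assumed)
    error:      per-site probability of a mismatch from error or mutation
    """
    k = len(haplotypes)
    stay = log(1 - switch + switch / k)  # remain on the same haplotype
    jump = log(switch / k)               # move to one specific other haplotype

    def emit(h, i):
        return log(1 - error) if haplotypes[h][i] == query[i] else log(error)

    score = [log(1 / k) + emit(h, 0) for h in range(k)]
    back = []
    for i in range(1, len(query)):
        best_prev = max(range(k), key=lambda h: score[h])
        new_score, pointers = [], []
        for h in range(k):
            # Either stay on h or jump in from the best-scoring previous state.
            if score[h] + stay >= score[best_prev] + jump:
                prev, s = h, score[h] + stay
            else:
                prev, s = best_prev, score[best_prev] + jump
            new_score.append(s + emit(h, i))
            pointers.append(prev)
        score = new_score
        back.append(pointers)

    # Trace the best final state back to the first SNP.
    state = max(range(k), key=lambda h: score[h])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


reference = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
]
mosaic_query = [1, 1, 1, 1, 0, 0, 0, 0]  # first half copies hap 1, second half hap 0
mosaic_path = viterbi_mosaic(mosaic_query, reference)
```

With very few SNPs, several paths can tie in probability, which is the ambiguity the abstract notes; a fuller implementation would report all co-optimal trajectories rather than a single one.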
94
TRANSCRIPTOMIC SUMMARY SPLICING DATA MAY LEAK PERSONAL PRIVATE INFORMATION BY COMPUTATIONAL LINKAGE TO THE GENOMIC VARIANTS
Zhiqiang Hu1, Mark B. Gerstein2, Steven E. Brenner1
1University of California, Berkeley, 2Yale University
Brenner, Steven
Sharing genomes without personal identifiers has been common practice in biological and medical research. However, recent studies revealed the risk of re-identifying people from their genomes, or attached quasi-identifiers, such as sex, birth date, and zip code. Moreover, consumer databases now contain genetic data for millions of individuals; a recent study suggested that most Americans have detectable family relationships in these databases, allowing their identification using demographic identifiers. The additional availability of an individual’s RNA-seq data has implications for privacy, as it may be linked to the genome, potentially allowing the person’s privacy to be breached. For example, sex and ethnicity information may be inferred directly from a genome, and the study may provide a zip code. This genome could be linked to RNA-seq data from a diabetes study with attached birth dates and income. These combined quasi-identifiers may uniquely identify the person, and the study reveals the person’s diabetes disease status.
RNA-seq reads contain genetic variants, and thus can be directly linked to the genome. To avoid this risk, some researchers now release gene expression, isoform expression and exon read count data instead of the raw sequencing reads.
However, gene expression can also be linked to the genome based on expression QTLs (eQTLs). Using a Bayesian framework, we found that it is feasible to predict genomic variants from summarized splicing data. Based on GTEx splicing QTLs (sQTL) data, using relative isoform expression from 15 genes, we could identify the target genome within a pool containing hundreds of individuals with >90% accuracy. We could also link RNA-seq data from a certain tissue or cell type to the genome using parameters trained from a similar tissue, indicating parameters trained on major tissues may enable the linkage of RNA-seq from all types of human samples to the genome. By quantitatively measuring the information leakage from each sQTL, we found that it is possible to identify the target genome of an RNA-seq dataset from millions of individuals using more sQTLs. Researchers have proposed to eliminate the risk of eQTL-based linking attacks by adding noise to the gene expressions, based on the observation that only a few genes enable linkage. However, our framework suggested that there are now many more such genes than previously reported. We find that expression data enables the re-identification of target genome from a pool containing billions of genomes. Our result implies that mitigation of the linking risk by adding noise would severely abrogate biological entity of the data, since the data will no longer be biologically meaningful when over half of gene expressions are modified. Our study also implies that other kinds of “omic” data, including DNA modification and protein metabolite levels, may also leak genome privacy.
95
WORKSHOP: MERGING HETEROGENEOUS DATA TO ENABLE KNOWLEDGE DISCOVERY
POSTER PRESENTATION
96
TO SEARCH A HETNET... HOW ARE TWO NODES CONNECTED?
Daniel Himmelstein1, Michael Zietz1, Kyle Kloster2, Michael Nagle3, Blair Sullivan2, Casey S. Greene1
1University of Pennsylvania, 2North Carolina State University, 3Pfizer Inc.
Himmelstein, Daniel
Networks with multiple node and relationship types, called hetnets, provide an ideal data structure to integrate biomedical knowledge. One example, Hetionet, has 47 thousand nodes of 11 types and 2.25 million relationships of 24 types covering diseases, small molecule drugs, and the entities in between, which range from molecular (e.g. genes & pathways) to organismal (e.g. side effects & symptoms). We are building a search engine for hetnet connectivity on the Hetionet network. We want to provide users with an immediate answer to the question, "how are these two nodes connected?" We approach this problem by identifying types of paths where a source and target node are connected more than expected by chance (i.e. based on their degrees alone). While still a work in progress on GitHub (https://github.com/greenelab/hetmech), the project is nearing a prototype web application. Reaching this stage required several methodological advances. First, we implemented efficient path counting algorithms in Python based on matrix multiplication. A new HetMat data structure provides efficient on-disk storage of hetnets, optimized for matrix operations and caching. We designed a novel gamma-hurdle method for assessing the null distribution of a degree-weighted path count (DWPC) for a given pair of source-target node degrees. Using these techniques, we computed measures of connectivity between all node-pairs for the 2,205 types of paths (metapaths) with length ≤ 3 in Hetionet v1.0 (available at https://doi.org/cww7). Now, we aim to expose the hidden information these measures capture: namely, how are two nodes related in terms of metapaths, individual paths, and intermediate nodes. Stop by our poster to learn more and discuss how this search engine can help you peruse biomedical knowledge or interpret your computational predictions.
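Path counting by matrix multiplication, and its degree-weighted variant, can be shown on a toy hetnet. The sketch below is written from the general definition of a degree-weighted path count (each edge is down-weighted by its endpoint degrees raised to a damping exponent) and is not taken from the hetmech code base; the damping value of 0.5, the function names, and the tiny matrices are assumptions. It also uses a metapath whose node types are all distinct, since metapaths that revisit a node type need extra care to exclude paths with repeated nodes.

```python
def degree_weight(adj, damping=0.5):
    """Scale each edge by (source degree * target degree) ** -damping."""
    row_deg = [sum(row) for row in adj]
    col_deg = [sum(col) for col in zip(*adj)]
    return [
        [
            a * row_deg[i] ** -damping * col_deg[j] ** -damping if a else 0.0
            for j, a in enumerate(row)
        ]
        for i, row in enumerate(adj)
    ]


def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def dwpc(adjacencies, damping=0.5):
    """Degree-weighted path counts along a metapath of adjacency matrices."""
    result = degree_weight(adjacencies[0], damping)
    for adj in adjacencies[1:]:
        result = matmul(result, degree_weight(adj, damping))
    return result


# Toy metapath Compound-Gene-Disease: 2 compounds, 2 genes, 1 disease.
compound_gene = [[1, 1], [0, 1]]
gene_disease = [[1], [1]]

path_counts = dwpc([compound_gene, gene_disease], damping=0)  # plain counts
weighted = dwpc([compound_gene, gene_disease], damping=0.5)
```

Setting the damping exponent to zero recovers raw path counts, which makes the effect of the degree weighting easy to inspect on small examples.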
97
WORKSHOP: TEXT MINING AND MACHINE LEARNING FOR PRECISION MEDICINE
POSTER PRESENTATION
98
LITVAR: MINING GENOMIC VARIANTS FROM BIOMEDICAL LITERATURE FOR DATABASE CURATION AND PRECISION MEDICINE
Alexis Allot, Yifan Peng, Chih-Hsuan Wei, Kyubum Lee, Lon Phan, Zhiyong Lu
National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894
Lu, Zhiyong
The identification and interpretation of genomic variants play a key role in the diagnosis of genetic diseases and related research in the era of precision medicine. To stay up to date, researchers must process an ever-increasing number of new publications. This task is complicated by two factors. First, authors use multiple abbreviations to refer to the same variant. For example, "A146T", "c.436G>A", and "Ala146Thr" all refer to the same variant, rs121913527. Second, the same abbreviation (e.g., p.Ala94Thr) can refer to different variants in different genes. A simple search on PubMed would thus return only a subset of all relevant articles for the variant of interest, while returning many articles that are irrelevant.
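The first difficulty, several spellings for one protein change, can be illustrated with a small normalizer that maps three-letter and one-letter substitution names onto a single short form. This is only a toy sketch of the naming problem, not LitVar's entity recognition: the function name is hypothetical, it handles simple substitutions only, and it cannot resolve DNA-level names such as "c.436G>A", which require transcript information.

```python
import re

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_LETTER = set(THREE_TO_ONE.values())

# Optional "p." prefix, reference residue, position, alternate residue.
SUBSTITUTION = re.compile(r"^(?:p\.)?([A-Za-z]{1,3})(\d+)([A-Za-z]{1,3})$")


def to_one_letter(code):
    """Convert a one- or three-letter amino acid code, or return None."""
    if len(code) == 1:
        return code.upper() if code.upper() in ONE_LETTER else None
    return THREE_TO_ONE.get(code.capitalize())


def normalize_protein_change(text):
    """Return a short form such as 'A146T', or None if not a substitution."""
    match = SUBSTITUTION.match(text.strip())
    if not match:
        return None
    ref, position, alt = match.groups()
    ref_1, alt_1 = to_one_letter(ref), to_one_letter(alt)
    if ref_1 is None or alt_1 is None:
        return None
    return f"{ref_1}{position}{alt_1}"


forms = [normalize_protein_change(s) for s in ("A146T", "Ala146Thr", "p.Ala146Thr")]
```

String normalization alone does not address the second difficulty: the same short form can occur in different genes, so a gene context is still needed to pin down a unique variant.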
To help scientists, healthcare professionals, and database curators find the most up-to-date published variant research, we have developed LitVar, a novel web server for linking genomic variant data in the literature with an intuitive UI [1]. Specifically, it employs a suite of state-of-the-art entity recognition tools as its backend processing method. LitVar combines robust and advanced text mining with data integrations from PubMed (>28 million abstracts) and PubMed Central Subset (>2.7 million full-length articles) to improve both sensitivity and specificity. As of May 2018, there are more than 2 million unique variants in our system, associated with hundreds of thousands of publications from PubMed and PMC Open Access Subset. Compared with PubMed, LitVar achieves higher sensitivity and specificity. For example, a search for "rs113488022" finds no results in PubMed, but over 6,000 articles are returned by LitVar. On the other hand, a search for "H199R" on PubMed will return articles where this variant appears both in the gene LIN28B (PMID: 22964795) and in CFTR (PMID: 15084222), while the disambiguation process of LitVar will allow the user to select precisely the variant (and gene) of interest.
To further assist users, LitVar allows matching publications to be filtered by journal, type, date or part of publication. Moreover, publications' popularity over time can be visualised as a zoomable histogram. In addition to the website, LitVar provides REST APIs to allow users to disambiguate a textual query into a list of top matching variants, or to perform large-scale analysis by retrieving publications linked to hundreds of supplied dbSNP identifiers in one query.
LitVar is now integrated in dbSNP. The newly added link allows users not only to view more publications than with the link to PubMed, but also to assess the context (sentence and related diseases, chemicals and other variants) in which the variant appears in each publication.
LitVar is publicly available at https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/LitVar.
[1] Allot, A., Peng, Y., Wei, C.H., Lee, K., Phan, L. and Lu, Z. (2018) LitVar: a semantic search engine for linking genomic variant data in PubMed and PMC. Nucleic Acids Res.
99
AUTHOR INDEX
A
Abyzov, Alexej · 65
Adusumilli, Ravali · 11, 83
Alkan, Can · 67
Allot, Alexis · 98
Altman, Russ B. · 89
Anand, Shankara · 31
Andrechek, Eran R. · 60
Antaki, Danny · 87
Ausavarungnirun, Rachata · 67
Azizi, Shekoofeh · 34
B
Bae, Ho · 32
Baird, Lisa · 81
Baldwin, Edwin · 53
Beam, Andrew L. · 33
Beaulieu-Jones, Brett K. · 33
Bedi, Rishi · 45
Benchek, Penny · 73
Berger, Bonnie · 29
Berghout, Joanne · 20
Best, Aaron · 7
Bielinski, Suzette J. · 46
Bingöl, Zülal · 67
Black III, John Logan · 46
Bobak, Carly · 47
Bobe, Jason R. · 28
Boerwinkle, Eric · 46
Boland, Mary Regina · 3, 77
Bone, William · 21, 88
Boussard, Soline M. · 48
Boyce, Hunter · 11, 83
Bradford, Yuki · 39
BrainSeq Consortium · 49
Brenner, Steven E. · 94
Brugler, Mercer R. · 74
Burke, Emily E. · 49
Bush, William S. · 73, 82
Byers, Lauren A. · 91
C
Capra, John A. · 82
Carter, Hannah · 10, 36, 70
Castorino, John · 62
Chen, Bin · 17, 60
Chen, Rachel · 7, 27
Chen, Yang · 23
Chen, Yong · 3, 77
Cheng, Li-Fang · 14
Choo, Dongwon · 63, 79
Chow, Cheryl-Emiliane · 64
Chrisman, Brianna Sierra · 19
Christensen, Brock C. · 47, 80
Chung, Wendy K. · 21, 88
Clemmensen, Line · 86
Cohen, William W. · 37
Collado-Torres, Leonardo · 49
Conway, Kathleen · 69
Cooper, Bruce · 54
Cooper, David N. · 87
Coukos, George · 10
Crosslin, David · 21, 88
Cule, Madeleine · 15
D
Dabbagh, Karim · 64
De Freitas, Jessica K. · 28
De, Supriyo · 50
Deep-Soboslay, Amy · 49
DeJongh, Matthew · 7
Denny, Joshua C. · 21, 88
DePristo, Mark · 15
DeSantis, Todd · 64
Ding, Daisy Yi · 2
Dinu, Valentin · 59
Doerr, Megan · 43
Donnelly, Louise · 86
Dow, Michelle · 36
Draghici, Sorin · 51
Duan, Rui · 3, 77
Dudley, Joel T. · 28
100
E
Edmiston, Sharon N. · 69
Emani, Prashant S. · 93
Engelhardt, Barbara E. · 14
Ersboell, Bjarne · 86
F
Fan, Jungwei · 20
Fasel, David · 21, 88
Feinberg, Nana Nikolaishvili · 69
Fondran, Jeremy · 73
Fong, Lon W. · 75
Fornes, Oriol · 71
Fraenkel, Ernest · 41
Francavilla, C. · 72
Friedl, Verena · 5, 78
Friend, Derek · 27
Furukawa, Tetsu · 52
G
Garijo, Daniel · 11, 83
Gasdaska, Angela · 27
Gay, Carl M. · 91
Genolet, Raphael · 10
Gerstein, Mark B. · 93, 94
Gfeller, David · 10
Ghose, Saugata · 67
Ghosh, Debashis · 66
Gibbs, Richard A. · 46
Gil, Yolanda · 11, 83
Gilbert-Diamond, Diane · 80
Glanville, Jacob · 45
Glicksberg, Benjamin S. · 28, 60
Glisson, Bonnie · 91
Gold, Maxwell P. · 41
Gonzalez-Hernandez, Graciela · 9
Gordon, Max · 4
Gorospe, Myriam · 50
Graham, Kareem · 64
Graim, Kiley · 5, 78
Grayson, Shira · 43
Greene, Casey S. · 24, 96
Greenside, Peyton · 15
Gupta, Ramneek · 86
Gursoy, Gamze · 93
H
Haas, David W. · 39
Haines, Jonathan · 73
Hakonarson, Hakon · 21, 88
Han, Jiali · 53
Han, Wontack · 16
Harari, Alexandre · 10
Harris, Kimberley J. · 46
Hebbring, Scott · 21, 88
Henry, Christopher · 7
Hernandez-Boussard, Tina · 48
Heymach, John V. · 91
Hill, Jane E. · 47
Himmelstein, Daniel · 96
Ho, Irvin · 18
Hoffmann, Thomas J. · 61
Houlahan, Kathleen E. · 5, 78
Hovde, Rachel · 45
Hsieh, Elena · 66
Hu, Qiwen · 24
Hu, Zhiqiang · 94
Hu, Zhiyue Tom · 17
Huang, Beibei · 75
Huang, Haiyan · 17
Huang, Kun · 25
Hyde, Thomas M. · 49
I
Iakoucheva, Lilia M. · 87
Iribarren, Carlos · 61
Iwai, Shoko · 64
J
Jaffe, Andrew E. · 49
Jain, Shantanu · 35, 85
Jarvik, Gail P. · 21, 88
Jiang, Yuexu · 6
Jin, Qiao · 37
Johnson, Kipp W. · 28
Johnson, Travis · 25
Jorde, Lynn B. · 81
Jung, Jae-Yoon · 19
Jung, Kenneth · 2
K
Kale, Dave C. · 2
Kalesinskas, Laurynas · 31
Kalsi, Gurpreet S. · 67
Kang, Byungkon · 57
Kang, Seokwoo · 79
Kaserer, Bettina · 55
Khan, Aly A. · 18
Kiefel, Helena · 64
Kim, Dokyoon · 57
Kim, Jeremie S. · 67
Kim, Seonghyeon · 63, 79
Kim, Woo Joo · 58
Klein, Teri E. · 48, 89
Kleinman, Joel E. · 49
Kloster, Kyle · 96
Kober, Kord M. · 54
Kohane, Isaac S. · 33
Krauss, Ronald M. · 61
Krunic, Milica · 55
Kullo, Iftikhar · 21, 88
Kwak, Minjung · 79
Kwon, Sunyoung · 32
L
Larson, Eric B. · 21, 88
Lau, Denise · 18
Le, Trang T. · 56
Lee, Byunghan · 32
Lee, Dohyeon · 63, 79
Lee, Garam · 57
Lee, Jae Kyung · 58
Lee, Jinhee · 63, 79
Lee, Kyubum · 98
LeNail, Alexander · 41
Leppert, Mark · 81
Levine, Jon D. · 54
Li, Binglan · 39
Li, Haiquan · 20, 53
Li, Jianrong · 20
Li, Kevin · 7
Li, Qike · 20
Lim, Sooyeon · 58
Linan, Margaret · 59
Lindsey, William · 7, 27
Liu, Zheng · 8
Liu, Ke · 60
Liu, Xiang · 37
Lu, Zhiyong · 98
Lucas, Anastasia M. · 21, 39, 88
Lugo-Martinez, Jose · 85
Lussier, Yves A. · 20
M
Machiraju, Raghu · 11, 83
Magge, Arjun · 9
Mallick, Parag · 11, 83
Marsit, Carmen J. · 80
Mastick, Judy · 54
Mayani, Rajiv · 11, 83
McKinney, Brett A. · 56
Medina, Marisa W. · 61
Meiler, Jens · 82
Miaskowski, Christine · 54
Miller, Jason E. · 61
Mooney, Sean D. · 87
Moore, Abigail · 62
Moore, Jason H. · 3, 56, 77
Moore, Sarah · 43
Mort, Matthew · 87
Mousavi, Parvin · 34
Müllauer, Leonhard · 55
Muse, Meghan E. · 47, 80
Mutlu, Onur · 67
N
Nagle, Michael · 96
Newbury, Patrick A. · 17, 60
Nguyen, Tin · 51
Nguyen, Tuan-Minh · 51
Nho, Kwangsik · 57
Nielsen, Agnes Martine · 86
Nielsen, Rikke Linnemann · 86
Noh, Ji Yun · 58
Nori, Anant V. · 67
O
O'Malley, A. James · 47
Oh, Dongpin · 63
Ouyang, Zhengqing · 23
P
Pagel, Kymberleigh A. · 87
Panda, Amaresh C. · 50
Parker, Joel S. · 69
Paskov, Kelley Marie · 19
Patel, Neel · 82
Paul, Steven · 54
Pearson, Ewan · 86
Pedersen, Brent S. · 81
Pendergrass, Sarah A. · 21, 88
Peng, Yifan · 98
Petersen, Curtis L. · 80
Peterson, Amy · 49
Peterson, Sandra E. · 46
Pfohl, Stephen · 2
Phan, Lon · 98
Poplin, Ryan · 15
Prasad, Niranjani · 14
Pyke, Rachel M. · 10
Pyman, Blake · 34
Q
Quinlan, Aaron R. · 81
R
Radivojac, Predrag · 35, 85, 87
Rajpurohit, Anandita · 49
Ramola, Rashika · 35
Ramsey, Stephen A. · 8
Rasmussen-Torvik, Laura · 21, 88
Ratnakar, Varun · 11, 83
Ravichandar, Jayamary Divya · 64
Reiman, Derek · 18
Renwick, Neil · 34
Richmond, Phillip A. · 71
Risch, Neil · 61
Ritchie, Marylyn D. · 21, 39, 48, 61, 88
Robson, Paul · 91
Rodriguez, Estefania · 74
Roychowdhury, Tanmoy · 65
Rudra, Pratyaydipta · 66
Rutherford, Erica · 64
S
Sahinalp, Cenk · 29
Salit, Marc · 15
Sanders, Charles R. · 82
Sarker, Abeed · 9
Sasani, Thomas A. · 81
Schaid, Daniel · 21, 88
Scherer, Steven · 46
Schwartz, J. M. · 72
Scotch, Matthew · 9
Sebat, Jonathan · 87
Sedghi, Alireza · 34
Semick, Stephen A. · 49
Senol Cali, Damla · 67
Sha, Lingdao · 18
Shafi, Adib · 51
Shah, Nigam H. · 2
Shin, Joo Heon · 49
Sicotte, Hugues · 46
Simmons, Sean · 29
Simpson, Chloe · 2
Sivley, R. Michael · 82
Skola, Dylan · 36
Sleiman, Patrick · 21, 88
Sliwoski, Gregory · 82
Smail, Craig · 31
Smoller, Jordan W. · 21, 88
Sohn, Kyung-Ah · 57
Song, Giltae · 63, 79
Srivastava, Arunima · 11, 83
Stanaway, Ian · 21, 88
Stewart, C. Allison · 91
Stockham, Nate Tyler · 19
Straub, Richard E. · 49
Stuart, Joshua M. · 5, 78
Subramanian, Lavanya · 67
Subramoney, Sreenivas · 67
Sullivan, Blair · 96
Suver, Christine · 43
T
Takenaka, Yoichi · 68
Tan, Timothy · 18
Tanigawa, Yosuke · 31
Tao, Ran · 49
Tao, Yifeng · 37
Theusch, Elizabeth · 61
Thomas, Nancy E. · 69
Tintle, Nathan · 7, 27
Titus, Alexander J. · 47
Toh, Hiroyuki · 52
Tran, Hai · 91
Trosset, Michael W. · 85
Tsai, Yihsuan · 69
Tsirigos, Konstantinos · 86
Tsui, Brian · 36, 70
Tyryshkin, Kathrin · 34
U
Ulrich, William S. · 49
Urbanowicz, Ryan J. · 56
V
Valencia, Cristian · 49
van der Lee, Robin · 71
Varma, Maya · 19
Venhuizen, Peter · 55
Verma, Anurag · 21, 39, 88
Verma, Shefali S. · 21, 39, 88
Veturi, Yogasudha · 21, 39, 88
Vitali, Francesca · 20
von Haeseler, Arndt · 55
W
Wagner, Jennifer · 43
Wall, Dennis Paul · 19
Wang, Duolin · 6
Wang, Haohan · 12, 37
Wang, Jing · 91
Wang, Junwen · 59
Wang, Liewei · 46
Wang, Tongxin · 25
Washington, Peter Yigitcan · 19
Wasserman, Wyeth W. · 71
Watson, J. · 72
Weeder, Benjamin · 8
Wei, Chih-Hsuan · 98
Wei, Qi · 8
Wei, Wei-Qi · 21, 88
Weinberger, Daniel R. · 49
Weinmaier, Thomas · 64
Weinshilboum, Richard · 46
Weissenbacher, Davy · 9
Weng, Chunhua · 21, 88
Westra, Jason · 27
Whaley, Ryan M. · 89
Wheeler, Nicholas · 73
Whirl-Carrillo, Michelle · 48, 89
White, Martha · 85
White, Ray · 81
Wilbanks, John · 43
Williams, Cranos · 4
Woon, Mark · 89
Wu, Yonggan · 64
Wu, Zhenglin · 12
X
Xi, Yuanxin · 91
Xiao, Madelyne · 74
Xing, Eric P. · 12, 37
Xu, Dong · 6
Y
Yao, Yao · 8
Ye, Wenting · 37
Ye, Yuting · 17
Ye, Yuzhen · 16
Yin, Fei · 53
Yoon, Sungroh · 32
Yu, Thomas · 11, 83
Z
Zawistowski, Matthew · 27
Zeng, William · 60
Zhang, Jie · 25
Zhang, Shuxing · 75
Zhang, Xinyuan · 21, 88
Zhang, Yuping · 23
Zhou, Jin · 53
Zhou, Kaixin · 86
Zietz, Michael · 96
Zook, Justin · 15