Computing Workflows for Biologists: An Overview

Computing Workflows for Biologists

Based on: Shade & Teal, Computing Workflows for Biologists: A Roadmap, PLOS Biology

Data Carpentry data organization lessons

•  How many people here plan to analyze data with a computer in their work?
•  Are you working with other people on this analysis?
•  Do other people need to understand your analysis?
•  Do you need to remember and understand your analysis?

Elements of computing

•  How data was generated (metadata)
•  Data
•  Data cleaning steps
•  Data analysis steps
•  Final plots and charts

Data!

•  Keep raw data raw
•  Use meaningful names
•  Organize your data so computers can read it

Keep raw data raw

•  What is raw data?
•  Why should I leave it alone?

Use meaningful names

Organize your data so computers can read it

(let’s talk about spreadsheets)

http://www.datacarpentry.org/spreadsheet-ecology-lesson/00-intro.html

…also avoid formatting errors

Organizing data in spreadsheets

The cardinal rules of using spreadsheet programs for data:
•  Put all your variables in columns - the thing you're measuring, like 'weight' or 'temperature'.
•  Put each observation in its own row.
•  Don't combine multiple pieces of information in one cell. Sometimes it just seems like one thing, but think if that's the only way you'll want to be able to use or sort that data.
•  Leave the raw data raw - don't mess with it!
•  Export the cleaned data to a text-based format like CSV. This ensures that anyone can use the data, and is the format required by most data repositories.
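The spreadsheet rules above can be sketched in a few lines of Python. This is only an illustration using the standard library: the column names, the combined "species_sex" cell, and the values are hypothetical and do not come from the lesson data.

```python
import csv
import io

# Hypothetical raw records: one cell combines species code and sex ("DM_F").
raw_rows = [
    {"plot": "1", "species_sex": "DM_F", "weight": "40"},
    {"plot": "1", "species_sex": "DM_M", "weight": "48"},
    {"plot": "2", "species_sex": "DS_F", "weight": "117"},
]


def tidy(rows):
    """Return new rows with one piece of information per column.

    The input rows are not modified, so the raw data stays raw.
    """
    cleaned = []
    for row in rows:
        species, sex = row["species_sex"].split("_")
        cleaned.append(
            {"plot": row["plot"], "species": species, "sex": sex, "weight": row["weight"]}
        )
    return cleaned


def to_csv_text(rows):
    """Serialize rows as CSV: one variable per column, one observation per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["plot", "species", "sex", "weight"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


cleaned_rows = tidy(raw_rows)
csv_text = to_csv_text(cleaned_rows)
print(csv_text)
```

In real use the cleaned rows would be written to a new file next to the raw one, never over it.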

Formatting problems

http://www.datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes.html

A Roadmap for the Computing Biologist

•  Consider the Overarching Goals of the Analysis
•  Adopt an Iterative, Branching Pattern to Systematically Explore Options
•  Reproducibility Checkpoints
•  Taking Notes for Computational Analysis
•  Shared Responsibility: The Team Approach to Reproducibility and Data Management

Shade and Teal, Computing Workflows for Biologists: A Roadmap
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002303

Consider the Overarching Goals of the Analysis

•  Working to address a given hypothesis will motivate different analysis strategies than conducting data exploration

Reproducibility Checkpoints

Reproducibility checkpoints are places in a workflow devoted to scrutinizing its integrity:
-  the workflow (or step in the workflow) can be seamlessly used (it doesn’t crash halfway or return error messages)
-  the outcomes are consistent and validated across multiple, identical iterations
-  results should make biological sense
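One way to picture such a checkpoint is a small Python helper that reruns a step several times and compares the outputs. This is a sketch under stated assumptions: `example_analysis`, `fingerprint`, and `checkpoint` are hypothetical names invented here, the weights are made-up numbers, and the range check is only a stand-in for a real "does this make biological sense" test.

```python
import hashlib
import json
import random


def example_analysis(values, seed=42):
    """Hypothetical analysis step: bootstrap estimate of the mean with a fixed seed."""
    rng = random.Random(seed)
    means = []
    for _ in range(200):
        sample = [rng.choice(values) for _ in values]
        means.append(sum(sample) / len(sample))
    return {"bootstrap_mean": round(sum(means) / len(means), 6)}


def fingerprint(result):
    """Hash a JSON-serializable result so that runs can be compared exactly."""
    encoded = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def checkpoint(step, data, runs=3):
    """Run a step several times; fail loudly if it crashes or the outputs differ."""
    fingerprints = {fingerprint(step(data)) for _ in range(runs)}
    if len(fingerprints) != 1:
        raise AssertionError("step gave different results across identical runs")
    return step(data)


weights = [40.0, 48.0, 117.0, 52.0, 45.0]
result = checkpoint(example_analysis, weights)

# Sanity check: a mean of the observed weights must lie within their range.
assert min(weights) <= result["bootstrap_mean"] <= max(weights)
```

Fixing the random seed is what makes identical iterations comparable here; a step that draws unseeded random numbers would fail the checkpoint.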

Adopt an Iterative, Branching Pattern to Systematically Explore Options

Taking Notes for Computational Analysis

•  Take notes like you would for experimental work
•  Comment code
•  Use version control (GitHub/GitLab)

What needs to go in notes:
-  Software versions used
-  Description of what the software is doing/goal of that step
-  Brief notes on deviations from default options
-  Workflows can include different software (e.g., PANDAseq to QIIME to R), and should also include all “formatting steps” needed to move between tools (hopefully you don’t need to manually format too much; avoid if possible)
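Part of this note-taking can be automated. The sketch below records the Python version, the platform, and package versions alongside a short description of the step. The step name, goal, option name, and package list are hypothetical examples, and tools outside Python (such as PANDAseq or QIIME) would still need their versions noted separately.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Collect interpreter, OS, and package versions for an analysis note."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


# Hypothetical note for one workflow step; every value here is illustrative.
note = {
    "step": "quality filtering",
    "goal": "remove low-quality reads before clustering",
    "non_default_options": {"min_quality": 25},
    "environment": describe_environment(["numpy", "pandas"]),
}
print(json.dumps(note, indent=2))
```

Saving a note like this next to each output file keeps the versions and non-default options attached to the result they produced.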

Shared Responsibility: The Team Approach to Reproducibility and Data Management

We posit that integrity in computational analysis of biological data is enhanced if there is a sense of shared responsibility for ensuring reproducible workflows. Research teams that work together to develop and debug code, perform internal reproducibility checkpoints for each other, and generally hold one another accountable for high-quality results likely will enjoy a low manuscript retraction rate, high level of confidence in their results, and strong sense of collaboration.

You, your labmates and PI need to value the time it takes to do analyses reproducibly and correctly

Shared responsibility

•  Shared storage and workspace can facilitate access to all group data
•  Using version control repositories can provide access to code and documentation (GitHub, Dropbox)
•  Setting expectations for ‘reproducibility checkpoints’ (team “hackathons”: open-computer group meetings dedicated to analysis)
•  Paper reviews
•  Looking for help/support outside the lab (bioinformatics or user groups, office hours, Stack Overflow)

Looking for help

https://github.com/mblmicdiv/course2016/blob/master/bioinfo-resources.md

You are not alone

Survey responses

Exercise

http://tinyurl.com/mbl-workflows
