Computing Workflows for Biologists. Based on: Shade & Teal, Computing Workflows for Biologists: A Roadmap, PLOS Biology; Data Carpentry data organization lessons

Computing Workflows for Biologists: An Overview

Page 1: Computing Workflows for Biologists: An Overview

Computing Workflows for Biologists

Based on: Shade & Teal, Computing Workflows for Biologists: A Roadmap, PLOS Biology

Data Carpentry data organization lessons

Page 2: Computing Workflows for Biologists: An Overview

•  How many people here plan to analyze data with a computer in their work?

•  Are you working with other people on this analysis?

•  Do other people need to understand your analysis?

•  Do you need to remember and understand your analysis?

Page 3: Computing Workflows for Biologists: An Overview

Elements of computing

•  How data was generated (metadata)
•  Data
•  Data cleaning steps
•  Data analysis steps
•  Final plots and charts

Page 4: Computing Workflows for Biologists: An Overview

Data!

•  Keep raw data raw
•  Use meaningful names
•  Organize your data so computers can read it

Page 5: Computing Workflows for Biologists: An Overview

Keep raw data raw

•  What is raw data?
•  Why should I leave it alone?
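
One way to keep raw data raw is to make the original file read-only and do all editing on a copy. The following is a minimal sketch of that idea, not a prescribed method: the function names, the `raw`/`working` directory layout, and the file name `plate1.csv` are all hypothetical.

```python
import hashlib
import shutil
import stat
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def protect_and_copy(raw_file: Path, work_dir: Path) -> Path:
    """Mark raw_file read-only and return the path of a working copy."""
    # Remove write permission so the raw file is harder to edit by accident.
    raw_file.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    work_dir.mkdir(parents=True, exist_ok=True)
    working_copy = work_dir / raw_file.name
    # copyfile copies the contents only, so the copy stays writable.
    shutil.copyfile(raw_file, working_copy)
    return working_copy


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw" / "plate1.csv"  # hypothetical file name
        raw.parent.mkdir()
        raw.write_text("sample,weight_g\nA,1.2\n")
        checksum = sha256_of(raw)  # record this in your notes
        copy = protect_and_copy(raw, Path(tmp) / "working")
        copy.write_text("sample,weight_g\nA,1.20\n")  # edits go to the copy
        print("raw unchanged:", sha256_of(raw) == checksum)
```

Recording the checksum alongside the raw file gives a simple way to confirm later that the raw data has not been altered.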

Page 6: Computing Workflows for Biologists: An Overview

Use meaningful names

Page 7: Computing Workflows for Biologists: An Overview

Organize your data so computers can read it

(let's talk about spreadsheets)

http://www.datacarpentry.org/spreadsheet-ecology-lesson/00-intro.html

… also avoid formatting errors

Page 8: Computing Workflows for Biologists: An Overview

Organizing data in spreadsheets

The cardinal rules of using spreadsheet programs for data:
•  Put all your variables in columns - the thing you're measuring, like 'weight' or 'temperature'.
•  Put each observation in its own row.
•  Don't combine multiple pieces of information in one cell. Sometimes it just seems like one thing, but think if that's the only way you'll want to be able to use or sort that data.
•  Leave the raw data raw - don't mess with it!
•  Export the cleaned data to a text based format like CSV. This ensures that anyone can use the data, and is the format required by most data repositories.
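
The rules above can be illustrated with a short sketch that splits a combined cell into separate columns and exports the result as CSV. The example rows, the column names, and the `species_sex` code format (such as `DM-F`) are invented for illustration and are not taken from the lesson data.

```python
import csv
import io

# Hypothetical messy rows: species and sex are combined in one cell.
messy_rows = [
    {"plot": "1", "species_sex": "DM-F", "weight": "40"},
    {"plot": "1", "species_sex": "DM-M", "weight": "48"},
    {"plot": "2", "species_sex": "DO-F", "weight": "52"},
]


def tidy(rows):
    """Split the combined species_sex cell into two columns."""
    cleaned = []
    for row in rows:
        species, sex = row["species_sex"].split("-")
        cleaned.append(
            {
                "plot": row["plot"],
                "species": species,
                "sex": sex,
                "weight_g": row["weight"],  # unit goes in the column name
            }
        )
    return cleaned


def to_csv_text(rows):
    """Serialize rows (one observation per row) as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["plot", "species", "sex", "weight_g"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv_text(tidy(messy_rows)))
```

With species and sex in their own columns, the data can be sorted or filtered by either one, which is not possible while they share a cell.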

Page 9: Computing Workflows for Biologists: An Overview
Page 10: Computing Workflows for Biologists: An Overview

Formatting problems

http://www.datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes.html

Page 11: Computing Workflows for Biologists: An Overview

A Roadmap for the Computing Biologist

•  Consider the overarching goals of the analysis
•  Adopt an Iterative, Branching Pattern to Systematically Explore Options
•  Reproducibility Checkpoints
•  Taking Notes for Computational Analysis
•  Shared Responsibility: The Team Approach to Reproducibility and Data Management

Shade and Teal, Computing Workflows for Biologists: A Roadmap
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002303

Page 12: Computing Workflows for Biologists: An Overview

Consider the Overarching Goals of the Analysis

•  Working to address a given hypothesis will motivate different analysis strategies than conducting data exploration

Page 13: Computing Workflows for Biologists: An Overview

Reproducibility Checkpoints

Reproducibility checkpoints are places in a workflow devoted to scrutinizing its integrity
-  the workflow (or step in the workflow) can be seamlessly used (it doesn't crash halfway or return error messages)

-  the outcomes are consistent and validated across multiple, identical iterations

-  results should make biological sense
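
The second criterion, consistent outcomes across identical iterations, can be checked automatically. The sketch below reruns a step several times and compares a hash of each result. It is only one possible approach: `analysis_step` is a made-up stand-in for a real workflow step, and the results are assumed to be JSON-serializable.

```python
import hashlib
import json
import random


def analysis_step(values, seed=42):
    """Hypothetical analysis step: mean of a bootstrap resample."""
    rng = random.Random(seed)  # fixed seed so reruns are repeatable
    resample = [rng.choice(values) for _ in values]
    return {"n": len(resample), "mean": sum(resample) / len(resample)}


def fingerprint(result):
    """Hash a JSON-serializable result so runs can be compared."""
    encoded = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def checkpoint(step, data, runs=3):
    """Run step several times; True if every run gives the same result.

    Any exception raised by the step propagates, which flags a
    workflow that crashes partway through.
    """
    fingerprints = {fingerprint(step(data)) for _ in range(runs)}
    return len(fingerprints) == 1


if __name__ == "__main__":
    weights = [40.0, 48.0, 52.0, 45.5]  # invented example values
    print("consistent across runs:", checkpoint(analysis_step, weights))
```

A check like this covers only the first two criteria. Whether the results make biological sense still needs a person who knows the system.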

Page 14: Computing Workflows for Biologists: An Overview

Adopt an Iterative, Branching Pattern to Systematically Explore Options

Page 15: Computing Workflows for Biologists: An Overview

Taking Notes for Computational Analysis

•  Take notes like you would for experimental work

•  Comment code
•  Use version control (Github/Gitlab)

Page 16: Computing Workflows for Biologists: An Overview

What needs to go in notes:
-  Software versions used
-  Description of what the software is doing / goal of that step

-  Brief notes on deviations from default options
-  Workflows can include different software (e.g., PANDAseq to QIIME to R), and should also include all "formatting steps" needed to move between tools. Hopefully you don't need to manually format too much; avoid if possible.
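
Some of these notes can be generated rather than typed. The sketch below builds a note entry containing the step goal, any non-default options, and the versions of the Python interpreter and selected Python packages. The function names and the note layout are assumptions for illustration. Versions of command-line tools such as PANDAseq or QIIME are not covered here and would need to be recorded separately.

```python
import datetime
import importlib.metadata
import json
import platform


def describe_environment(packages):
    """Collect interpreter, OS, and Python package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "recorded_at": datetime.datetime.now().isoformat(timespec="seconds"),
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


def make_note(step, goal, non_default_options, packages):
    """Return one analysis-notebook entry as formatted JSON text."""
    note = {
        "step": step,
        "goal": goal,
        "non_default_options": non_default_options,
        "environment": describe_environment(packages),
    }
    return json.dumps(note, indent=2)


if __name__ == "__main__":
    # The step name, goal, option, and package list are invented examples.
    print(
        make_note(
            step="quality filtering",
            goal="remove low-quality reads before clustering",
            non_default_options={"min_length": 200},
            packages=["numpy", "pandas"],
        )
    )
```

Saving the resulting text next to the output of each step keeps the software versions tied to the results they produced.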

Page 17: Computing Workflows for Biologists: An Overview

Shared Responsibility: The Team Approach to Reproducibility and Data Management

We posit that integrity in computational analysis of biological data is enhanced if there is a sense of shared responsibility for ensuring reproducible workflows. Research teams that work together to develop and debug code, perform internal reproducibility checkpoints for each other, and generally hold one another accountable for high-quality results likely will enjoy a low manuscript retraction rate, high level of confidence in their results, and strong sense of collaboration.

You, your labmates and PI need to value the time it takes to do analyses reproducibly and correctly

Page 18: Computing Workflows for Biologists: An Overview

Shared responsibility

•  Shared storage and workspace can facilitate access to all group data

•  Using version control repositories can provide access to code and documentation (Github, Dropbox)

•  Setting expectations for 'reproducibility checkpoints' (team "hackathons": open-computer group meetings dedicated to analysis)

•  Paper reviews
•  Looking for help/support outside the lab (bioinformatics or user groups, office hours, StackOverflow)

Page 19: Computing Workflows for Biologists: An Overview

Looking for help

https://github.com/mblmicdiv/course2016/blob/master/bioinfo-resources.md

You are not alone

Survey responses

Page 20: Computing Workflows for Biologists: An Overview

Exercise

http://tinyurl.com/mbl-workflows