BIOL 4800/01 Syllabus, Fall 2016

Microbial Bioinformatics Tools
BIOL 4800/01, Fall 2016

When: 1230-1320 W, 1230-1520 F
Where: Coates 169
Instructor: J. Cameron Thrash, Ph.D.
Email: [email protected]
Twitter handle: @DrJCThrash
My office: A112 Life Sciences Annex
Office hours: By appointment (see below)
Prerequisite: Permission of the Department
Recommended textbooks: Practical Computing for Biologists, Haddock & Dunn; Phylogenomics, DeSalle & Rosenfeld
Course website(s): Moodle

eCommunication Policy: The best way to contact me is through email and/or twitter. I will try to respond to email or twitter messages within 6 hours, except on weekends or between 2200 and 0700. I may respond much quicker, because like you I am glued to my devices, but I do have a life outside of teaching and research (when I’m lucky). If you want an appointment, email me with 1) a short description of your issue, and 2) the desired time and 3) duration of the meeting. This will be subject to my availability. I accept and encourage twitter follows, but I do not accept any other social media friend requests.

Course description. In modern biology, the need for competence in computational tools is becoming as ubiquitous as that for traditional techniques like PCR. This course will provide basic training in navigating the command-line environment, utilizing common tools for genomics and ecology, submitting jobs to High Performance Computing clusters, and managing input and output files. It is NOT a programming class. Prior computational experience is helpful, but not required, as the goal of this course is to bring neophytes to a basic level of competence with some common bioinformatics methods. While the focus will be applying these to microbiological research, many tools are system/organism independent. Classes will take place in a computer lab and will have access to the LSU High Performance Computing (HPC) infrastructure. Each week will consist of two hours of theory/practical lecture and two hours of practical hands-on computer laboratory exercises (plus take-home exercises). The 4800/01 course can be taken for credit by upper-level undergraduates and graduate students equally (3 credit hours).

Course learning outcomes. By the end of this course, you should be able to:

• Understand the basics of a HPC infrastructure
• Remotely access a HPC cluster using the command line
• Navigate and manipulate the file structure within a Linux environment
• Complete basic file manipulation tasks using Linux commands
• Write basic shell scripts for parsing input and output files and sending jobs to the compute nodes
• Be able to assess program performance using data subsets to accurately estimate usage requirements
• Download genomic information from public databases directly to a HPC cluster
• Execute parallel (threaded) analyses using BLAST, HMMER3, and multiple alignment tools
• Understand the modern sequencing platform methodologies, capabilities and limitations
• Execute phylogenetic inferences from multiple alignments with different platforms
• Perform basic automated microbial genome assembly and annotation with the A5 pipeline
• Assess the core and pan-genome of a group of closely related microorganisms
• Complete operational taxonomic unit clustering and analysis using Mothur.

[Figure: Circular representation of multiple bacterial genomes (Grote et al. 2012 mBio)]

How we’re going to get there (Course Philosophy and Format)



Philosophy. This course is designed to get you to a basic working knowledge of many of the common tools used in modern bioinformatics, particularly as applied to microbial genomics. This is a combined lecture/laboratory course, with the laboratory portion spent utilizing computers instead of a typical wet lab. While there will be some lecture component during the Wednesday class, as much as possible this period will have active learning exercises instead of me simply standing around talking. Extensive research on education and the neuroscience of learning has shown there are much more effective ways for us to learn than by sitting and listening to a person stand in the front of the room and talk. You don’t have to come to class to learn that way anyway, for there are endless lectures and resources available online, many from the most eminent scientists in their fields. Some of these will be part of your pre-class assignments. Therefore, I endeavor to make class time as effective as possible for stimulating your investment in the material and activating all modes of thinking. The added benefit of being able to do this work yourselves in the lab portion of the class will help complete the process.

Classroom mechanics. The one-hour Wednesday class will be chiefly concerned with introducing the tools we will be learning to use. The three-hour Friday class will sometimes have a practical lecture for instructional purposes and then at least two hours to work on your assignments (detailed below), which will be graded according to a specific rubric, and involve a short written component. There will be some assigned readings/podcasts/web-videos/lecture slides you will be responsible for before each class period, listed as “Readings, etc.” in the class schedule, below. I will also introduce these pre-lecture assignments each week by email. There will be a short online Moodle quiz on the material for Wednesdays that closes one hour before class.

Computational Requirements. Our classes will be conducted in a computer lab. Prior to the first class, you need to have requested an account with LSU HPC for access to the supercomputer SuperMike-II (https://accounts.hpc.lsu.edu/allocations.php). You will use this account for completing classroom exercises and major assignments. For in-class exercises, you will be using the lab computers or a personal laptop and logging on through a terminal. For your major assignments, you will need another computer with terminal login capabilities so that you may access SuperMike-II remotely. All exercises involving significant computational effort will require the use of a class computing allocation. Details for operating in the HPC environment will be presented during the first two weeks of the course.

You will be graded on the following:
Quizzes             10%
Weekly Assignments  90%

There will be 1000 total points, graded accordingly:
A- 900-929; A 930-969; A+ >969
B- 800-829; B 830-869; B+ 870-899
C- 700-729; C 730-769; C+ 770-799
D- 600-629; D 630-669; D+ 670-699
F <600
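As a quick check on the grade bands above, the lookup can be sketched in a few lines of Python. This is only an illustration: the function name `letter_grade` is hypothetical, and it assumes point totals are whole numbers, so "A+ >969" is read as 970 and above.

```python
def letter_grade(points):
    """Map a whole-number point total (0-1000) to the letter bands listed above."""
    if not 0 <= points <= 1000:
        raise ValueError("points must be between 0 and 1000")
    # Lower bound of each band, highest first (assumes integer point totals).
    bands = [
        (970, "A+"), (930, "A"), (900, "A-"),
        (870, "B+"), (830, "B"), (800, "B-"),
        (770, "C+"), (730, "C"), (700, "C-"),
        (670, "D+"), (630, "D"), (600, "D-"),
    ]
    for floor, letter in bands:
        if points >= floor:
            return letter
    return "F"
```

For example, `letter_grade(869)` falls in the B band and `letter_grade(599)` is an F.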

Late assignments. Assignments will sacrifice 25% of their points per day they are late.

Dealing with challenges. Making mistakes and running into roadblocks is inherent to the process of learning. By design, this course will challenge you to figure out solutions, not simply give you the path from point A to point Z. The reason for this is that the struggle to overcome whatever challenge you face is where the true learning happens. Therefore, while your first impulse when you encounter a problem will probably be to email me, this will not be met with the type of response you are looking for UNLESS you have done all of the following problem-solving efforts first, in this order:

1. Think. Review your commands, inputs, and outputs. See if you can figure out what went wrong. Often it’s simply that a space is missing somewhere, a comma is misplaced, or you’re using a ‘ instead of a `. Try to find this yourself before bothering anyone else.

2. Consult the internet. The people who have developed and use the programs you are learning have created a massive amount of online resources. Often googling your error message will allow you to find the problem. Creative google searching can usually do the rest.

3. Consult your peers. Is everyone in the class having the same problem, or has someone else discovered the solution? While this may seem like the same thing as asking the professor, asking your classmates fosters peer-to-peer instruction, which reinforces concepts for those who get the opportunity to teach their solution and gives a different perspective to those who are seeking answers.

4. Consult Dr. Thrash. If you’re still having problems, by all means contact me or set up an appointment for office hours. Solutions are sometimes very simple but obscure.

Other important information

Absences/Code of Student Conduct. You are expected to have read, understand, and adhere to the LSU Absence Policy (http://saa.lsu.edu/important-lsu-policies) and the Code of Student Conduct (http://saa.lsu.edu/code-student-conduct). Our goal should be to learn, not simply to get grades. In science, as in life, your integrity is one of, if not the, most valuable asset you have. Preserve it, protect it, cultivate it.

Students with Disabilities. If anyone has a disability that may require accommodation, you should immediately contact the office of Services for Students with Disabilities to officially document the needed accommodation. The instructor must be presented with this documentation during the first week of class.

Time requirements. It is expected that you will have read or viewed the assigned material prior to class for the background necessary to properly participate in the activities and think critically about the concepts addressed. As a general policy, for each hour you are in class, you (the student) should plan to spend at least two hours preparing for the next class. Since this course is for three credit hours, you should expect to spend around six hours outside of class each week reading or working on assignments for the class.

Class schedule

The schedule is preliminary and subject to change depending on how quickly we are moving through the material. Details on your pre-class readings, etc., are supplied below.

Class  Week  Date                 Subject                                      Readings, etc.  Assignment
1      1     August 24th (W)      The command line environment                 1-6             -
2      1     August 26th (F)      HPC Tutorial                                 -               WA01
3      2     August 31st (W)      Basic Linux commands                         7-11            -
4      2     September 2nd (F)    More Linux, shell text editors               -               WA02
5      3     September 7th (W)    More Linux, database access, bash scripts    12              -
6      3     September 9th (F)    Download, manipulate fasta/Genbank files     -               WA03
7      4     September 14th (W)   Local alignment & dynamic programming        13-15           -
8      4     September 16th (F)   BLAST                                        -               WA04
9      5     September 21st (W)   Multiple sequence alignment                  16-19           -
10     5     September 23rd (F)   clustalW, MUSCLE, Gblocks                    -               WA05
11     6     September 28th (W)   Single gene phylogeny                        20-24           -
12     6     September 30th (F)   RAxML, FastTree, clustalw                    -               WA06
13     7     October 5th (W)      Genome sequencing                            25, 26          WA07
-      7     October 7th (F)      Fall Break                                   -               -
14     8     October 12th (W)     Hidden Markov Models                         27, 28          -
15     8     October 14th (F)     HMMER3                                       -               WA08
16     9     October 19th (W)     Intro to A5                                  29, 30          -
17     9     October 21st (F)     A5 pipeline and evaluation stats             -               WA09
18     10    October 26th (W)     Annotation: the SEED viewer                  31, 32          -
19     10    October 28th (F)     Analyzing and comparing annotations          -               WA10
20     11    November 2nd (W)     Microbiomes I: background                    33-35           -
21     11    November 4th (F)     Microbiomes I: OTUs and ecological analysis  -               WA11
22     12    November 9th (W)     Microbiomes II: Mothur                       36, 37          -
23     12    November 11th (F)    Microbiomes II: Mothur/analysis              -               WA12
24     13    November 16th (W)    Assessing core and pangenomes                38-40           -
25     13    November 18th (F)    Orthology determination, GetHomologues       -               WA13
-      14    November 23rd (W)    Thanksgiving                                 -               -
-      14    November 25th (F)    Thanksgiving                                 -               -
26     15    November 30th (W)    Presenting WA13                              -               -
27     15    December 2nd (F)     Presenting WA13                              -               -


Readings, etc. to be completed before class

1. Read the Syllabus
2. Apply for an HPC account (https://accounts.hpc.lsu.edu/login_request.php)
3. Software Carpentry Unix shell tutorials (https://v4.software-carpentry.org/shell/index.html) - Introduction, Files and Directories, Creating and Deleting
4. Review the HPCJumpstart.pdf and the HPC@LSU website (http://www.hpc.lsu.edu), familiarizing yourself with Accounts and Allocations, the LSU HPC Usage Policy, the User Guide for SuperMike II, and the Computational Biology tools available.
5. Haddock and Dunn, Chapter 4 (Optional)
6. Haddock and Dunn, Chapter 20 (Optional)
7. Software Carpentry Unix shell tutorials (https://v4.software-carpentry.org/shell/index.html) - Pipes and Filters through remaining tutorials.
8. Software Carpentry regex tutorials (https://v4.software-carpentry.org/regexp/index.html) - all.
9. Haddock and Dunn, Chapter 2 (Optional)
10. Haddock and Dunn, Chapter 3 (Optional)
11. Haddock and Dunn, Chapter 5 (Optional)
12. DeSalle and Rosenfeld, Chapter 4 (Optional)
13. Web research on the BLAST suite.
14. Eddy, S. R. (2004). What is dynamic programming? Nature Biotechnology, 22(7), 909–910. (Optional)
15. DeSalle and Rosenfeld, Chapter 5 (Optional)
16. Wikipedia intro to MSA: https://en.wikipedia.org/wiki/Multiple_sequence_alignment
17. Edgar, R. C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5), 1792–1797.
18. Castresana, J. (2000). Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Molecular Biology and Evolution, 17(4), 540–552.
19. DeSalle and Rosenfeld, Chapter 6 (Optional)
20. Wikipedia intro to phylogenetic trees: https://en.wikipedia.org/wiki/Phylogenetic_tree
21. Slides from Dr. Jonathan Eisen
22. Price, M. N., Dehal, P. S., & Arkin, A. P. (2010). FastTree 2 -- approximately maximum-likelihood trees for large alignments. PLoS One, 5(3), e9490.
23. Stamatakis, A. (2006). RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics, 22(21), 2688–2690.
24. DeSalle and Rosenfeld, Chapter 8 (Optional)
25. Metzker, M. L. (2010). Sequencing technologies - the next generation. Nature Reviews Genetics, 11(1), 31–46.
26. Evolution of DNA Sequencing Methods, talk by Jonathan Eisen: https://www.youtube.com/watch?v=s9UbA7VyISQ
27. Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14(9), 755.
28. Eddy, S. R. (2011). Accelerated Profile HMM Searches. PLOS Computational Biology, 7(10), e1002195.
29. Tritt, A., Eisen, J. A., Facciotti, M. T., & Darling, A. E. (2012). An Integrated Pipeline for de Novo Assembly of Microbial Genomes. PLoS One, 7(9), e42304.
30. Coil, D., Jospin, G., & Darling, A. E. (2015). A5-miseq: an updated pipeline to assemble microbial genomes from Illumina MiSeq data. Bioinformatics, 31(4), 587–589.
31. Edwards, D. J., & Holt, K. E. (2013). Beginner's guide to comparative bacterial genome analysis using next-generation sequence data. Microbial Informatics and Experimentation, 3(1), 2.
32. Overbeek, R., Olson, R., Pusch, G. D., Olsen, G. J., Davis, J. J., Disz, T., et al. (2014). The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Research, 42(Database issue), D206–14.
33. Goodrich, J. K., Di Rienzi, S. C., Poole, A. C., Koren, O., Walters, W. A., Caporaso, J. G., et al. (2014). Conducting a Microbiome Study. Cell, 158(2), 250–262.
34. Seekatz, A. M., Aas, J., Gessert, C. E., Rubin, T. A., Saman, D. M., Bakken, J. S., & Young, V. B. (2014). Recovery of the Gut Microbiome following Fecal Microbiota Transplantation. mBio, 5(3), e00893–14.
35. Schloss, P. D., Westcott, S. L., Ryabin, T., Hall, J. R., Hartmann, M., Hollister, E. B., et al. (2009). Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities. Applied and Environmental Microbiology, 75(23), 7537–7541.
36. Mothur MiSeq SOP
37. Kozich, J. J., Westcott, S. L., Baxter, N. T., Highlander, S. K., & Schloss, P. D. (2013). Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Applied and Environmental Microbiology, 79(17), 5112–5120.
38. Tettelin, H., Masignani, V., Cieslewicz, M. J., Donati, C., Medini, D., Ward, N. L., et al. (2005). Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proceedings of the National Academy of Sciences, 102(39), 13950–13955. doi:10.1073/pnas.0506758102
39. Grote, J., Thrash, J. C., Huggett, M. J., Landry, Z. C., Carini, P., Giovannoni, S. J., & Rappé, M. S. (2012). Streamlining and Core Genome Conservation among Highly Divergent Members of the SAR11 Clade. mBio, 3(5), e00252–12.
40. Contreras-Moreira, B., & Vinuesa, P. (2013). GET_HOMOLOGUES, a versatile software package for scalable and robust microbial pangenome analysis. Applied and Environmental Microbiology, 79(24), 7696–7701. doi:10.1128/AEM.02411-13

Staying Organized

Part of any good computational biology workflow is keeping your inputs and outputs organized, and all your processes and the contents of each file and directory documented. This not only allows someone else to understand and reproduce your work, but prevents you from forgetting the valuable steps you took to produce your work as well. It’s a horrible feeling to enter a directory a year after working on a project and not remember the contents of the files or how they were created. Throughout the semester, we will utilize a common core set of organizational procedures to facilitate keeping organized. Each week will have a separate directory in your home directory where you will store inputs and outputs, and possibly include subdirectories. You will document the contents of each file and subdirectory in a README document, including one for each directory. Finally, for each assignment you will create a brief summary report, described below. All three of these documents will be instrumental in your grade. (You may find that when you branch out on your own, a different system may suit you. Regardless, it is important to leave a transparent trail of all your work so that it can be recreated at any point in the future. The point here is simply to enforce good practices in computational biology, and we have to pick one system in advance.)

Reports

Each week you will be completing a series of tasks using a tool or set of tools. As part of your assignments you need to include a short written report with the elements below. If the report includes only text, create it with a text editor (e.g. nano) and save as ~/<working directory>/report.txt. If it includes graphics, save as a .docx or .pdf file, and upload to ~/<working directory>/report.docx (or .pdf).

1. Name(s), date
2. General (1-line) summary of objective(s) and purpose(s)
3. Working directory
4. Programs used, including basic scripts, and relevant reference(s)
5. Commands, inputs, outputs and results/evaluation of output for NEW operations.*
   • Organize in sections according to the rubric in outline form.
   • Include specific filenames.
   • For batch jobs, indicate the important command(s) and the name of the PBS script.
   • For repeating tasks, only detail the first instance, then indicate that this was repeated and note variation in input/output filenames. Similarly, output only has to be shown for a first instance.
   • *For operations you are repeating from previous assignments (blastp, muscle, etc.), you may simply reference a previous report for your workflow, but you need to be specific enough that one could find the correct command and repeat it.
6. Personal reflection. What did you learn in addition to using the assigned tool(s)? What would you do differently? What are you still confused about?
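The README convention described under Staying Organized (one line per file, a tab, then a brief description) is simple enough to generate with a small helper. The sketch below is one possible approach, not a course requirement: the function name `write_readme`, the `UNDOCUMENTED` placeholder, and the demo file names are all assumptions made for illustration.

```python
import os
import tempfile

def write_readme(directory, descriptions):
    """Write a tab-delimited README: one 'name<TAB>description' line per entry."""
    lines = []
    for name in sorted(os.listdir(directory)):
        if name == "README":
            continue  # do not list the README inside itself
        # Flag anything not yet described so it is easy to spot later.
        lines.append(f"{name}\t{descriptions.get(name, 'UNDOCUMENTED')}")
    readme_path = os.path.join(directory, "README")
    with open(readme_path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return readme_path

# Demo in a throwaway directory standing in for a weekly directory (hypothetical contents).
with tempfile.TemporaryDirectory() as week_dir:
    for name in ("paths.txt", "report.txt"):
        open(os.path.join(week_dir, name), "w").close()
    readme = write_readme(week_dir, {"paths.txt": "relative and absolute paths"})
    with open(readme) as handle:
        demo_text = handle.read()
```

Marking missing descriptions instead of skipping them keeps the README complete and makes gaps visible before an assignment is submitted.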


Weekly Assignments (WA)

WA01. Learning the command line and HPC tutorial (40 pts)

a. Path homework (relative vs. absolute). Create a ~/week1/paths.txt file on SuperMike-II with the following, one per line:
   i. A relative path to your home directory on Mike
   ii. The absolute path to your home directory on Mike
   iii. A relative path to your work directory
   iv. The absolute path to your work directory
   v. A relative path to another student’s home directory
   vi. The absolute path to another student’s home directory

b. Logging. Create a tab-delimited ~/week1/report.txt file that includes the unique commands you have used (history is very helpful here). Make sure to include what you have learned about the HPC@LSU and what you are still confused about. Create a tab-delimited ~/week1/README file that contains each of the files in each directory, a tab over, and a brief description of contents.
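The difference between relative and absolute paths in part (a) can be illustrated without logging in anywhere. The sketch below uses Python's `posixpath` module purely as a calculator. The directory names (`/home/alice`, `/work/alice`, `/home/bob`) are hypothetical and are not the actual SuperMike-II layout; a relative path always depends on the directory you start from.

```python
import posixpath

# Hypothetical locations, chosen only for illustration.
home = "/home/alice"
work = "/work/alice"
other_home = "/home/bob"
cwd = "/home/alice/week1"   # the directory the relative paths start from

# A relative path describes how to get there from cwd.
rel_home = posixpath.relpath(home, start=cwd)
rel_work = posixpath.relpath(work, start=cwd)
rel_other = posixpath.relpath(other_home, start=cwd)

# Joining a relative path back onto cwd and normalizing recovers the absolute path.
resolved_work = posixpath.normpath(posixpath.join(cwd, rel_work))
```

Here `rel_home` is `..`, `rel_other` is `../../bob`, and `resolved_work` equals the absolute path `/work/alice`. Changing `cwd` changes every relative path while the absolute paths stay the same.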

WA02. Building on our Linux skills and incorporating shell text editors (60 pts)

a. In groups of two (number-assigned), research and teach the class about one of the following (a good starting point will be the “cheat sheets” and their respective web locations):
   i. head, tail
   ii. wc
   iii. grep
   iv. > vs. >>
   v. sort
   vi. uniq
   vii. sed
   viii. piping

   You will need to explain the tool/concept, what it can be used for or where it is used, and provide an example using a basic fasta file. Each group will have 4 minutes. Perform the whole presentation on the command line (i.e., don’t create powerpoints for this). We will start on Wednesday and continue into Friday if necessary.

b. Create a set of five piped commands that utilize any three of the Linux tools (filters) you learned last week to manipulate a fasta file. Document these commands, their purpose, and the input and output in your report. Use nano to create the text while logged into SuperMike-II.

c. Logging. Create a tab-delimited ~/week2/report.txt file that includes the unique commands you have used (history is very helpful here). Create a tab-delimited ~/week2/README file that contains each of the files in each directory, a tab over, and a brief description of contents.
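The idea behind piping in WA02 is that each filter consumes the output of the previous one. The assignment itself is to be done with Linux commands on the command line. The sketch below only mirrors that data flow in Python so the stages can be inspected one at a time. The toy fasta records and the helper names (`grep`, `field`, `sort_uniq_count`) are made up for illustration.

```python
from collections import Counter

# Toy fasta content (hypothetical records): header lines start with '>'.
FASTA = """>seq1 Pelagibacter
MKT
>seq2 Escherichia
MAL
>seq3 Pelagibacter
MSS
"""

def grep(lines, needle):
    """Keep only lines containing needle, like the grep filter."""
    return (line for line in lines if needle in line)

def field(lines, index):
    """Pick one whitespace-separated field from each line."""
    return (line.split()[index] for line in lines)

def sort_uniq_count(items):
    """Count identical items and return them sorted, like sort followed by uniq -c."""
    return sorted(Counter(items).items())

# Roughly the same flow as: grep '>' toy.fasta | awk '{print $2}' | sort | uniq -c
headers = grep(FASTA.splitlines(), ">")
genera = field(headers, 1)
counts = sort_uniq_count(genera)
```

Each stage is lazy until `sort_uniq_count` consumes it, which loosely parallels how a shell pipe streams lines from one command to the next.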

WA03. Bash scripts, downloading, and Linux practice (70 pts)

Create two separate bash scripts using the tools we’ve covered thus far (or others that you know about), run them, and document their input/output in a ~/week3/README file:

a. A bash script you run in your working directory
b. A PBS submission script, submitted via qsub

Download the genome sequences for a group of closely related (same Family) microorganisms from GenBank and practice manipulating fasta files with basic linux commands.

c. Compile the names, GenBank entries, and phylogeny in a tab-delimited file for five organisms.
d. Download protein fastas, GenBank files, nucleotide fastas, and scaffold fastas for the organisms you identified (total = 20 files).
e. Using piping, linux commands, fastaToTab, tabToFasta, and genbank_to_fasta.py, complete the following:
   i. Convert your genbank files to .fasta
   ii. Compare the number of genes in the .faa files you downloaded with the converted files
   iii. Split your .faa files into fasta files of 100 genes each
   iv. Create a single file with all the gene annotations from all your genomes

f. Logging. README and report files.

WA04. Learning to execute the three basic aspects of the blast suite - making a database, searching sequences against a database, and querying the database for additional information using alternative search input. (80 pts)

a. Make a database from your genome scaffolds using makeblastdb.
b. Execute a protein blast against a translated nucleotide database with tblastn using 100 aa sequences from a given genome against another student’s scaffold database (with different organisms than yours).
c. Perform an assessment of blast efficiency using blastp against IMGv4, following the 10, 100, 1000 rule. You will need to split protein fasta sequences into subsets of 10, 100, and 1000 protein sequences and run blastp with these subsets against the database using 1, 2, 4, and 16 processors (12 total blastp job submissions). Using your standard out information, create two graphs of the performance, one with number of sequences vs. time for a given set of processors, the other with the amount of time per sequence search vs. number of processors. Produce a written summary of your results to accompany your graphs.
d. Using the information from your 10, 100, 1000 assessment, execute a threaded BLAST search of the hypothetical proteins in your five genome dataset against the IMGv4 database with the ideal number of processors and time requested.
e. Collect the sequences for the top 100 hits to one of these proteins using blastdbcmd. This will require you to use a set of piped linux commands in conjunction with blastdbcmd, which, among other things, accepts sequence accession numbers as input and outputs a variety of information, including the sequence data in fasta format.

f. README and report files.

WA05. Produce and visualize threaded multiple-sequence alignments, edit for poorly curated sites, and evaluate variance between two different alignment programs. (70 pts)

a. In-class research on fasta vs. phylip formatted alignments
   i. Description of the difference between the two
   ii. List of three tools/sites that do conversion
b. In-class research on alignment viewers - find three
c. Pick three different protein sequences in any of your genomes, including RecA, and get the top 20 hits from the IMGv4 database. For each gene, place the query and hit sequences into a single .faa file (21 total sequences for each of the 3 protein fastas).
d. Align each file with both MUSCLE and CLUSTAL. Visualize the alignments with graphical software and compare by eye. Describe how the alignment for a given gene differs between programs.
e. Edit with Gblocks using the settings from Sassera et al. 2011 and note the alignment variation. You may need to convert your alignments from phylip to fasta format first.

f. README and report files.

WA06. Execute phylogenetic inferences using three different programs for 2 of the genes from your genomes and 60 top hits to IMGv4. (70 pts)

a. Constructing a tree on paper
b. Identify a ribosomal protein with >100 amino acids, and one other gene in your genome that have to do with central metabolism, pathogenicity, or respiration. Perform a blast of their protein sequences against IMGv4, and collect the top 50 hits (initial sequence included). You will also need to pick 2 outgroups from blast hits with considerably less identity to your query sequence than the top 50 hits.
c. Perform MUSCLE alignments and cull with Gblocks using the settings from Sassera et al. 2011.
d. Execute phylogenetic analysis with PBS submissions to the cluster using ClustalW (to create a tree this time, not an alignment), FastTree2, and RAxML. The latter will need to be run in a threaded format with 16 processors. Some of these will need input data in phylip format. Use 1000 bootstraps for ClustalW and RAxML.
e. Using a tree-viewer, output and compare the topology and node confidence between the different trees for a given gene. Be sure to root your tree on your outgroup sequence. Compile a short summary with tree graphics describing these variables and add to your report.

f. README and report files.

WA07. Take Home Essay (there will be an in-class exercise worth 10 pts). WA10: Take Home Essay. Find a “microbiome” study in the primary literature, and identify an important organism in the system. Create a maximum two-page proposal for sequencing of this organism’s genome (50 pts). In your proposal you must include:

a. Your motivation for sequencing this particular strain. What makes it important? Why should we care about this organism? Include ecological data that demonstrate where/when this organism is found. What will sequencing this organism help you to understand about the system it’s in (think about things like physiology, population genetics, etc.)?
b. Phylogenetic context. Where in the tree of life does this organism sit? What are its closest relatives? Are any of these already sequenced?
c. Sequencing parameters. What technology would you like to use? Why? How much DNA will you need, how many lanes/runs/etc. will you use, and how much coverage do you expect to get?
d. All information needs to be completely referenced with primary literature, except in formulating how much coverage you will get for a given technology. This can come from websites, but must be cited nonetheless. No references, no credit for entire assignment.

WA08. Creating, searching, and scanning HMMs for several of your hypothetical proteins using HMMER3. (70 pts)

a. HMM scavenger hunt
b. Create HMMs for three of the hypothetical proteins in your various genomes that are ≥100 amino acids, using the top 30 blastp hits to IMG as the foundation for your multiple alignments
c. hmmsearch one of these HMMs against RefSeq. Compare your results to blastp searches against RefSeq, using only the top 15 hits from each search.
d. As a class, combine all of your HMMs into a single file and create a HMM database to search against using hmmscan. One person must host this database in their BIOL4800 directory.
e. hmmscan a new hypothetical protein sequence (you will have to identify a different hypothetical from one of your genomes) against this database, and note the best hit.
f. README and report files. Be sure to include which models your proteins match best. Note which group created that model, and what sequences were used to create it.

WA09. Execute a genome assembly using the A5 pipeline and analyze the completed assembly using assemstats2.py (70 pts). You will work in groups of two to complete this assignment, and you will need to consult two other groups to compare your assemblies. The completed assignment will include a copy of the output files from the A5 assembly, a table of your genome statistics, two tables comparing your genome stats to two other groups, a brief half-page write-up, and a completed submission to RAST.

a. In-class text assembly with your group
   i. http://ivory.idyll.org/blog/the-assembly-exercise.html (Titus Brown)
b. Reflective writing
   i. What worked and didn’t with your “genome” assembly?
   ii. If the “genome” had errors, how would you correct for them?
   iii. What would have happened if there were repeat segments in your genome?
c. Download the raw sequencing data for a microbial genome of choice in GenBank.
d. Complete an assembly of a microbial genome using A5.
e. Once assembled, examine the assembly stats created by A5.
   i. What do these statistics tell you about how well or poorly A5 assembled your genome sequences?
f. Evaluate your assembly compared to those of your classmates. Pick two others and create a table comparing all three assemblies.
   i. How do your genomes compare?
   ii. What made your genome have a “better” or “worse” assembly when compared to that of others?
   iii. This discussion should be roughly a half page long and address the above questions as well as the question in e.
g. Submit your assembly to RAST to be annotated.
h. README and report files (include the reflective writing done earlier).
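When comparing assemblies in WA09, it helps to know how a summary statistic is derived from the contig lengths. One commonly reported figure is N50: the length of the contig at which the running total, counted from the longest contig down, first reaches half of the assembly size. The sketch below is an illustration of that definition only. Whether A5 or assemstats2.py reports N50 in exactly this form is an assumption, and the contig lengths are made up.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (in bp)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs supplied")

def summarize(contig_lengths):
    """Collect a few basic statistics suitable for a comparison table."""
    return {
        "contigs": len(contig_lengths),
        "total_bp": sum(contig_lengths),
        "longest_bp": max(contig_lengths),
        "n50_bp": n50(contig_lengths),
    }

# Hypothetical contig lengths for one small assembly.
example = summarize([100, 200, 300, 400, 500])
```

With these lengths the total is 1500 bp, so the running sum 500 + 400 = 900 first passes 750 at the 400 bp contig, giving an N50 of 400. Fewer, longer contigs push N50 up, which is one reason it is often used as a rough indicator of assembly contiguity.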

WA10. ObtaintheannotatedoutputfromRAST,compareseveralsubsystemswithotherstrains,presentasummaryofthebasicfeaturesofyourassembly,andperformseveralrudimentaryanalysesbetweenthegenesfromyourassemblyandthoseofyourothergenomes(70pts).Workingwithyourproposal/assemblygroup:

a. SEED viewer scavenger hunt.

b. Determine the four organisms with sequenced genomes in the SEED database that are most closely related to the genome you’ve assembled. Hint: you’re going to want to identify conserved genes that can be used to look for other organisms.

c. Put together a table comparing your assembled contig data with four other closely related genomes, including the following information for all of them: Organism name, Isolation source, Genome size (Mbp), Number of contigs, GC content (%), Total number of genes, Number of protein-coding genes, Number of tRNAs.

d. Compare, for your assembled genome and the closest relative, the gene presence/absence profile for the following subsystems: glycolysis and gluconeogenesis, flagellar motility, and multi-drug efflux pumps.

e. Pick three proteins from your assembly, including one ribosomal protein, BLAST these proteins against RefSeq, and construct a phylogeny for each gene with at least 15 members, including an outgroup. This will result in three total trees. You may use whatever alignment and tree-building algorithm you wish, making sure to cull with Gblocks. Output the trees and summarize how the topologies are similar to and/or different from each other. Also note whether or not the genomes of your other four organisms are present in the RefSeq results, and whether or not they are still the closest neighbors to your genome.

f. README and report files.

WA11. Learning to measure and estimate microbial diversity in preparation for using Mothur (50 pts).

a. In-class short reflective essay on the definition of OTUs and how they are used in microbial ecology (explain from memory). Then take 10 minutes to do web research on primary literature. Redefine OTUs, citing your references.

b. Mark, release, recapture / rarefaction / relative abundance worksheet. Questions and writings posed in the worksheet need to be part of the report, along with final tables and graphs.
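
The quantities on the worksheet can also be checked by hand with a few lines of Python. The sketch below is illustrative only and rests on several assumptions not stated in this syllabus: mark-recapture uses the Lincoln-Petersen estimator, rarefaction uses the analytical expected-richness formula rather than random subsampling, Chao1 uses the bias-corrected form, and inverse Simpson is computed as 1 / sum(p_i^2). Mothur's own calculators may use slightly different finite-sample forms. The OTU names and counts are made up.

```python
from math import comb
from typing import Dict, List


def lincoln_petersen(marked_first: int, caught_second: int, recaptured: int) -> float:
    """Population size estimate from one mark-release-recapture round."""
    return marked_first * caught_second / recaptured


def relative_abundance(counts: Dict[str, int]) -> Dict[str, float]:
    """Fraction of all reads assigned to each OTU."""
    total = sum(counts.values())
    return {otu: n / total for otu, n in counts.items()}


def rarefied_richness(counts: List[int], depth: int) -> float:
    """Expected number of OTUs in a random subsample of `depth` reads, drawn without replacement."""
    total = sum(counts)
    if depth > total:
        raise ValueError("depth cannot exceed the total number of reads")
    all_draws = comb(total, depth)
    # For each OTU: 1 minus the probability that the subsample misses it entirely.
    return sum(1 - comb(total - n, depth) / all_draws for n in counts)


def chao1(counts: List[int]) -> float:
    """Bias-corrected Chao1 richness estimate from singleton and doubleton counts."""
    observed = sum(1 for n in counts if n > 0)
    singletons = sum(1 for n in counts if n == 1)
    doubletons = sum(1 for n in counts if n == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


def inverse_simpson(counts: List[int]) -> float:
    """Inverse Simpson diversity, computed here as 1 / sum of squared proportions."""
    total = sum(counts)
    return 1 / sum((n / total) ** 2 for n in counts if n > 0)


# Hypothetical OTU table for one sample (100 reads in total).
sample = {"OTU1": 50, "OTU2": 30, "OTU3": 14, "OTU4": 2, "OTU5": 2, "OTU6": 1, "OTU7": 1}
reads = list(sample.values())

print("relative abundance of OTU1:", relative_abundance(sample)["OTU1"])
print("expected OTUs at 20 reads:", round(rarefied_richness(reads, 20), 2))
print("Chao1:", round(chao1(reads), 2))
print("inverse Simpson:", round(inverse_simpson(reads), 2))
print("Lincoln-Petersen:", lincoln_petersen(20, 25, 5))
```

Evaluating rarefied_richness over a range of depths gives the points of a rarefaction curve of OTUs against sampling effort.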

WA12. OTU analysis of LSU Mikereauxbiome data using Mothur (90 pts). You will be working in groups of two, analyzing data from four samples in one of several different sample types. Most of your workflow will come from following the Mothur MiSeq SOP, but there will be some steps that are modified and/or left out from the SOP. We will identify some of these in class.

a. Create a workflow for your analysis. Include as many specific commands as possible, and annotate these with their purpose.

b. Obtain the sequence data for the samples with which you will be working, being sure to include both forward and reverse reads. Perform a complete Mothur run, through OTU clustering and taxonomic assignment.

c. Complete the following analyses of your data:
   i. Rarefy your data
   ii. Chao1 richness and inverse Simpson diversity
   iii. Rarefaction curves of OTUs vs. sampling effort
   iv. Table of # seqs, coverage, # OTUs, Inverse Simpson
   v. Relative abundance heatmap with Jaccard index
   vi. Venn diagram of shared OTUs, including predicted # of overlapping OTUs
   vii. Create four rank abundance curves, one each for the top 20 OTUs in each of your samples. Compare your top 10 taxa with those from the other groups.

WA13. Complete core and pan-genome analysis of closely related genomes with GetHomologues, reporting the various additional outputs. Incorporate additional material from some of the other tools you’ve dealt with thus far. Format for final presentation to the class. You may work individually or in groups. You will need to cd to the get_homlogues directory in /project/jcthrash/tools/, complete your work, and then move your output to your home directory. ONLY ONE RUN CAN BE COMPLETED AT A TIME, so if you want to use this option, you must coordinate with your classmates. Another option is to install the get_homologues pipeline directly in your home directory. (100 pts)

a. GetHomologues scavenger hunt

b. Pick a genome from one taxon in the top 5 OTUs of your Mothur analysis, plus six additional closely related strains (same genus), and perform two clustering runs using get_homologues, one with COGtriangles, one with OrthoMCL

c. Create Venn diagram output showing the intersection of these clusters

d. Using the intersecting clusters only, create a dendrogram showing relationships among your taxa with gene presence-absence information

e. Using the intersecting clusters, create core and pan-genome extrapolation curves like those in Tettelin et al. 2005, Figs 2, 3.

f. Create a comparison of the following:
   i. The dendrogram created in d, above
   ii. Phylogenetic trees based on the amino acid sequences from RecA, a ribosomal protein, and a DNA polymerase

g. Integrate this material with microbial ecology data from your Mothur analysis. What is the relative abundance of your organisms in the datasets you examined? Where do they sit on the rank-abundance curves?

h. Identify at least three candidate pathways present in your organism that can help explain why it is dominant in your ecological data. Describe how the gene content of these pathways is different or similar to the other 4 organisms with which you are comparing it.

i. Create a polished presentation of this information, with no more than one slide per element, with brief summaries for each section, that is no more than 15 minutes long.

j. Prepare one question for each group based on their presentation.

k. README and report files.
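
To build intuition for the core and pan-genome curves, the idea can be sketched in Python independently of get_homologues. The example below is a rough illustration under stated assumptions, not the pipeline's method: it assumes you already have, for each genome, the set of gene cluster IDs present in it, and it averages core and pan-genome sizes over every possible order of adding genomes. The genome names and cluster IDs are hypothetical. Exhaustive ordering is only practical for a handful of genomes, and the curve fitting used for extrapolation in Tettelin et al. 2005 is not shown.

```python
from itertools import permutations
from statistics import mean
from typing import Dict, List, Set, Tuple


def core_pan_sizes(order: List[str], clusters: Dict[str, Set[str]]) -> List[Tuple[int, int]]:
    """Core and pan-genome sizes after adding genomes one at a time in `order`."""
    core = set(clusters[order[0]])
    pan = set(clusters[order[0]])
    sizes = [(len(core), len(pan))]
    for genome in order[1:]:
        core &= clusters[genome]  # clusters shared by every genome so far
        pan |= clusters[genome]   # clusters seen in at least one genome so far
        sizes.append((len(core), len(pan)))
    return sizes


def mean_curves(clusters: Dict[str, Set[str]]) -> Tuple[List[float], List[float]]:
    """Average core and pan-genome sizes over all genome addition orders."""
    genomes = sorted(clusters)
    runs = [core_pan_sizes(list(order), clusters) for order in permutations(genomes)]
    steps = range(len(genomes))
    core_curve = [mean(run[i][0] for run in runs) for i in steps]
    pan_curve = [mean(run[i][1] for run in runs) for i in steps]
    return core_curve, pan_curve


# Hypothetical presence/absence data: genome name -> gene cluster IDs it contains.
clusters = {
    "strainA": {"c1", "c2", "c3", "c4"},
    "strainB": {"c1", "c2", "c3", "c5"},
    "strainC": {"c1", "c2", "c6"},
}

core_curve, pan_curve = mean_curves(clusters)
for n, (core, pan) in enumerate(zip(core_curve, pan_curve), start=1):
    print(f"{n} genomes: core = {core:.2f}, pan = {pan:.2f}")
```

The last point of each curve is the core genome (clusters shared by all strains) and the pan-genome (all clusters observed) for the full set. The same per-genome sets are the presence-absence information a dendrogram would be built from.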

Additional Resources

1. Learning regex: http://www.regexr.com
2. Linux cheat sheet: http://peoplesofttutorial.com/learn-basic-linux-commands-using-linux-cheat-sheet/
3. Rosalind programming training: http://rosalind.info/problems/locations/
4. Software Carpentry training: http://software-carpentry.org/index.html
5. Elements of Bioinformatics: http://elements.eaglegenomics.com