BIOL 4800/01 Syllabus Fall 2016
Microbial Bioinformatics Tools
BIOL 4800/01, Fall 2016
When: 1230-1320 W, 1230-1520 F
Where: Coates 169
Instructor: J. Cameron Thrash, Ph.D.
Email: [email protected]
Twitter: @DrJCThrash
My office: A112 Life Sciences Annex
Office hours: By appointment (see below)
Prerequisite: Permission of the Department
Recommended textbooks: Practical Computing for Biologists, Haddock & Dunn; Phylogenomics, DeSalle & Rosenfeld
Course website(s): Moodle

eCommunication Policy: The best way to contact me is through email and/or Twitter. I will try to respond to email or Twitter messages within 6 hours, except on weekends or between 2200 and 0700. I may respond much more quickly, because like you I am glued to my devices, but I do have a life outside of teaching and research (when I’m lucky). If you want an appointment, email me with 1) a short description of your issue, 2) the desired time, and 3) the duration of the meeting. This will be subject to my availability. I accept and encourage Twitter follows, but I do not accept any other social media friend requests.

Course description. In modern biology, the need for competence in computational tools is becoming as ubiquitous as that for traditional techniques like PCR. This course will provide basic training in navigating the command-line environment, utilizing common tools for genomics and ecology, submitting jobs to High Performance Computing clusters, and managing input and output files. It is NOT a programming class. Prior computational experience is helpful, but not required, as the goal of this course is to bring neophytes to a basic level of competence with some common bioinformatics methods. While the focus will be applying these to microbiological research, many tools are system/organism independent. Classes will take place in a computer lab and will have access to the LSU High Performance Computing (HPC) infrastructure. Each week will consist of two hours of theory/practical lecture and two hours of practical hands-on computer laboratory exercises (plus take-home exercises). The 4800/01 course can be taken for credit by upper-level undergraduates and graduate students equally (3 credit hours).

Course learning outcomes. By the end of this course, you should be able to:
• Understand the basics of an HPC infrastructure
• Remotely access an HPC cluster using the command line
• Navigate and manipulate the file structure within a Linux environment
• Complete basic file manipulation tasks using Linux commands
• Write basic shell scripts for parsing input and output files and sending jobs to the compute nodes
• Assess program performance using data subsets to accurately estimate usage requirements
• Download genomic information from public databases directly to an HPC cluster
• Execute parallel (threaded) analyses using BLAST, HMMER3, and multiple alignment tools
• Understand the modern sequencing platform methodologies, capabilities, and limitations
• Execute phylogenetic inferences from multiple alignments with different platforms
• Perform basic automated microbial genome assembly and annotation with the A5 pipeline
• Assess the core and pan-genome of a group of closely related microorganisms
• Complete operational taxonomic unit clustering and analysis using Mothur
How we’re going to get there (Course Philosophy and Format)

[Figure: Circular representation of multiple bacterial genomes (Grote et al. 2012, mBio)]
Philosophy. This course is designed to get you to a basic working knowledge of many of the common tools used in modern bioinformatics, particularly as applied to microbial genomics. This is a combined lecture/laboratory course, with the laboratory portion spent utilizing computers instead of a typical wet lab. While there will be some lecture component during the Wednesday class, as much as possible this period will have active learning exercises instead of me simply standing around talking. Extensive research on education and the neuroscience of learning has shown there are much more effective ways for us to learn than by sitting and listening to a person stand in the front of the room and talk. You don’t have to come to class to learn that way anyway, for there are endless lectures and resources available online, many from the most eminent scientists in their fields. Some of these will be part of your pre-class assignments. Therefore, I endeavor to make class time as effective as possible for stimulating your investment in the material and activating all modes of thinking. The added benefit of being able to do this work yourselves in the lab portion of the class will help complete the process.

Classroom mechanics. The one-hour Wednesday class will be chiefly concerned with introducing the tools we will be learning to use. The three-hour Friday class will sometimes have a practical lecture for instructional purposes and then at least two hours to work on your assignments (detailed below), which will be graded according to a specific rubric, and involve a short written component. There will be some assigned readings/podcasts/web-videos/lecture slides you will be responsible for before each class period, listed as “Readings, etc.” in the class schedule, below. I will also introduce these pre-lecture assignments each week by email. There will be a short online Moodle quiz on the material for Wednesdays that closes one hour before class.

Computational Requirements. Our classes will be conducted in a computer lab. Prior to the first class, you need to have requested an account with LSU HPC for access to the supercomputer SuperMike-II (https://accounts.hpc.lsu.edu/allocations.php). You will use this account for completing classroom exercises and major assignments. For in-class exercises, you will be using the lab computers or a personal laptop and logging on through a terminal. For your major assignments, you will need another computer with terminal login capabilities so that you may access SuperMike-II remotely. All exercises involving significant computational effort will require the use of a class computing allocation. Details for operating in the HPC environment will be presented during the first two weeks of the course.

You will be graded on the following:
Quizzes 10%
Weekly Assignments 90%
There will be 1000 total points, graded accordingly:
A- 900-929; A 930-969; A+ >969
B- 800-829; B 830-869; B+ 870-899
C- 700-729; C 730-769; C+ 770-799
D- 600-629; D 630-669; D+ 670-699
F <600
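If you want to check where a point total falls, the scale above can be written as a small lookup. This is only a sketch: the function name `letter_grade` is made up for illustration, and it assumes point totals are whole numbers (so "A+ >969" is treated as 970 and up).

```python
def letter_grade(points):
    """Map a whole-number point total (out of 1000) to a letter grade.

    The bands are copied from the grading scale in this syllabus.
    Assumption: totals are integers, so ">969" is treated as ">= 970".
    """
    bands = [
        (970, "A+"), (930, "A"), (900, "A-"),
        (870, "B+"), (830, "B"), (800, "B-"),
        (770, "C+"), (730, "C"), (700, "C-"),
        (670, "D+"), (630, "D"), (600, "D-"),
    ]
    # Walk the bands from highest to lowest and return the first one reached.
    for floor, grade in bands:
        if points >= floor:
            return grade
    return "F"


if __name__ == "__main__":
    for total in (985, 930, 899, 705, 599):
        print(total, letter_grade(total))
```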
Late assignments. Assignments will lose 25% of their points for each day they are late.

Dealing with challenges. Making mistakes and running into roadblocks is inherent to the process of learning. By design, this course will challenge you to figure out solutions, not simply give you the path from point A to point Z. The reason for this is that the struggle to overcome whatever challenge you face is where the true learning happens. Therefore, while your first impulse when you encounter a problem will probably be to email me, this will not be met with the type of response you are looking for UNLESS you have done all of the following problem-solving efforts first, in this order:
1. Think. Review your commands, inputs, and outputs. See if you can figure out what went wrong. Often it’s simply that a space is missing somewhere, a comma is misplaced, or you’re using a ‘ instead of a `. Try to find this yourself before bothering anyone else.
2. Consult the internet. The people who have developed and use the programs you are learning have created a massive amount of online resources. Often googling your error message will allow you to find the problem. Creative google searching can usually do the rest.
3. Consult your peers. Is everyone in the class having the same problem, or has someone else discovered the solution? While this may seem like the same thing as asking the professor, asking your classmates fosters peer-to-peer instruction, which reinforces concepts for those who get the opportunity to teach their solution and gives a different perspective to those who are seeking answers.
4. Consult Dr. Thrash. If you’re still having problems, by all means contact me or set up an appointment for office hours. Solutions are sometimes very simple but obscure.
Other important information

Absences/Code of Student Conduct. You are expected to have read, to understand, and to adhere to the LSU Absence Policy (http://saa.lsu.edu/important-lsu-policies) and the Code of Student Conduct (http://saa.lsu.edu/code-student-conduct). Our goal should be to learn, not simply to get grades. In science, as in life, your integrity is one of, if not the, most valuable assets you have. Preserve it, protect it, cultivate it.

Students with Disabilities. If anyone has a disability that may require accommodation, you should immediately contact the office of Services for Students with Disabilities to officially document the needed accommodation. The instructor must be presented with this documentation during the first week of class.

Time requirements. It is expected that you will have read or viewed the assigned material prior to class for the background necessary to properly participate in the activities and think critically about the concepts addressed. As a general policy, for each hour you are in class, you (the student) should plan to spend at least two hours preparing for the next class. Since this course is for three credit hours, you should expect to spend around six hours outside of class each week reading or working on assignments for the class.

Class schedule

The schedule is preliminary and subject to change depending on how quickly we are moving through the material. Details on your pre-class readings, etc., are supplied below.

Class | Week | Date | Subject | Readings, etc. | Assignment
1 | 1 | August 24th (W) | The command line environment | 1-6 | -
2 | 1 | August 26th (F) | HPC Tutorial | - | WA01
3 | 2 | August 31st (W) | Basic Linux commands | 7-11 | -
4 | 2 | September 2nd (F) | More Linux, shell text editors | - | WA02
5 | 3 | September 7th (W) | More Linux, database access, bash scripts | 12 | -
6 | 3 | September 9th (F) | Download, manipulate fasta/Genbank files | - | WA03
7 | 4 | September 14th (W) | Local alignment & dynamic programming | 13-15 | -
8 | 4 | September 16th (F) | BLAST | - | WA04
9 | 5 | September 21st (W) | Multiple sequence alignment | 16-19 | -
10 | 5 | September 23rd (F) | clustalW, MUSCLE, Gblocks | - | WA05
11 | 6 | September 28th (W) | Single gene phylogeny | 20-24 | -
12 | 6 | September 30th (F) | RAxML, FastTree, clustalw | - | WA06
13 | 7 | October 5th (W) | Genome sequencing | 25, 26 | WA07
- | 7 | October 7th (F) | Fall Break | - | -
14 | 8 | October 12th (W) | Hidden Markov Models | 27, 28 | -
15 | 8 | October 14th (F) | HMMER3 | - | WA08
16 | 9 | October 19th (W) | Intro to A5 | 29, 30 | -
17 | 9 | October 21st (F) | A5 pipeline and evaluation stats | - | WA09
18 | 10 | October 26th (W) | Annotation: the SEED viewer | 31, 32 | -
19 | 10 | October 28th (F) | Analyzing and comparing annotations | - | WA10
20 | 11 | November 2nd (W) | Microbiomes I: background | 33-35 | -
21 | 11 | November 4th (F) | Microbiomes I: OTUs and ecological analysis | - | WA11
22 | 12 | November 9th (W) | Microbiomes II: Mothur | 36, 37 | -
23 | 12 | November 11th (F) | Microbiomes II: Mothur/analysis | - | WA12
24 | 13 | November 16th (W) | Assessing core and pangenomes | 38-40 | -
25 | 13 | November 18th (F) | Orthology determination, GetHomologues | - | WA13
- | 14 | November 23rd (W) | Thanksgiving | - | -
- | 14 | November 25th (F) | Thanksgiving | - | -
26 | 15 | November 30th (W) | Presenting WA13 | - | -
27 | 15 | December 2nd (F) | Presenting WA13 | - | -
Readings, etc. to be completed before class

1. Read the Syllabus
2. Apply for an HPC account (https://accounts.hpc.lsu.edu/login_request.php)
3. Software Carpentry Unix shell tutorials (https://v4.software-carpentry.org/shell/index.html) - Introduction, Files and Directories, Creating and Deleting
4. Review the HPCJumpstart.pdf and the HPC@LSU website (http://www.hpc.lsu.edu), familiarizing yourself with Accounts and Allocations, the LSU HPC Usage Policy, the User Guide for SuperMike II, and the Computational Biology tools available.
5. Haddock and Dunn, Chapter 4 (Optional)
6. Haddock and Dunn, Chapter 20 (Optional)
7. Software Carpentry Unix shell tutorials (https://v4.software-carpentry.org/shell/index.html) - Pipes and Filters through remaining tutorials.
8. Software Carpentry regex tutorials (https://v4.software-carpentry.org/regexp/index.html) - all.
9. Haddock and Dunn, Chapter 2 (Optional)
10. Haddock and Dunn, Chapter 3 (Optional)
11. Haddock and Dunn, Chapter 5 (Optional)
12. DeSalle and Rosenfeld, Chapter 4 (Optional)
13. Web research on the BLAST suite.
14. Eddy, S. R. (2004). What is dynamic programming? Nature Biotechnology, 22(7), 909–910. (Optional)
15. DeSalle and Rosenfeld, Chapter 5 (Optional)
16. Wikipedia intro to MSA: https://en.wikipedia.org/wiki/Multiple_sequence_alignment
17. Edgar, R. C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5), 1792–1797.
18. Castresana, J. (2000). Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Molecular Biology and Evolution, 17(4), 540–552.
19. DeSalle and Rosenfeld, Chapter 6 (Optional)
20. Wikipedia intro to phylogenetic trees: https://en.wikipedia.org/wiki/Phylogenetic_tree
21. Slides from Dr. Jonathan Eisen
22. Price, M. N., Dehal, P. S., & Arkin, A. P. (2010). FastTree 2 - approximately maximum-likelihood trees for large alignments. PLoS ONE, 5(3), e9490.
23. Stamatakis, A. (2006). RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics, 22(21), 2688–2690.
24. DeSalle and Rosenfeld, Chapter 8 (Optional)
25. Metzker, M. L. (2010). Sequencing technologies - the next generation. Nature Reviews Genetics, 11(1), 31–46.
26. Evolution of DNA Sequencing Methods, talk by Jonathan Eisen: https://www.youtube.com/watch?v=s9UbA7VyISQ
27. Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14(9), 755.
28. Eddy, S. R. (2011). Accelerated Profile HMM Searches. PLOS Computational Biology, 7(10), e1002195.
29. Tritt, A., Eisen, J. A., Facciotti, M. T., & Darling, A. E. (2012). An Integrated Pipeline for de Novo Assembly of Microbial Genomes. PLoS ONE, 7(9), e42304.
30. Coil, D., Jospin, G., & Darling, A. E. (2015). A5-miseq: an updated pipeline to assemble microbial genomes from Illumina MiSeq data. Bioinformatics, 31(4), 587–589.
31. Edwards, D. J., & Holt, K. E. (2013). Beginner's guide to comparative bacterial genome analysis using next-generation sequence data. Microbial Informatics and Experimentation, 3(1), 2.
32. Overbeek, R., Olson, R., Pusch, G. D., Olsen, G. J., Davis, J. J., Disz, T., et al. (2014). The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Research, 42(Database issue), D206–14.
33. Goodrich, J. K., Di Rienzi, S. C., Poole, A. C., Koren, O., Walters, W. A., Caporaso, J. G., et al. (2014). Conducting a Microbiome Study. Cell, 158(2), 250–262.
34. Seekatz, A. M., Aas, J., Gessert, C. E., Rubin, T. A., Saman, D. M., Bakken, J. S., & Young, V. B. (2014). Recovery of the Gut Microbiome following Fecal Microbiota Transplantation. mBio, 5(3), e00893–14–e00893–14.
35. Schloss, P. D., Westcott, S. L., Ryabin, T., Hall, J. R., Hartmann, M., Hollister, E. B., et al. (2009). Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities. Applied and Environmental Microbiology, 75(23), 7537–7541.
36. Mothur MiSeq SOP
37. Kozich, J. J., Westcott, S. L., Baxter, N. T., Highlander, S. K., & Schloss, P. D. (2013). Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Applied and Environmental Microbiology, 79(17), 5112–20.
38. Tettelin, H., Masignani, V., Cieslewicz, M. J., Donati, C., Medini, D., Ward, N. L., et al. (2005). Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proceedings of the National Academy of Sciences, 102(39), 13950–13955. doi:10.1073/pnas.0506758102
39. Grote, J., Thrash, J. C., Huggett, M. J., Landry, Z. C., Carini, P., Giovannoni, S. J., & Rappé, M. S. (2012). Streamlining and Core Genome Conservation among Highly Divergent Members of the SAR11 Clade. mBio, 3(5), e00252–12.
40. Contreras-Moreira, B., & Vinuesa, P. (2013). GET_HOMOLOGUES, a versatile software package for scalable and robust microbial pangenome analysis. Applied and Environmental Microbiology, 79(24), 7696–7701. doi:10.1128/AEM.02411-13
Staying Organized

Part of any good computational biology workflow is keeping your inputs and outputs organized, and all your processes and the contents of each file and directory documented. This not only allows someone else to understand and reproduce your work, but prevents you from forgetting the valuable steps you took to produce your work as well. It’s a horrible feeling to enter a directory a year after working on a project and not remember the contents of the files or how they were created. Throughout the semester, we will utilize a common core set of organizational procedures to facilitate keeping organized. Each week will have a separate directory in your home directory where you will store inputs and outputs, and possibly include subdirectories. You will document the contents of each file and subdirectory in a README document, including one for each directory. Finally, for each assignment you will create a brief summary report, described below. All three of these documents will be instrumental in your grade. (You may find that when you branch out on your own, a different system may suit you. Regardless, it is important to leave a transparent trail of all your work so that it can be recreated at any point in the future. The point here is simply to enforce good practices in computational biology, and we have to pick one system in advance.)

Reports

Each week you will be completing a series of tasks using a tool or set of tools. As part of your assignments you need to include a short written report with the elements below. If the report includes only text, create it with a text editor (e.g. nano) and save as ~/<working directory>/report.txt. If it includes graphics, save as a .docx or .pdf file, and upload to ~/<working directory>/report.docx (or .pdf).
1. Name(s), date
2. General (1-line) summary of objective(s) and purpose(s)
3. Working directory
4. Programs used, including basic scripts, and relevant reference(s)
5. Commands, inputs, outputs and results/evaluation of output for NEW operations.*
   • Organize in sections according to the rubric in outline form.
   • Include specific file names.
   • For batch jobs, indicate the important command(s) and the name of the PBS script.
   • For repeating tasks, only detail the first instance, then indicate that this was repeated and note variation in input/output file names. Similarly, output only has to be shown for a first instance.
   • *For operations you are repeating from previous assignments (blastp, muscle, etc.), you may simply reference a previous report for your workflow, but you need to be specific enough that one could find the correct command and repeat it.
6. Personal reflection. What did you learn in addition to using the assigned tool(s)? What would you do differently? What are you still confused about?
Weekly Assignments (WA)

WA01. Learning the command line and HPC tutorial (40 pts)
a. Path homework (relative vs. absolute). Create a ~/week1/paths.txt file on SuperMike-II with the following, one per line:
   i. A relative path to your home directory on Mike
   ii. The absolute path to your home directory on Mike
   iii. A relative path to your work directory
   iv. The absolute path to your work directory
   v. A relative path to another student’s home directory
   vi. The absolute path to another student’s home directory
b. Logging. Create a tab-delimited ~/week1/report.txt file that includes the unique commands you have used (history is very helpful here). Make sure to include what you have learned about the HPC@LSU and what you are still confused about. Create a tab-delimited ~/week1/README file that lists each of the files in each directory, followed by a tab and a brief description of its contents.
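To make the README layout concrete, here is one way the "file name, tab, description" format could be produced. This is a sketch only, written in Python rather than with the shell tools covered in class; the function name `readme_lines`, the placeholder text, and the example descriptions are all made up for illustration.

```python
import os


def readme_lines(root, descriptions):
    """Return one 'relative/path<TAB>description' line per file under root.

    descriptions maps a relative path to a short description you wrote;
    files without an entry get a placeholder so nothing is silently skipped.
    """
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            lines.append(rel + "\t" + descriptions.get(rel, "TODO: add description"))
    return lines


if __name__ == "__main__":
    # Hypothetical usage: describe the files in the current directory.
    notes = {"paths.txt": "relative and absolute paths for WA01"}
    for line in readme_lines(".", notes):
        print(line)
```

Redirecting the printed lines to a file named README would give a tab-delimited listing that you can then fill in by hand.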
WA02. Building on our Linux skills and incorporating shell text editors (60 pts)
a. In groups of two (number-assigned), research and teach the class about one of the following (a good starting point will be the “cheat sheets” and their respective web locations):
   i. head, tail
   ii. wc
   iii. grep
   iv. > vs. >>
   v. sort
   vi. uniq
   vii. sed
   viii. piping
   You will need to explain the tool/concept, what it can be used for or where it is used, and provide an example using a basic fasta file. Each group will have 4 minutes. Perform the whole presentation on the command line (i.e., don’t create powerpoints for this). We will start on Wednesday and continue into Friday if necessary.
b. Create a set of five piped commands that utilize any three of the Linux tools (filters) you learned last week to manipulate a fasta file. Document these commands, their purpose, and the input and output in your report. Use nano to create the text while logged into SuperMike-II.
c. Logging. Create a tab-delimited ~/week2/report.txt file that includes the unique commands you have used (history is very helpful here). Create a tab-delimited ~/week2/README file that contains each of the files in each directory, a tab over, and a brief description of contents.
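When you build piped commands around a fasta file, it helps to have an independent check on the answer. The sketch below is a Python illustration, not one of the assigned Linux tools: it counts sequences and residues per record, which is the same information you would get from counting header lines (those starting with ">") and the characters on the lines that follow. The function name `fasta_lengths` and the example records are assumptions for illustration.

```python
def fasta_lengths(lines):
    """Return {sequence_id: residue_count} for fasta-formatted lines.

    A record starts at a line beginning with '>'; the id is the first
    whitespace-delimited word of that header. Sequence lines may wrap.
    """
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


if __name__ == "__main__":
    example = [">seq1 hypothetical protein", "MKT", "LLV", ">seq2", "MA"]
    result = fasta_lengths(example)
    print(len(result), "sequences")
    for seq_id, n in result.items():
        print(seq_id + "\t" + str(n))
```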
WA03. Bash scripts, downloading, and Linux practice (70 pts)
Create two separate bash scripts using the tools we’ve covered thus far (or others that you know about), run them, and document their input/output in a ~/week3/README file:
a. A bash script you run in your working directory
b. A PBS submission script, submitted via qsub
Download the genome sequences for a group of closely related (same Family) microorganisms from GenBank and practice manipulating fasta files with basic Linux commands.
c. Compile the names, GenBank entries, and phylogeny in a tab-delimited file for five organisms.
d. Download protein fastas, GenBank files, nucleotide fastas, and scaffold fastas for the organisms you identified (total = 20 files).
e. Using piping, Linux commands, fastaToTab, tabToFasta, and genbank_to_fasta.py, complete the following:
   i. Convert your genbank files to .fasta
   ii. Compare the number of genes in the .faa files you downloaded with the converted files
   iii. Split your .faa files into fasta files of 100 genes each
   iv. Create a single file with all the gene annotations from all your genomes
f. Logging. README and report files.

WA04. Learning to execute the three basic aspects of the blast suite: making a database, searching sequences against a database, and querying the database for additional information using alternative search input. (80 pts)
a. Make a database from your genome scaffolds using makeblastdb.
b. Execute a protein blast against a translated nucleotide database with tblastn using 100 aa sequences from a given genome against another student’s scaffold database (with different organisms than yours).
c. Perform an assessment of blast efficiency using blastp against IMGv4, following the 10, 100, 1000 rule. You will need to split protein fasta sequences into subsets of 10, 100, and 1000 protein sequences and run blastp with these subsets against the database using 1, 2, 4, and 16 processors (12 total blastp job submissions). Using your standard out information, create two graphs of the performance, one with number of sequences vs. time for a given set of processors, the other with the amount of time per sequence search vs. number of processors. Produce a written summary of your results to accompany your graphs.
d. Using the information from your 10, 100, 1000 assessment, execute a threaded BLAST search of the hypothetical proteins in your five genome dataset against the IMGv4 database with the ideal number of processors and time requested.
e. Collect the sequences for the top 100 hits to one of these proteins using blastdbcmd. This will require you to use a set of piped Linux commands in conjunction with blastdbcmd, which, among other things, accepts sequence accession numbers as input and outputs a variety of information, including the sequence data in fasta format.
f. README and report files.

WA05. Produce and visualize threaded multiple-sequence alignments, edit for poorly curated sites, and evaluate variance between two different alignment programs. (70 pts)
a. In-class research on fasta vs. phylip formatted alignments
   i. Description of the difference between the two
   ii. List of three tools/sites that do conversion
b. In-class research on alignment viewers - find three
c. Pick three different protein sequences in any of your genomes, including RecA, and get the top 20 hits from the IMGv4 database. For each gene, place the query and hit sequences into a single .faa file (21 total sequences for each of the 3 protein fastas).
d. Align each file with both MUSCLE and CLUSTAL. Visualize the alignments with graphical software and compare by eye. Describe how the alignment for a given gene differs between programs.
e. Edit with Gblocks using the settings from Sassera et al. 2011 and note the alignment variation. You may need to convert your alignments from phylip to fasta format first.
f. README and report files.

WA06. Execute phylogenetic inferences using three different programs for 2 of the genes from your genomes and 60 top hits to IMGv4. (70 pts)
a. Constructing a tree on paper
b. Identify a ribosomal protein with >100 amino acids, and one other gene in your genome that has to do with central metabolism, pathogenicity, or respiration. Perform a blast of their protein sequences against IMGv4, and collect the top 50 hits (initial sequence included). You will also need to pick 2 outgroups from blast hits with considerably less identity to your query sequence than the top 50 hits.
c. Perform MUSCLE alignments and cull with Gblocks using the settings from Sassera et al. 2011.
d. Execute phylogenetic analysis with PBS submissions to the cluster using ClustalW (to create a tree this time, not an alignment), FastTree2, and RAxML. The latter will need to be run in a threaded format with 16 processors. Some of these will need input data in phylip format. Use 1000 bootstraps for ClustalW and RAxML.
e. Using a tree-viewer, output and compare the topology and node confidence between the different trees for a given gene. Be sure to root your tree on your outgroup sequence.
   Compile a short summary with tree graphics describing these variables and add to your report.
f. README and report files.

WA07. Take Home Essay (there will be an in-class exercise worth 10 pts). Find a “microbiome” study in the primary literature, and identify an important organism in the system. Create a maximum two-page proposal for sequencing of this organism’s genome (50 pts). In your proposal you must include:
a. Your motivation for sequencing this particular strain. What makes it important? Why should we care about this organism? Include ecological data that demonstrate where/when this organism is found. What will sequencing this organism help you to understand about the system it’s in (think about things like physiology, population genetics, etc.)?
b. Phylogenetic context. Where in the tree of life does this organism sit? What are its closest relatives? Are any of these already sequenced?
c. Sequencing parameters. What technology would you like to use? Why? How much DNA will you need, how many lanes/runs/etc. will you use, and how much coverage do you expect to get?
d. All information needs to be completely referenced with primary literature, except in formulating how much coverage you will get for a given technology. This can come from websites, but must be cited nonetheless. No references, no credit for entire assignment.
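For the coverage question in part c, a common back-of-the-envelope estimate is average coverage = (number of reads × read length) / genome size. The sketch below simply encodes that arithmetic; the function names and the example numbers (one million 250 bp reads, a 5 Mbp genome) are hypothetical and are not tied to any particular sequencing platform, so substitute cited values for the technology you choose.

```python
import math


def expected_coverage(read_count, read_length_bp, genome_size_bp):
    """Average fold coverage = total sequenced bases / genome size."""
    return read_count * read_length_bp / genome_size_bp


def reads_needed(target_coverage, read_length_bp, genome_size_bp):
    """Smallest whole number of reads that reaches the target coverage."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


if __name__ == "__main__":
    # Hypothetical example: 1,000,000 reads of 250 bp on a 5 Mbp genome.
    print(expected_coverage(1_000_000, 250, 5_000_000))  # 50.0
    # Reads required for 100x coverage of the same genome.
    print(reads_needed(100, 250, 5_000_000))  # 2000000
```

This is an average only; real coverage varies along the genome, which is worth acknowledging in your proposal.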
WA08. Creating, searching, and scanning HMMs for several of your hypothetical proteins using HMMER3. (70 pts)
a. HMM scavenger hunt
b. Create HMMs for three of the hypothetical proteins in your various genomes that are ≥100 amino acids, using the top 30 blastp hits to IMG as the foundation for your multiple alignments
c. hmmsearch one of these HMMs against RefSeq. Compare your results to blastp searches against RefSeq, using only the top 15 hits from each search.
d. As a class, combine all of your HMMs into a single file and create a HMM database to search against using hmmscan. One person must host this database in their BIOL4800 directory.
e. hmmscan a new hypothetical protein sequence (you will have to identify a different hypothetical from one of your genomes) against this database, and note the best hit.
f. README and report files. Be sure to include which models your proteins match best. Note which group created that model, and what sequences were used to create it.
WA09. Execute a genome assembly using the A5 pipeline and analyze the completed assembly using assemstats2.py (70 pts). You will work in groups of two to complete this assignment, and you will need to consult two other groups to compare your assemblies. The completed assignment will include a copy of the output files from the A5 assembly, a table of your genome statistics, two tables comparing your genome stats to two other groups, a brief half-page write-up, and a completed submission to RAST.
a. In-class text assembly with your group
   i. http://ivory.idyll.org/blog/the-assembly-exercise.html (Titus Brown)
b. Reflective writing
   i. What worked and didn’t with your “genome” assembly?
   ii. If the “genome” had errors, how would you correct for them?
   iii. What would have happened if there were repeat segments in your genome?
c. Download the raw sequencing data for a microbial genome of choice in GenBank.
d. Complete an assembly of a microbial genome using A5.
e. Once assembled, examine the assembly stats created by A5.
   i. What do these statistics tell you about how well or poorly A5 assembled your genome sequences?
f. Evaluate your assembly compared to those of your classmates. Pick two others and create a table comparing all three assemblies.
   i. How do your genomes compare?
   ii. What made your genome have a “better” or “worse” assembly when compared to that of others?
   iii. This discussion should be roughly a half page long and address the above questions as well as the question in e.
g. Submit your assembly to RAST to be annotated.
h. README and report files (include the reflective writing done earlier).
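One statistic that commonly appears when comparing assemblies is N50: the contig length at which contigs of that length or longer contain at least half of the assembled bases. The sketch below shows the standard calculation on a made-up list of contig lengths. It is an illustration only; the exact statistics reported by A5 and assemstats2.py are not described in this syllabus, so check their output for the definitions they use.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list).

    Contigs are sorted from longest to shortest and summed until the
    running total reaches half of the assembly size; the length of the
    contig that crosses that threshold is the N50.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


if __name__ == "__main__":
    # Hypothetical assembly of five contigs (300 bases in total).
    contigs = [100, 80, 60, 40, 20]
    print("contigs:", len(contigs))
    print("total length:", sum(contigs))
    print("N50:", n50(contigs))  # 80, because 100 + 80 >= 150
```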
WA10. Obtain the annotated output from RAST, compare several subsystems with other strains, present a summary of the basic features of your assembly, and perform several rudimentary analyses between the genes from your assembly and those of your other genomes (70 pts). Working with your proposal/assembly group:
a. SEED viewer scavenger hunt.
b. Determine the four most closely related organisms with sequenced genomes in the SEED database to the genome you’ve assembled. Hint, you’re going to want to identify conserved genes that can be used to look for other organisms.
c. Put together a table comparing your assembled contig data with four other closely related genomes, including the following information for all of them: Organism name, Isolation source, Genome size (Mbp), number of contigs, GC content %, Total number of genes, Number of protein coding genes, Number of tRNAs
d. Compare, for your assembled genome and the closest relative, the gene presence/absence profile for the following subsystems: glycolysis and gluconeogenesis, flagellar motility, and multi-drug efflux pumps.
e. Pick three proteins from your assembly, including one ribosomal protein, blast these proteins against RefSeq and construct phylogenies for each gene with at least 15 members, including an outgroup. This will result in three total trees. You may use whatever alignment and tree-building algorithm you wish, making sure to cull with Gblocks. Output the trees and summarize how the topologies are similar and/or different from each other. Also note whether or not the genomes of your other four organisms are present in the RefSeq results, and whether or not they are still the closest neighbors to your genome.
f. README and report files.

WA11. Learning to measure and estimate microbial diversity in preparation for using Mothur. (50 pts)
a. In-class short reflective essay on the definition of OTUs and how they are used in microbial ecology - explain from memory. Then take 10 minutes to do web research on primary literature. Re-define OTUs, citing your references.
b. Mark, release, recapture/rarefaction/relative abundance worksheet. Questions and writings posed in the worksheet need to be part of the report, along with final tables and graphs.
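Two of the quantities named in part b reduce to simple arithmetic. A minimal sketch follows, assuming the worksheet uses the classic Lincoln-Petersen mark-recapture estimator (N ≈ M × C / R, where M individuals are marked, C are caught in the second sample, and R of those are recaptured marks) and defines relative abundance as each taxon's count divided by the total. The worksheet itself is not reproduced here, so treat both as assumptions; the function names and numbers are made up.

```python
def lincoln_petersen(marked, caught, recaptured):
    """Estimate population size as marked * caught / recaptured."""
    if recaptured == 0:
        raise ValueError("no recaptures: the estimate is undefined")
    return marked * caught / recaptured


def relative_abundance(counts):
    """Convert {taxon: count} into {taxon: fraction of the total}."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations")
    return {taxon: n / total for taxon, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical numbers: 50 marked, 40 caught later, 10 of them marked.
    print(lincoln_petersen(50, 40, 10))  # 200.0
    print(relative_abundance({"OTU1": 2, "OTU2": 6}))
```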
WA12. OTU analysis of LSU Mikereauxbiome data using Mothur. (90 pts)
You will be working in groups of two, analyzing data from four samples in one of several different sample types. Most of your workflow will come from following the Mothur MiSeq SOP, but there will be some steps that are modified and/or left out from the SOP. We will identify some of these in class.
a. Create a workflow for your analysis. Include as many specific commands as possible, and annotate these with their purpose.
b. Obtain the sequence data for the samples with which you will be working, being sure to include both forward and reverse reads. Perform a complete Mothur run, through OTU clustering and taxonomic assignment.
c. Complete the following analyses of your data:
   i. Rarefy your data
   ii. Chao1 richness and inverse Simpson diversity
   iii. Rarefaction curves of OTUs vs. sampling effort
   iv. Table of # seqs, coverage, # OTUs, Inverse Simpson
   v. Relative abundance heatmap with Jaccard index
   vi. Venn diagram of shared OTUs, including predicted # of overlapping OTUs
   vii. Create four rank abundance curves - one each for the top 20 OTUs in each of your samples. Compare your top 10 taxa with those from the other groups.

WA13. Complete core and pan-genome analysis of closely related genomes with GetHomologues, reporting the various additional outputs. Incorporate additional material from some of the other tools you’ve dealt with thus far. Format for final presentation to the class. You may work individually or in groups. You will need to cd to the get_homologues directory in /project/jcthrash/tools/, complete your work, and then move your output to your home directory. ONLY ONE RUN CAN BE COMPLETED AT A TIME, so if you want to use this option, you must coordinate with your classmates. Another option is to install the get_homologues pipeline directly in your home directory. (100 pts)
a. GetHomologues scavenger hunt
b. Pick a genome from one taxon in the top 5 OTUs of your Mothur analysis, plus six additional closely related strains (same genus), and perform two clustering runs using get_homologues, one with COGtriangles, one with OrthoMCL
c. Create Venn diagram output showing the intersection of these clusters
d. Using the intersecting clusters only, create a dendrogram showing relationships among your taxa with gene presence-absence information
e. Using the intersecting clusters, create core and pan-genome extrapolation curves like those in Tettelin et al. 2005, Figs 2, 3.
f. Create a comparison of the following:
   i. The dendrogram created in d, above
   ii. Phylogenetic trees based on the amino acid sequences from RecA, a ribosomal protein, and a DNA polymerase
g. Integrate this material with microbial ecology data from your Mothur analysis. What is the relative abundance of your organisms in the datasets you examined? Where do they sit on the rank-abundance curves?
h. Identify at least three candidate pathways present in your organism that can help explain why it is dominant in your ecological data. Describe how the gene content of these pathways is different or similar to the other 4 organisms with which you are comparing it.
i. Create a polished presentation of this information, with no more than one slide per element, with brief summaries for each section, that is no more than 15 minutes long.
j. Prepare one question for each group based on their presentation.
k. README and report files.
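As a reminder of what "core" and "pan" mean before you run the real pipeline: the core genome is the set of gene clusters present in every genome, and the pan-genome is the set present in at least one. The toy sketch below computes both, plus the running core and pan sizes as genomes are added, from made-up presence lists. It is not how get_homologues works internally, and the names `core_and_pan`, `core_pan_curve`, and the strain and cluster labels are hypothetical.

```python
def core_and_pan(genomes):
    """Return (core, pan) sets for {genome_name: iterable of cluster ids}."""
    sets = [set(clusters) for clusters in genomes.values()]
    if not sets:
        return set(), set()
    core = set.intersection(*sets)  # clusters found in every genome
    pan = set.union(*sets)          # clusters found in any genome
    return core, pan


def core_pan_curve(genomes):
    """Core and pan sizes after adding each genome in the order given."""
    curve = []
    core, pan = None, set()
    for name, clusters in genomes.items():
        clusters = set(clusters)
        core = clusters if core is None else core & clusters
        pan = pan | clusters
        curve.append((name, len(core), len(pan)))
    return curve


if __name__ == "__main__":
    # Hypothetical presence data for three strains.
    toy = {
        "strainA": ["g1", "g2", "g3"],
        "strainB": ["g1", "g2", "g4"],
        "strainC": ["g1", "g5"],
    }
    core, pan = core_and_pan(toy)
    print(sorted(core), sorted(pan))
    for row in core_pan_curve(toy):
        print(row)
```

The curves in Tettelin et al. 2005 are built from many orderings of genome addition plus a fitted extrapolation; this sketch shows a single ordering only.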
Additional Resources

1. Learning regex: http://www.regexr.com
2. Linux cheat sheet: http://peoplesofttutorial.com/learn-basic-linux-commands-using-linux-cheat-sheet/
3. Rosalind programming training: http://rosalind.info/problems/locations/
4. Software Carpentry training: http://software-carpentry.org/index.html
5. Elements of Bioinformatics: http://elements.eaglegenomics.com