

[Figure: pipeline flowchart. Inputs: Config. File, Reference Genome, Num. of Processors, Raw Sequence Reads. Boxes: Preprocessing, Read Splitting, MrsFAST Alignment, BWA Alignment, Count Reads, Disc. Reads, Read Depth (RD), Read Pair (RP), Split Read (SR), SNP and INDEL (GATK), P.A.M. (Merged CNVs), Filtering, Annotation, Summary of Variants.]

Open Source Tools to Exploit DNA Sequence Data from Livestock Species

(1) Sequencing (2) Prepare Files (3) Divide and Conquer Alignment (4) Variant Calling (5) Merger and Filtration (6) Results

(1) DNA Resequencing

Cattle Breed            Number of Animals   Millions of base pairs (Mbp)
Bos indicus Brahman     7                   116.98
Bos indicus Gir         6                   146.87
Bos indicus Nelore      8                   170.44
Bos taurus Angus        17                  1027.09
Bos taurus Holstein     32                  718.95
Bos taurus Jersey       8                   144.6
Bos taurus Limousin     7                   155.47
Bos taurus Romagnola    4                   92.57

The Problem:

DNA sequence data allows researchers to identify variations within the genomes of individuals of a species that could impact important production traits; however, tools designed to identify these variations are often designed specifically for mouse or human studies.

Proposed Solution:

We are designing an open-source pipeline for the analysis of DNA sequencing studies involving non-model organisms and agricultural species. The pipeline will be free for academic and research use and the source code for all written programs will be released with the final package. Additionally, the results of the analysis will be provided to the end-user in easily accessible forms, such as Excel spreadsheets, text summaries and PDF file plots.

In order to make management of each step more flexible, we have implemented a configuration file control system for running the pipeline. The configuration file contains the locations of sequence files ("fastq" files) and the locations of installed pipeline programs and scripts. In order to speed up computation time, the pipeline can make use of multiple processor cores and high performance computing architectures.
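
As a rough illustration of configuration-file control, the sketch below parses a small settings file with Python's standard configparser module. The section names, key names and file paths are hypothetical and are not taken from the pipeline itself; the poster does not describe the actual file layout.

```python
import configparser

# Hypothetical config layout: every section, key and path here is illustrative only.
EXAMPLE_CONFIG = """
[inputs]
fastq_files = /data/run1/animal1_1.fastq, /data/run1/animal1_2.fastq
reference_genome = /data/ref/reference.fa

[programs]
bwa = /usr/local/bin/bwa
mrsfast = /usr/local/bin/mrsfast

[resources]
num_processors = 8
"""


def load_pipeline_config(text):
    """Parse config text into a plain dictionary of pipeline settings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # Sequence file locations are stored as one comma-separated value.
    fastq = [p.strip() for p in parser["inputs"]["fastq_files"].split(",") if p.strip()]
    return {
        "fastq_files": fastq,
        "reference_genome": parser["inputs"]["reference_genome"],
        "programs": dict(parser["programs"]),
        "num_processors": parser.getint("resources", "num_processors"),
    }


config = load_pipeline_config(EXAMPLE_CONFIG)
print(config["num_processors"], len(config["fastq_files"]))
```

Keeping program locations and input locations in one file means a run can be moved to another machine by editing paths rather than code.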

Sequence files ("fastq" files) are first split into smaller chunks and aligned to the reference genome using as many processors as allowed. Alignment is a process that compares raw sequence data back to the original reference genome and identifies the part of the genome that the sequence likely originated from. Because this pipeline splits up the sequence reads into smaller, manageable chunks, it is able to use multiple processors to speed up alignment (this strategy is called "Divide and Conquer").
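
The divide-and-conquer idea can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the pipeline's code: it assumes each fastq record spans exactly four lines, and `align_chunk` is a hypothetical placeholder that only counts reads where a real run would launch an aligner such as BWA or MrsFAST on the chunk.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def split_fastq_records(lines, reads_per_chunk):
    """Yield lists of fastq lines, each holding up to reads_per_chunk records.

    A fastq record is assumed to span exactly four lines.
    """
    iterator = iter(lines)
    lines_per_chunk = 4 * reads_per_chunk
    while True:
        chunk = list(islice(iterator, lines_per_chunk))
        if not chunk:
            return
        yield chunk


def align_chunk(chunk):
    """Hypothetical stand-in for an aligner call; it only counts the reads."""
    return len(chunk) // 4


def divide_and_conquer(lines, reads_per_chunk, num_processors):
    """Split reads into chunks, process the chunks in parallel, collect results."""
    chunks = split_fastq_records(lines, reads_per_chunk)
    with ThreadPoolExecutor(max_workers=num_processors) as pool:
        # map() returns results in the same order as the input chunks.
        return list(pool.map(align_chunk, chunks))


# Five toy reads, split into chunks of two reads each.
toy = []
for i in range(5):
    toy += [f"@read{i}", "ACGT", "+", "IIII"]
print(divide_and_conquer(toy, reads_per_chunk=2, num_processors=2))
```

Threads are used here because the heavy work would normally happen in external aligner processes; a process pool would be the alternative if the per-chunk work were done in Python itself.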

Aligned sequence data is then processed using several programs in order to identify Copy Number Variants (CNVs), Single Nucleotide Polymorphisms (SNPs) and Insertion/Deletions (INDELs). CNVs are called using three algorithms used by the Human 1000 Genomes Project [2] (SR, RP and RD) whereas SNPs and INDELs are called by the Broad Institute's Genome Analysis Toolkit (GATK).
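
To give a feel for the read-depth (RD) idea, the sketch below scales per-window read counts by the median count and flags windows that look like gains or losses. It is a simplified illustration only: the baseline of two copies, the cutoffs and the toy counts are assumptions, and it omits corrections (for example for GC content) that real RD callers apply. It is not the algorithm used by the pipeline.

```python
from statistics import median


def estimate_copy_number(window_counts, baseline_copies=2):
    """Scale each window's read count by the median count across all windows."""
    typical = median(window_counts)
    if typical == 0:
        raise ValueError("median read count is zero; cannot normalize")
    return [baseline_copies * count / typical for count in window_counts]


def flag_windows(copy_numbers, gain=3.0, loss=1.0):
    """Label windows 'gain', 'loss' or 'normal' using hypothetical cutoffs."""
    labels = []
    for cn in copy_numbers:
        if cn >= gain:
            labels.append("gain")
        elif cn <= loss:
            labels.append("loss")
        else:
            labels.append("normal")
    return labels


# Toy read counts for six consecutive windows.
counts = [100, 98, 205, 102, 45, 100]
cn = estimate_copy_number(counts)
print(flag_windows(cn))
```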

The results of the three CNV calling algorithms are merged using an algorithm we call "Precision Aware Merger" (P.A.M.) that takes into account the unique advantages and disadvantages of each method. The SNP and INDEL calls are further filtered in order to improve quality.
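
The poster does not describe how P.A.M. works internally, so the sketch below is not that algorithm. It shows only the simplest possible form of merging: overlapping calls on the same chromosome are joined into one region, and the methods that support each region are recorded. The tuple layout and the example coordinates are hypothetical.

```python
def merge_cnv_calls(calls):
    """Merge overlapping calls on the same chromosome.

    calls: iterable of (chrom, start, end, method) tuples.
    Returns a list of (chrom, start, end, sorted_methods) tuples.
    """
    merged = []
    for chrom, start, end, method in sorted(calls):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previous region: extend it and add the supporting method.
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end), last[3] | {method})
        else:
            merged.append((chrom, start, end, {method}))
    return [(c, s, e, sorted(m)) for c, s, e, m in merged]


calls = [
    ("chr1", 100, 500, "RD"),
    ("chr1", 450, 900, "RP"),
    ("chr1", 2000, 2500, "SR"),
    ("chr2", 100, 300, "RD"),
    ("chr2", 150, 280, "SR"),
]
for region in merge_cnv_calls(calls):
    print(region)
```

Recording which methods support each merged region is one way a later filtering step could weigh calls by how many independent signals agree.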

CNV, SNP and INDEL calls are then cross referenced against information known about the genome in order to see if they impact genes or other functional genetic regions (this is called "Annotation"). Afterwards, the results are tabulated and presented to the user in spreadsheets and plots.
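
Annotation of this kind comes down to an interval-overlap test between variant coordinates and known feature coordinates. The following is a minimal sketch assuming closed intervals and small in-memory lists; the variant names, gene names and coordinates are made up for the example, and a real annotation step would read them from files.

```python
def annotate_variants(variants, genes):
    """Return {variant_name: [names of genes overlapping that variant]}.

    variants and genes are lists of (name, chrom, start, end) tuples
    describing closed intervals.
    """
    hits = {}
    for v_name, v_chrom, v_start, v_end in variants:
        hits[v_name] = [
            g_name
            for g_name, g_chrom, g_start, g_end in genes
            # Two closed intervals overlap when each starts before the other ends.
            if g_chrom == v_chrom and v_start <= g_end and g_start <= v_end
        ]
    return hits


# Hypothetical coordinates, for illustration only.
genes = [
    ("geneA", "chr1", 1000, 2000),
    ("geneB", "chr1", 5000, 6000),
    ("geneC", "chr2", 1000, 2000),
]
variants = [
    ("cnv1", "chr1", 1500, 5200),
    ("snp1", "chr2", 3000, 3000),
]
print(annotate_variants(variants, genes))
```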

(2) Prepare Files

(3) Divide and Conquer Alignment

(4) Variant Calling

(5) Merger and Filtration

(6) Final Results

Derek M. Bickhart (1), Jana L. Hutchison (1), Lingyang Xu (2,3), Jiuzhou Song (3), George E. Liu (2)

(1) USDA, ARS, Animal Improvement Program Laboratory, BARC
(2) USDA, ARS, Bovine Functional Genomics Laboratory, BARC
(3) University of Maryland, Department of Animal and Avian Sciences, College Park, MD

A Test Run: 100 Sequenced Bulls

The Human Genome Project, which was the first major DNA sequencing project for a large Eukaryotic genome, lasted 10 years and cost approximately $3 billion. The cost to sequence new genomes has dropped precipitously since that project concluded, and current price estimates hover around $6,000 to $8,000 per individual sequenced. Much of this price reduction is due to new methods of sequencing and improved instrumentation (e.g. the HiSeq 2000, pictured right inset).

Our pipeline was designed for the explicit purpose of analyzing DNA sequence data from organisms that already have a finished reference genome project. As of the writing of this poster, this includes 6 agricultural animal species and 17 agricultural plant species, with many more to follow shortly. Researchers resequence individuals of a species that already has a completed reference genome in order to identify variations in DNA sequence within each individual's genome. These variations can be linked to disease susceptibility (powdery mildew susceptibility in Arabidopsis (a)) or productive traits (white coat color in Sheep (b)).

Analysis

Copy Number Indicates Expansion of Immune System Genes

[Figure: copy number per gene, with a group of genes labeled "Antimicrobial peptides". Axis label: Copy Number.]

Table: Our dataset was composed of eight different breeds of cattle from two cattle subspecies: Bos indicus (or "zebu") and Bos taurus. Each individual was sequenced to at least 6X coverage, and several animals were sequenced to greater depth in order to provide a contrast.

Figure: A summary of all variants currently detected using our pipeline shows a clear difference between indicus and taurus subspecies in the number of SNPs identified. Insertions and deletions vary far more among breeds than subspecies and may reflect smaller phenotypic differences.

Summary of Variants Detected by the Pipeline

Figure: Association of genetic features with variants (Annotation) has revealed several interesting biological features [1]. Antimicrobial peptides, which serve as a first line of defense for the immune system, are often duplicated. This may indicate an evolutionary "arms-race" between the animal and environment as bacteria evolve to resist antimicrobial compounds over time.

More Information


References

[1] Bickhart, et al. 2012. Copy Number Variation of Individual Cattle Genomes using Next-Generation Sequencing. Genome Research. 22: 778-790
[2] Mills, et al. 2011. Mapping Copy Number Variation by Population-Scale Genome Sequencing. Nature. 470: 59-65

Project Source Code (Alpha Stage)
sourceforge.net/projects/cosvard/

Contact information
Derek Bickhart
USDA ARS AIPL
derek.bickhart@ars.usda.gov
Phone: (301) 504 - 8592

This work was supported by NRI/AFRI grant no. 2011-67015-30183 from the USDA NIFA