Read Mapping and Variant Calling
Whole Genome Resequencing
• Sequencing multiple individuals from the same species
• Reference genome is already available
• Discover variations in the genomes between and within samples
  – mutations
  – insertions
  – deletions
  – rearrangements
  – copy number changes
How long do the reads need to be?
For the human genome, estimates are:
• 25-mers = 80% unique coverage
• 43-mers = 90% unique coverage
But longer is better for certain applications.
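To make "unique coverage" concrete, here is a toy Python sketch. It is not the method behind the estimates above. It reports the fraction of k-mer start positions in a sequence whose k-mer occurs exactly once. The function name `unique_kmer_fraction` and the tiny example sequence are illustrative assumptions, and the reverse strand is ignored.

```python
from collections import Counter

def unique_kmer_fraction(genome, k):
    """Fraction of k-mer start positions whose k-mer occurs exactly once."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique / len(kmers)

# Hypothetical sequence, far too short to be biologically realistic.
toy_genome = "ACGTACGA"
for k in (2, 4):
    print(k, round(unique_kmer_fraction(toy_genome, k), 2))
```

In this toy case the longer k-mers are all unique while the shorter ones are not, which is the same trend the human-genome estimates describe.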
Workflow
Quality Assessment
Trimming
Quality Assessment
Mapping to a Reference
Visualization
Calling variants
Assessing functional impact of variants
Submit to SRA
But we've already covered alignment with BLAST.
• Alignment methods must make tradeoffs between speed and accuracy
• Depending on the application, you may want to make different tradeoffs
• Global vs local
• Different types of alignment objectives lead to different categories of aligners
Short Read Mappers
• BLAST is much faster than the original algorithms (Smith-Waterman, for example)
• Still too slow for the amount of data produced by NGS technology
• Resequencing usually involves comparing very similar sequences (>90% identity in residues) to a reference genome
• Software that leverages this high percent identity can be faster
• Can generally utilize a global strategy
Short Read Mappers
• Orders of magnitude faster than BLAST
  – several tens of millions of reads mapped per hour per CPU
• Only matches of 95% identity or greater are found
• Indels are particularly problematic
• Usually only output the best hit or the set of hits that are all equivalently good
  – The point is usually to find the origin in the reference genome
  – Other genomic regions of lower identity are not considered useful
Uniqueness
• Some reads can be mapped uniquely to the reference
• Some can't -> multiple alignment reads
• Multiple alignment reads are difficult to apply to downstream applications
  – RNA-Seq: which gene do they represent?
  – SNP: which location carries the polymorphism?
• How to deal with multiple alignment reads?
  – Throw them away
  – This introduces bias and ignores real genomic regions that may be biologically important
Clever Tricks to find "Best" Alignments
• Use the quality values
  – Penalize mismatches at high quality bases more than mismatches at low quality bases
• Paired End information
  – If one read does not map uniquely, but the other does, use that information to place the non-unique one
  – Need to know your insert size
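The quality-value trick can be illustrated with a toy scoring function. The sketch below sums the Phred scores of the mismatching bases, so a mismatch at a confident base call costs more than one at a doubtful call. It is an illustration under assumptions (ungapped comparison, Phred+33 quality encoding, the hypothetical name `quality_aware_penalty`), not the scoring scheme of any particular mapper.

```python
def phred_from_char(ch, offset=33):
    """Convert one quality character to its Phred score (Phred+33 assumed)."""
    return ord(ch) - offset

def quality_aware_penalty(read, ref, qual):
    """Sum the Phred scores at mismatching positions of an ungapped alignment."""
    return sum(
        phred_from_char(q)
        for base, ref_base, q in zip(read, ref, qual)
        if base != ref_base
    )

ref = "ACGTACGT"
read = "ACGAACGA"   # mismatches at 0-based positions 3 and 7
qual = "IIIIIII#"   # 'I' is Phred 40 (confident), '#' is Phred 2 (doubtful)

print(quality_aware_penalty(read, ref, qual))
```

Here the mismatch at the confident base contributes 40 to the penalty and the mismatch at the doubtful base only 2, so a candidate location whose only disagreements fall on low-quality bases would be preferred.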
Decisions for the end user
• How many mismatches are allowed for a read to be considered mapped?
  – Heterozygosity between sample and reference
  – Incomplete/low quality reference
• How many matches to report?
  – Does your downstream analysis need/want to include multiple matches?
Explore the documentation and parameters for your software of choice. Is it doing what you think it's doing?
Lots of choices
65 software packages are listed on the Wikipedia page for alignment software, among them:
Blat, BWA, Eland, GSNAP, Maq, RMAP, Stampy, SHRiMP
Prize for best named: VelociMapper
How to choose?
• Memory efficient
• Good documentation
• Responsive mailing list or help forum
• Maintained and updated when bugs are found
  – BWA
What mappers have in common: Indexing Strategies
• Usually, the first step is to transform part of the data into a more suitable form for fast searching
• Indexing: creating a glossary or lookup table
• Without indexing you would have to scan everything each time you did a search
• Consider web search engines
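The lookup-table idea can be sketched with a simple k-mer hash index. This is a deliberately simplified illustration: the function names and toy reference are assumptions, only exact matches are found, and real mappers use more compact structures (BWA, for example, builds an index based on the Burrows-Wheeler transform).

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the 0-based positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def find_read(reference, index, read, k):
    """Use the first k bases of the read as a seed, then verify each candidate."""
    candidates = index.get(read[:k], [])
    return [p for p in candidates if reference[p:p + len(read)] == read]

reference = "GATTACAGATTACC"   # hypothetical toy reference
index = build_kmer_index(reference, 4)

print(index["TTAC"])                              # seed hits
print(find_read(reference, index, "TTACC", 4))    # verified placements
```

The index is built once and reused for every read, so each lookup touches only the candidate positions instead of scanning the whole reference.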
BWA
• Has three algorithms: BWA-backtrack, BWA-SW and BWA-MEM
• Individual chromosomes cannot be longer than 2 GB
• Output in SAM format
• BWA-backtrack
  – Meant for sequences of up to 100 bp in length
  – Meant for reads with less than 2% error (can do some end trimming)
• BWA-MEM
  – Use for any sequences greater than 70 bp, up to 1 Mb
  – Much more widely used now that sequencers output longer reads
  – Will work with reads with
    • 2% error for 100 bp
    • 3% error for 200 bp
    • etc.
  – Has split read support
    • structural variations, gene fusion or reference misassembly
• BWA-MEM is more accurate and faster
http://bio-bwa.sourceforge.net/
SAM AND BAM FORMAT
SAM Format
• SAM = Sequence Alignment/Map format
• Tab-delimited plain text
• Stores large nucleotide sequence alignments
  – Alignment of every read
  – Including gaps and SNPs
  – Pairing of reads
  – Can record more than one alignment location in the genome
  – Stores quality values
  – Stores information about duplication
• Flexible
• Useful for operations on very large sequences
• Extremely detailed documentation
  – https://samtools.github.io/hts-specs/SAMv1.pdf
• Manipulations are primarily done with the software samtools
• Originally designed to store mapping information, now used as a primary storage format for unmapped sequences as well
SAM-Header
• Structure
  – Optional header at top of file
  – Alignment information
[Figure: file layout, with header lines beginning with @ at the top, followed by the read alignment lines]
SAM-Header
• Header lines start with the @ symbol
• Always at top of file
• Contain lots of information about what was mapped, what it was mapped to, and how (metadata)
  – the version information for the SAM/BAM file
  – whether or not and how the file is sorted
  – information about the reference sequences
  – any processing that was used to generate the various reads in the file
  – software version
Simple Header
@HD VN:1.5 SO:coordinate
@SQ SN:ref LN:45
@HD = first line
  VN = version of SAM format
  SO = sort order (this file is sorted by coordinates)
@SQ = reference sequence
  SN = sequence reference name
  LN = sequence reference length
Decipher header information: https://samtools.github.io/hts-specs/SAMv1.pdf
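Because each header line is a record type followed by tab-separated TAG:VALUE pairs, it can be pulled apart in a few lines. The sketch below is a minimal illustration, with the hypothetical name `parse_header_line`. It assumes tab-delimited input and does not handle free-text @CO comment lines.

```python
def parse_header_line(line):
    """Split one SAM header line into its record type and a dict of TAG -> VALUE."""
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0]
    # Split on the first ':' only, since values may themselves contain colons.
    tags = dict(field.split(":", 1) for field in fields[1:])
    return record_type, tags

header_text = "@HD\tVN:1.5\tSO:coordinate\n@SQ\tSN:ref\tLN:45"
parsed = [parse_header_line(line) for line in header_text.split("\n")]

for record_type, tags in parsed:
    print(record_type, tags)
```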
Alignment Line
• Below the headers are the alignment records
• Tab-delimited fields
  1    QNAME  Query template/pair NAME
  2    FLAG   bitwise FLAG
  3    RNAME  Reference sequence NAME
  4    POS    1-based leftmost POSition/coordinate of clipped sequence
  5    MAPQ   MAPping Quality (Phred-scaled)
  6    CIGAR  extended CIGAR string
  7    MRNM   Mate Reference sequence NaMe (`=' if same as RNAME); called RNEXT in the current SAM spec
  8    MPOS   1-based Mate POSition; called PNEXT in the current SAM spec
  9    TLEN   inferred Template LENgth (insert size)
  10   SEQ    query SEQuence on the same strand as the reference
  11   QUAL   query QUALity (ASCII-33 gives the Phred base quality)
  12+  OPT    variable OPTional fields in the format TAG:VTYPE:VALUE
Let's unpack this alignment line, taken from a SAM file:
SRR030257.2000020 83 gi|254160123|ref|NC_012967.1| 3295752 60 36M = 3295706 -82 TGCTGGCGGCGATATCGTCCGTGGTTCCGATCTGGT ?%<91<?>>??AAAAAAAAAAAAAAAAAAAAAAAAA XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36
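As a companion to the field-by-field walk-through, the following sketch splits the same alignment line into named fields. It is a minimal illustration rather than a substitute for a real SAM parser. The names `parse_alignment`, `FIELD_NAMES` and `INT_FIELDS` are assumptions made for this example, and the line is rebuilt with tabs because SAM fields are tab-delimited.

```python
FIELD_NAMES = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "MRNM", "MPOS", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "TLEN"}

def parse_alignment(line):
    """Split one SAM alignment line into the 11 mandatory fields plus optional tags."""
    cols = line.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(FIELD_NAMES, cols[:11]):
        record[name] = int(value) if name in INT_FIELDS else value
    record["OPT"] = cols[11:]   # optional TAG:TYPE:VALUE fields, left as strings
    return record

line = "\t".join([
    "SRR030257.2000020", "83", "gi|254160123|ref|NC_012967.1|",
    "3295752", "60", "36M", "=", "3295706", "-82",
    "TGCTGGCGGCGATATCGTCCGTGGTTCCGATCTGGT",
    "?%<91<?>>??AAAAAAAAAAAAAAAAAAAAAAAAA",
    "XT:A:U", "NM:i:0", "SM:i:37", "AM:i:37", "X0:i:1",
    "X1:i:0", "XM:i:0", "XO:i:0", "XG:i:0", "MD:Z:36",
])

record = parse_alignment(line)
for name in FIELD_NAMES:
    print(name, record[name])
```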
SAM Field 1
Query name
SRR030257.2000020
SAM Field 2
Flag
83 = 64 + 16 + 2 + 1
  1 = Read is paired
  2 = Read mapped in proper pair
  16 = Read mapped to reverse strand
  64 = First in pair
Look up a SAM flag: https://broadinstitute.github.io/picard/explain-flags.html
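The flag is a sum of powers of two, so it can be decoded by testing each bit. The sketch below does this for the flag values defined in the SAM specification. The function name `decode_flag` and the wording of the descriptions are choices made for this example.

```python
FLAG_BITS = {
    1: "read paired",
    2: "read mapped in proper pair",
    4: "read unmapped",
    8: "mate unmapped",
    16: "read mapped to reverse strand",
    32: "mate mapped to reverse strand",
    64: "first in pair",
    128: "second in pair",
    256: "not primary alignment",
    512: "read fails platform/vendor quality checks",
    1024: "PCR or optical duplicate",
    2048: "supplementary alignment",
}

def decode_flag(flag):
    """Return the description of every bit that is set in a SAM flag."""
    return [description for bit, description in FLAG_BITS.items() if flag & bit]

for description in decode_flag(83):
    print(description)
```

For 83 this lists the same four properties as above: paired, proper pair, reverse strand, first in pair.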
SAM Field 3
Reference sequence name (useful especially if you have multiple chromosomes)
gi|254160123|ref|NC_012967.1|
SAM Field 4
Position: 1-based leftmost mapping POSition of the first matching base
3295752
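SAM Field 5
Mapping Quality
• equals −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer
• 99.9% probability that the mapping is correct = mapping quality of 30
• 0% probability that the mapping is correct = mapping quality of 0
• value 255 indicates that the mapping quality is not available
60 = 0.0001% probability (1 in 1,000,000) that the mapping is wrong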
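The Phred-scaled definition can be checked with a couple of one-line conversions. This is a small sketch of the formula above. The function names are illustrative, and the special value 255 (quality not available) is not handled.

```python
import math

def mapq_to_error_prob(mapq):
    """Probability that the mapping position is wrong, given a mapping quality."""
    return 10 ** (-mapq / 10)

def error_prob_to_mapq(p_wrong):
    """Mapping quality for a given probability of a wrong mapping position."""
    return round(-10 * math.log10(p_wrong))

for mapq in (0, 30, 60):
    print(mapq, mapq_to_error_prob(mapq))
```

A mapping quality of 30 corresponds to a 1 in 1,000 chance that the position is wrong, and 60 to a 1 in 1,000,000 chance.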
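SAM Field 6
CIGAR String
36M = 36 aligned bases with no insertions, deletions or clipping (here a perfect match; M on its own covers both matches and mismatches)
8S28M = 8 nucleotides soft-clipped, 28 aligned
More CIGAR
Position: 5
CIGAR: 3M1I3M1D5M (3 aligned, 1 inserted, 3 aligned, 1 deleted, 5 aligned)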
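A CIGAR string is a run-length encoding of alignment operations, which makes it easy to work out how many reference and read bases an alignment covers. The sketch below is a minimal illustration with hypothetical function names. It uses the operation letters from the SAM specification, where M, D, N, = and X consume reference bases and M, I, S, = and X consume read bases.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REF = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_RE.findall(cigar)]

def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REF)

def query_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)

pos, cigar = 5, "3M1I3M1D5M"
last_ref_pos = pos + reference_span(cigar) - 1   # 1-based, inclusive
print(parse_cigar(cigar), reference_span(cigar), query_length(cigar), last_ref_pos)
```

For the example at position 5, the insertion adds a read base without advancing along the reference and the deletion does the opposite, so both the read length and the reference span come out to 12 and the alignment ends at reference position 16.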
SAM Field 7
Reference sequence for the next read in the template
• For a forward read, this is the reference where the reverse read maps
• For a reverse read, this is the reference where the forward read maps
= means the reverse read maps on the same reference
SAM Field 8
Position where the next read maps
3295706
(This read, the first in its pair, mapped at 3295752. Remember that it mapped to the reverse strand.)
SAM Field 9
Observed template length
-82 (negative because this read is the rightmost of the pair)
SAM Field 10
Sequence of the read
TGCTGGCGGCGATATCGTCCGTGGTTCCGATCTGGT
SAM Field 11
Quality of the read
?%<91<?>>??AAAAAAAAAAAAAAAAAAAAAAAAA
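Each character of the quality string encodes one base quality: subtracting 33 from its ASCII code gives the Phred score, as noted in the field list. The sketch below decodes the quality string of the example read. The function names are illustrative.

```python
def phred_scores(qual, offset=33):
    """Decode a SAM quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in qual]

def error_probabilities(qual, offset=33):
    """Per-base probability of a wrong base call, derived from the Phred scores."""
    return [10 ** (-q / 10) for q in phred_scores(qual, offset)]

qual = "?%<91<?>>??AAAAAAAAAAAAAAAAAAAAAAAAA"
scores = phred_scores(qual)
print(scores[:5], min(scores), max(scores))
```

The second base (`%`) has the lowest score in this read, while the run of `A` characters corresponds to a Phred score of 32.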
SAM Field 12
Optional fields: MORE information in TAG:TYPE:VALUE format
XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36
Any tag starting with X is specified by the user or by the mapping software, and is not part of the SAM spec.
Decipher the last fields
  XT:A:U   One of Unique/Repeat/N/Mate-sw
  NM:i:0   Edit distance to the reference
  SM:i:37  Template-independent mapping quality
  AM:i:37  Smallest template-independent mapping quality of other segments
  X0:i:1   Number of best hits
  X1:i:0   Number of suboptimal hits found by BWA
  XM:i:0   Number of mismatches in the alignment
  XO:i:0   Number of gap opens
  XG:i:0   Number of gap extensions
  MD:Z:36  String for mismatching positions
SAM
The first three lines of a SAM file:
@SQ SN:gi|254160123|ref|NC_012967.1| LN:4629812
@PG ID:bwa PN:bwa VN:0.7.12-r1039 CL:/lustre/projects/rnaseq_ws/apps/bwa-0.7.12/bwa sampe ../raw_data/NC_012967.1.fasta aln_SRR030257_1.sai aln_SRR030257_2.sai ../raw_data/SRR030257_1.fastq ../raw_data/SRR030257_2.fastq
SRR030257.1 99 gi|254160123|ref|NC_012967.1| 950180 60 36M = 950295 151 TTACACTCCTGTTAATCCATACAGCAACAGTATTGG AAA;A;AA?A?AAAAA?;?A?1A;;????566)=*1 XT:A:U NM:i:1 SM:i:37 AM:i:25 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:32C3
BAM Format
• Sister format to SAM
• BAM: binary version of SAM
• Compressed with BGZF (Blocked GNU Zip Format), a variant of GZIP (GNU ZIP)
• Files are bigger than plain GZIP files, but they are much faster for random access
• Can index and then look up information embedded in the file without decompressing the whole file
• Up to 75% smaller in size than the equivalent SAM file
• Not readable by people
http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html
samtools
• View: print alignments to your screen or convert between formats. Can reduce files to a particular region only
• Tview: text alignment viewer, nifty for quick viewing of files
• Mpileup: generates a special mpileup-formatted file needed for calling variants
• Sort: sort the alignments (by default, sorts by coordinate). Sorting is needed for most downstream applications.
• Merge: concatenate BAM files together, while maintaining sorting order
• Index: index a BAM or CRAM file, needed for most downstream applications
• Idxstats: get some stats about your BAM file
• Faidx: index a FASTA file, needed for most downstream applications using a BAM file
• Bam2fq: convert a BAM file to a FASTQ file
• More…
Always the format: samtools subcommand -flags -more flags
SNP CALLING
SNP Calling
• SNP = single nucleotide polymorphism
• Indel = insertion/deletion
• Examine the alignments of reads and look for differences between the reference and the individual(s) being sequenced
Feng et al. 2014. Novel Single Nucleotide Polymorphisms of the Insulin-Like Growth Factor-I Gene and Their Associations with Growth Traits in Common Carp (Cyprinus carpio L.)
SNP Calling
• Difficulties:
  – Cloning process (PCR) artifacts
  – Errors in the sequencing reads
  – Incorrect mapping
  – Errors in the reference genome
• Heng Li, developer of BWA, looked at major sources of errors in variant calls*:
  – erroneous realignment in low-complexity regions
  – the incomplete reference genome with respect to the sample
* Li 2014. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics.
Indel
It can be difficult to decide where the best alignment actually is
* https://bioinf.comav.upv.es/courses/sequence_analysis/snp_calling.html
SNP Calling Software
• samtools mpileup -> bcftools
• GATK (Genome Analysis Toolkit)
• FreeBayes
• Take into account the quality value of the individual base and the quality value for the alignment of the read
• Can utilize information about previously called/confirmed SNPs
Steps
• Preprocessing read alignments*
  – Picard MarkDuplicates
  – GATK base quality score recalibration
  – GATK realignment around indels
• Call the SNPs with software of your choice
• Filter the SNPs
  – Goal is to remove false positives
* May or may not be worth preprocessing: https://bcbio.wordpress.com/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/
Filtering
• Depth
  – How many reads do you need to sample to confidently call a SNP?
  – >20X = very good
  – 5-20X = okay
  – <5X = missing many heterozygous calls
  – Low depth may be overcome using statistics; see the overview of genotype likelihoods here:
  – http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3593722/
• Low quality: use the quality estimates provided by the software
• Number of alleles (monomorphic vs biallelic)
• High coverage: can indicate a duplicated region in the genome
• Highly variable region: can also indicate a duplicated region
• Low complexity regions
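The depth and quality criteria above can be combined into a simple hard filter. The sketch below is only an illustration: the variant records are made up, the minimum depth of 5 mirrors the "<5X" guideline, and the maximum depth of 100 and minimum quality of 20 are arbitrary placeholder thresholds that would need tuning for a real dataset.

```python
def passes_filters(variant, min_depth=5, max_depth=100, min_qual=20.0):
    """Keep a call only if its depth is within range and its quality is high enough."""
    depth_ok = min_depth <= variant["depth"] <= max_depth
    return depth_ok and variant["qual"] >= min_qual

# Hypothetical variant calls, for illustration only.
calls = [
    {"pos": 101, "depth": 3, "qual": 45.0},    # too shallow
    {"pos": 250, "depth": 24, "qual": 60.0},   # passes
    {"pos": 377, "depth": 400, "qual": 99.0},  # suspiciously deep, possible duplicated region
    {"pos": 512, "depth": 18, "qual": 8.0},    # low quality
]

kept = [variant["pos"] for variant in calls if passes_filters(variant)]
print(kept)
```

The upper depth bound is there because, as noted above, unusually high coverage can indicate a duplicated region rather than a well-supported call.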
Your mileage may vary
• Different decisions about how to align reads and identify variants can yield very different results
• 5 pipelines
• "SNV concordance between five Illumina pipelines across all 15 exomes was 57.4%, while 0.5 to 5.1% of variants were called as unique to each pipeline. Indel concordance was only 26.8% between three indel-calling pipelines"
bcftools
• BCFtools is a set of utilities that manipulate variant calls in the Variant Call Format (VCF) and its binary counterpart BCF.
• Ack, more formats!!!
VCF
• Variant Call Format
• Official spec: http://samtools.github.io/hts-specs/VCFv4.2.pdf
• Header lines starting with # signs
• Lines with variants afterward
[Figure: file layout, with header lines beginning with # at the top, followed by the data lines]
VCF (cont)
• Tab-delimited fields
  – Chromosome
  – Location
  – ID (if this is a named variant)
  – Reference sequence
  – Alternate sequence
  – Quality score
  – Filter (whether or not it passed filtering: PASS, or the names of the filters it failed)
  – Info: lots of additional info such as CIGAR string, depth across different samples, etc.
  – Columns follow for each genotype if available
• BCF is the compressed binary format
  – SAM <-> BAM
  – VCF <-> BCF
VCF Example
  #CHROM   20
  POS      14370
  ID       rs6054257
  REF      G
  ALT      A
  QUAL     29
  FILTER   PASS
  INFO     NS=3;DP=14;AF=0.5;DB;H2
  FORMAT   GT:GQ:DP:HQ
  NA00001  0|0:48:1:51,51
  NA00002  1|0:48:8:51,51
  NA00003  1/1:43:5:.,.
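In an actual VCF file these names and values sit on two tab-delimited lines: the column header and the variant record. The sketch below pairs them up and unpacks the INFO and per-sample genotype fields. It is a minimal illustration with hypothetical function names, not a full VCF parser. Values are left as strings, and INFO keys without a value (such as DB) are stored as True.

```python
def parse_info(info):
    """Turn 'NS=3;DP=14;DB' into {'NS': '3', 'DP': '14', 'DB': True}."""
    parsed = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            parsed[key] = value
        else:
            parsed[item] = True   # flag-style key with no value
    return parsed

def parse_vcf_record(header_line, record_line):
    """Pair the column names with one record and unpack INFO and the sample columns."""
    columns = header_line.lstrip("#").rstrip("\n").split("\t")
    values = record_line.rstrip("\n").split("\t")
    record = dict(zip(columns[:9], values[:9]))
    record["INFO"] = parse_info(record["INFO"])
    format_keys = record["FORMAT"].split(":")
    record["samples"] = {
        sample: dict(zip(format_keys, value.split(":")))
        for sample, value in zip(columns[9:], values[9:])
    }
    return record

header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
                    "INFO", "FORMAT", "NA00001", "NA00002", "NA00003"])
line = "\t".join(["20", "14370", "rs6054257", "G", "A", "29", "PASS",
                  "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
                  "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,."])

record = parse_vcf_record(header, line)
print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
print(record["INFO"])
print(record["samples"]["NA00003"])
```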
bcftools
• bcftools calls SNPs from an mpileup file and manipulates VCF/BCF formatted files
• Call: SNP/indel calling
• Filter: filter the variants by quality
• Merge: merge VCF files together
• Consensus: after resequencing an individual, generate the reference sequence for that individual
• Stats: statistics
• Convert: convert between formats
Overview
Samtools (alignment data)
• Works with SAM/BAM files
• Produces mpileup
Bcftools (variant data)
• Call SNPs from mpileup
• Works with VCF/BCF files
IGV
• High-performance visualization tool for interactive exploration of large, integrated genomic datasets
• Free but requires one-time registration
http://www.broadinstitute.org/igv/
• Visualizes lots of data types
  – NGS read alignments
  – Gene annotation
  – Variants
  – Etc.