Aligning reads to a genome
Analysis of Next-Generation Sequencing Data
Luce Skrabanek
Applied Bioinformatics Core
Slides at https://bit.ly/2T3sjRg [1]
11 February, 2020
[1] https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/schedule_2020/
Luce Skrabanek (ABC, WCM) Aligning reads to a genome 11 February, 2020
1 Why do we align?
2 What do we align to?
3 How do we align?
4 Output files
5 References
Why do we align?
What do we learn?
[Reinert et al., 2015, Pfeifer, 2017]
What do we align to?
What do we need?
Reference sequence: the nucleotide sequence of the chromosomes of a species [2]
Optional annotations: the gene/transcript models for a genome; includes the coordinates of the exons of a transcript on a reference genome, optionally the strand, gene name, coding portion of the transcript.
[2] See discussion on reference genomes in [Ballouz et al., 2019].
Sources for reference genomes
Ensembl: http://www.ensembl.org
UCSC: https://genome.ucsc.edu/
NCBI: https://www.ncbi.nlm.nih.gov/
Gencode: https://www.gencodegenes.org/
Organism-specific databases (e.g., http://toxodb.org/toxo/)
Always note the source and version of your reference genome.
Look out for chromosome naming conventions (e.g., UCSC uses "chr1" where Ensembl uses "1").
Annotations
The chromosome names must match those in your reference genome; annotations must correspond to the same reference genome build as your reference genome fasta file.
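A quick way to catch a naming mismatch is to compare the sequence names in the fasta headers with the first column of the annotation file. The sketch below assumes plain-text input held as lists of lines; the function names and the toy records are hypothetical and only illustrate the check.

```python
def fasta_sequence_names(fasta_lines):
    """Collect sequence names: the first word after '>' on each header line."""
    return {line[1:].split()[0] for line in fasta_lines if line.startswith(">")}


def gtf_sequence_names(gtf_lines):
    """Collect the first column of every non-empty, non-comment annotation line."""
    return {
        line.split("\t")[0]
        for line in gtf_lines
        if line.strip() and not line.startswith("#")
    }


def unmatched_names(fasta_lines, gtf_lines):
    """Names used in the annotation that are absent from the reference."""
    return sorted(gtf_sequence_names(gtf_lines) - fasta_sequence_names(fasta_lines))


# Toy data (made up): a UCSC-style fasta paired with an Ensembl-style GTF.
fasta = [">chr1 toy sequence", "ACGTACGT", ">chrM", "GGCC"]
gtf = [
    '1\ttoy\texon\t1\t8\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'MT\ttoy\texon\t1\t4\t.\t+\t.\tgene_id "g2"; transcript_id "t2";',
]
print(unmatched_names(fasta, gtf))
```

An empty result means every annotated sequence name exists in the reference; it does not prove that both files come from the same genome build.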
Gene models can vary dramatically
Which annotation should you use?
[Jänes et al., 2015, Zhao and Zhang, 2015, Wu et al., 2013]
Storing annotation information
Annotation files represent genome coordinates and gene descriptions/names.
Multiple formats: GFF2, GFF3, GTF [3], BED, SAF...
[3] http://genome.ucsc.edu/FAQ/FAQformat#format4
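The formats differ in more than layout. GTF uses 1-based, inclusive coordinates, whereas BED uses 0-based, half-open intervals, so converting between them shifts the start by one. A minimal sketch, assuming a well-formed nine-column GTF line with simple quoted attributes; the function names and the example record are made up for illustration.

```python
def parse_gtf_line(line):
    """Split one GTF line into its nine tab-separated columns."""
    (seqname, source, feature, start, end,
     score, strand, frame, attributes) = line.rstrip("\n").split("\t")
    attrs = {}
    # Attributes look like: gene_id "G1"; transcript_id "T1";
    # (this simple split assumes no ';' inside quoted values)
    for item in attributes.split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition(" ")
            attrs[key] = value.strip('"')
    return {
        "seqname": seqname,
        "feature": feature,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "strand": strand,
        "attributes": attrs,
    }


def gtf_to_bed(record):
    """Convert to a 6-column BED tuple (0-based start, end exclusive)."""
    name = record["attributes"].get("gene_id", ".")
    return (record["seqname"], record["start"] - 1, record["end"],
            name, 0, record["strand"])


line = 'chr1\ttoy\texon\t11869\t12227\t.\t+\t.\tgene_id "G1"; transcript_id "T1";'
record = parse_gtf_line(line)
print(gtf_to_bed(record))
```

The feature length is `end - start + 1` in GTF coordinates and `end - start` in BED coordinates; both give the same number of bases.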
How do we align?
Aligners
Genomic aligners: BWA [Li and Durbin, 2009], Bowtie2
Splice-aware aligners: STAR [Dobin et al., 2013], TopHat, HISAT2
Pseudo-alignment: Salmon, kallisto, RSEM
Challenge: mapping millions of reads accurately and in a reasonable amount of time, despite complications from sequencing errors, genomic variation and repetitive elements.
Genomic aligner: BWA
BWA uses a canonical seed-and-extend paradigm. BWA is based on the Burrows-Wheeler Transform and uses the FM-index [4] to search for exact string matches.
The FM-index has a very small memory footprint.
[4] Full-text index in Minute space, or Ferragina and Manzini [Ferragina and Manzini, 2010]
FM-index backwards search
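The backward search can be sketched in a few lines. This is a toy illustration, not BWA's implementation: the suffix array is built naively, the full occurrence table is kept in memory (real indexes sample it to save space), and all function names are made up for this example. The pattern is matched from its last character to its first, and each step narrows a half-open range of suffix array rows.

```python
def build_fm_index(reference):
    """Build a toy FM-index: suffix array, C table and occurrence table."""
    text = reference + "$"  # '$' terminator sorts before A, C, G, T
    # Naive suffix array construction; fine for a toy example only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps around to '$').
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c]: number of characters in the text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ


def backward_search(pattern, sa, C, occ):
    """Return sorted 0-based positions of exact matches of pattern."""
    lo, hi = 0, len(sa)  # half-open range of suffix array rows
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []  # no suffix starts with the current pattern suffix
    return sorted(sa[lo:hi])


sa, C, occ = build_fm_index("ACGTACGA")
print(backward_search("ACG", sa, C, occ))
```

Each step costs two table lookups regardless of the reference length, which is why the search time depends on the read length rather than the genome size.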
BWA-MEM
BWA-MEM [Li, 2013] is the next generation in the BWA family, and is one of the few aligners that works well for both 70bp reads and long sequences up to a few megabases.
1 allows long gaps
2 the allowable error rate adjusts with sequence length
3 can report multiple non-overlapping local hits
As for BWA, it uses a canonical seed-and-extend paradigm, grouping seeds that are colinear and close to each other into a chain.
Each seed is extended using banded affine-gap-penalty dynamic programming, stopping when the difference between the best and the current extension score is above some threshold, which avoids extension through poorly aligned regions.
BWA-MEM keeps track of the best extension score reaching the end of the query sequence. If the difference between the best score reaching the end and the best local alignment score is below a threshold, the local alignment is rejected, even if it has a higher score, in favor of the end-to-end alignment.
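The stopping rule for extension can be illustrated with a much simpler model. The sketch below swaps the banded affine-gap dynamic programming for ungapped, base-by-base extension, so it is not BWA-MEM's algorithm; it only shows the idea of abandoning an extension once the running score falls too far below the best score seen. The function name, scoring values and threshold are illustrative assumptions.

```python
def extend_right(query, ref, q_start, r_start, match=1, mismatch=-4, drop=10):
    """Ungapped extension to the right of a seed.

    Returns (best_score, best_length): the highest score reached and the
    number of bases extended to reach it. Extension stops early once the
    current score is more than `drop` below the best score.
    """
    best = score = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and r_start + i < len(ref):
        score += match if query[q_start + i] == ref[r_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > drop:
            break  # poorly aligned region: give up on this extension
    return best, best_len


# Eight matching bases followed by a run of mismatches: extension stops early.
print(extend_right("ACGTACGTTTTTTTT", "ACGTACGTGGGGGGG", 0, 0))
# A single mismatch is tolerated because the score recovers afterwards.
print(extend_right("AAAAACAAAAA", "AAAAAGAAAAA", 0, 0))
```

A larger `drop` lets the extension pass through longer stretches of mismatches at the cost of more computation.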
Mapping to the transcriptome
1 Alignment of exon-exon spanning reads
2 Multiple isoforms
3 Identification of novel splice junctions
STAR uses an indexed suffix array [generated using both the genomic sequence and the sequence spanning known exon-exon boundaries (transcriptome)] to find MMPs (Maximal Mappable Prefixes, the longest possible perfect matches), identifies "anchor alignments", and stitches them together.
STAR can also identify novel junctions, if it finds enough reads as support. Users can define how many reads must span a novel junction, and how many bases must be covered on either side of the junction.
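The MMP search can be sketched as repeated longest-prefix lookups in a suffix array. This toy version is not STAR's implementation: the suffix array is built naively, only exact matches are handled, and the function names and the toy genome are made up. It shows how a read spanning an exon-exon junction splits into two seeds that map on either side of the intron.

```python
def build_suffix_array(genome):
    """Naive suffix array; adequate for a toy sequence only."""
    return sorted(range(len(genome)), key=lambda i: genome[i:])


def narrow(genome, sa, lo, hi, offset, ch):
    """Restrict rows [lo, hi) to suffixes whose base at `offset` equals ch."""
    def char_at(rank):
        pos = sa[rank] + offset
        return genome[pos] if pos < len(genome) else ""  # "" sorts first

    a, b = lo, hi
    while a < b:  # lower bound: first row with char >= ch
        mid = (a + b) // 2
        if char_at(mid) < ch:
            a = mid + 1
        else:
            b = mid
    new_lo, b = a, hi
    while a < b:  # upper bound: first row with char > ch
        mid = (a + b) // 2
        if char_at(mid) <= ch:
            a = mid + 1
        else:
            b = mid
    return new_lo, a


def maximal_mappable_prefix(read, genome, sa):
    """Length and genome positions of the longest exactly matching prefix."""
    lo, hi, length = 0, len(sa), 0
    while length < len(read):
        new_lo, new_hi = narrow(genome, sa, lo, hi, length, read[length])
        if new_lo == new_hi:
            break
        lo, hi, length = new_lo, new_hi, length + 1
    return length, (sorted(sa[lo:hi]) if length else [])


def seed_search(read, genome, sa):
    """Split a read into consecutive MMPs: (read_offset, length, positions)."""
    seeds, offset = [], 0
    while offset < len(read):
        length, positions = maximal_mappable_prefix(read[offset:], genome, sa)
        if length == 0:
            offset += 1  # skip a base that matches nowhere
            continue
        seeds.append((offset, length, positions))
        offset += length
    return seeds


exon1, intron, exon2 = "ACGTTGCA", "GTAAGTCCCCCCAG", "TTAGGCAT"
genome = exon1 + intron + exon2
read = exon1[-5:] + exon2[:5]  # read spanning the exon-exon junction
seeds = seed_search(read, genome, build_suffix_array(genome))
print(seeds)
```

The gap between the end of the first seed and the start of the second on the genome corresponds to the intron that the stitching step would bridge.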
Splice-aware aligner: STAR [Spliced Transcripts Alignment to a Reference]
Running STAR
STAR has many parameters (familiarize yourself with the manual)! See [Ballouz et al., 2018] for a discussion of how parameter selection affects mapping (e.g., handling of multi-mapped reads, intron sizes).
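A typical run has two steps: building the genome index, then aligning. The sketch below only assembles the two command lines and prints them; nothing is executed. The file paths, thread count and parameter values are placeholders chosen for illustration, and the helper function names are hypothetical. Check every option against the manual for your STAR version before use.

```python
import shlex


def star_index_command(genome_dir, fasta, gtf, read_length, threads=4):
    """Command line for building a STAR genome index (values are examples)."""
    return [
        "STAR", "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
        # The manual suggests read length minus 1 for the junction overhang.
        "--sjdbOverhang", str(read_length - 1),
        "--runThreadN", str(threads),
    ]


def star_align_command(genome_dir, fastq_files, prefix, threads=4,
                       max_multimap=10, intron_min=21, intron_max=1000000):
    """Command line for aligning reads (values are examples, not advice)."""
    return [
        "STAR", "--runMode", "alignReads",
        "--genomeDir", genome_dir,
        "--readFilesIn", *fastq_files,
        "--outFileNamePrefix", prefix,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFilterMultimapNmax", str(max_multimap),  # multi-mapped reads
        "--alignIntronMin", str(intron_min),           # intron size limits
        "--alignIntronMax", str(intron_max),
        "--runThreadN", str(threads),
    ]


# Hypothetical paths, for illustration only.
index_cmd = star_index_command("star_index", "genome.fa", "genes.gtf", 100)
align_cmd = star_align_command("star_index", ["r1.fastq", "r2.fastq"], "sample1.")
for cmd in (index_cmd, align_cmd):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the commands in a script, together with the reference and annotation versions, makes the mapping step reproducible.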
Output files
SAM files
Each line of the optional header section starts with @, and includes information such as chromosome names (SN) and their lengths (LN).
The vast majority of lines within a SAM file are compact representations of the read alignments, where each read is described by the 11 mandatory entries and a variable number of optional fields [Li et al., 2009].
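The layout described above is easy to see when a record is split on tabs. A minimal sketch using only the standard library; the field names follow the SAM specification, while the function names and the example lines are made up.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sq_header(line):
    """Return (sequence name, length) from an @SQ header line."""
    tags = dict(field.split(":", 1) for field in line.rstrip("\n").split("\t")[1:])
    return tags["SN"], int(tags["LN"])


def parse_alignment(line):
    """Split an alignment line into the 11 mandatory fields plus optional ones."""
    cols = line.rstrip("\n").split("\t")
    record = {
        name: (int(value) if name in INT_FIELDS else value)
        for name, value in zip(SAM_FIELDS, cols[:11])
    }
    record["optional"] = cols[11:]  # variable number of TAG:TYPE:VALUE fields
    return record


header = "@SQ\tSN:chr1\tLN:248956422"
alignment = "read1\t99\tchr1\t10468\t60\t8M\t=\t10600\t140\tACGTACGT\tFFFFFFFF\tNH:i:1"
print(parse_sq_header(header))
print(parse_alignment(alignment))
```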
SAM FLAG field
The FLAG field includes information about the mapping of the individual read. It is a bitwise flag, compactly storing answers to multiple binary Yes/No questions as a short series of bits where each of the single bits can be addressed separately.
See https://broadinstitute.github.io/picard/explain-flags.html to interpret bit flag values.
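Decoding a FLAG value by hand amounts to testing each bit in turn. The bit values below follow the SAM specification; the short labels and the function name are chosen for this sketch.

```python
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the labels of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


print(decode_flag(99))   # 99 = 1 + 2 + 32 + 64
print(decode_flag(147))  # 147 = 1 + 2 + 16 + 128
```

The values 99 and 147 are a common pairing for the two mates of a properly paired read.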
CIGAR [Concise Idiosyncratic Gapped Alignment Report string]
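A CIGAR string is a series of length/operation pairs. Some operations consume bases of the read, some consume bases of the reference, and some consume both, which is how a spliced read can span far more of the genome than its own length. A minimal sketch following the operation table of the SAM specification; the function names are made up.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")      # bases present in the read sequence
CONSUMES_REFERENCE = set("MDN=X")  # bases covered on the reference


def parse_cigar(cigar):
    """Return a list of (length, operation) pairs; '*' means unavailable."""
    if cigar == "*":
        return []
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def query_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


spliced = "50M1000N50M"  # 100 bp read split across a 1000 bp intron
print(parse_cigar(spliced), query_length(spliced), reference_span(spliced))
```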
SAM OPT field
The number of optional SAM/BAM fields, their value types and the information stored within them depend on the alignment program and can vary substantially.
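Whatever the aligner, every optional field has the same TAG:TYPE:VALUE shape, so a generic parser is short. The sketch below converts only the integer and float types and leaves the others as strings; the function names and example values are made up.

```python
def parse_tag(field):
    """Split one optional field into (tag, value), converting numeric types."""
    tag, type_code, value = field.split(":", 2)  # value may itself contain ':'
    if type_code == "i":
        value = int(value)
    elif type_code == "f":
        value = float(value)
    return tag, value


def parse_optional_fields(fields):
    """Turn a list of optional fields into a dictionary keyed by tag."""
    return dict(parse_tag(field) for field in fields)


tags = parse_optional_fields(["NH:i:1", "AS:i:98", "MD:Z:50"])
print(tags)
```

Which tags are present, and what they mean beyond the standard ones, should be looked up in the documentation of the aligner that wrote the file.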
Exploring SAM/BAM files
The most widely used tool to explore and manipulate SAM/BAM files is samtools.
There are many options to subset reads based on SAM fields such as chromosomal location, FLAG value, or mapping quality.
    samtools view <in.bam>
Use egrep to subset reads based on the optional tags.
Most downstream applications also require the BAM file to be indexed by reference sequence position, to allow the efficient retrieval of all reads aligning to a locus (the BAM file must be sorted by coordinate first).
    samtools index <in.bam>
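The filtering logic is simple to state: keep a read if all required FLAG bits are set, no excluded bit is set, and the mapping quality is high enough. The sketch below mirrors the semantics of the -f, -F and -q options of samtools view on plain SAM text lines, plus an exact match on an optional field in the spirit of the egrep approach. It is an illustration in pure Python, not a replacement for samtools; the function name and example reads are made up.

```python
def keep_read(sam_line, require_flags=0, exclude_flags=0, min_mapq=0, tag=None):
    """Decide whether an alignment line passes the given filters."""
    cols = sam_line.rstrip("\n").split("\t")
    flag, mapq = int(cols[1]), int(cols[4])
    if (flag & require_flags) != require_flags:
        return False  # like -f: every required bit must be set
    if flag & exclude_flags:
        return False  # like -F: none of the excluded bits may be set
    if mapq < min_mapq:
        return False  # like -q: minimum mapping quality
    if tag is not None and tag not in cols[11:]:
        return False  # exact match on an optional field, e.g. "NH:i:1"
    return True


reads = [
    "r1\t99\tchr1\t100\t60\t8M\t=\t200\t108\tACGTACGT\tFFFFFFFF\tNH:i:1",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\tFFFFFFFF",
    "r3\t355\tchr1\t300\t3\t8M\t=\t400\t108\tACGTACGT\tFFFFFFFF\tNH:i:2",
]
# Mapped, primary alignments with MAPQ >= 30 that map to a single location.
kept = [r.split("\t")[0] for r in reads
        if keep_read(r, exclude_flags=0x4 | 0x100, min_mapq=30, tag="NH:i:1")]
print(kept)
```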
References
Sara Ballouz, Alexander Dobin, Thomas R Gingeras, and Jesse Gillis. The fractured landscape of RNA-seq alignment: the default in our STARs. Nucleic Acids Research, 46(10):5125–5138, 05 2018. doi: 10.1093/nar/gky325. URL https://dx.doi.org/10.1093/nar/gky325.
Sara Ballouz, Alexander Dobin, and Jesse Gillis. Is it time to change the reference genome? bioRxiv, 2019. doi: 10.1101/533166. URL https://www.biorxiv.org/content/early/2019/01/29/533166.
Alexander Dobin, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics, 29(1):15–21, 2013. doi: 10.1093/bioinformatics/bts635.
Paolo Ferragina and Giovanni Manzini. Opportunistic Data Structures with Applications. Technical report, 2010.
Jürgen Jänes, Fengyuan Hu, Alexandra Lewin, and Ernest Turro. A comparative study of RNA-seq analysis strategies. Briefings in Bioinformatics, (January):1–9, 2015. ISSN 1467-5463. doi: 10.1093/bib/bbv007.
Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv e-prints, art. arXiv:1303.3997, March 2013.
Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754–1760, 05 2009. doi: 10.1093/bioinformatics/btp324.
Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16):2078–9, August 2009. ISSN 1367-4811. doi: 10.1093/bioinformatics/btp352.
S.P. Pfeifer. From next-generation resequencing reads to a high-quality variant data set. Heredity, 118(2):111–124, 2017. doi: 10.1038/hdy.2016.102.
Knut Reinert, Ben Langmead, David Weese, and Dirk J. Evers. Alignment of next-generation sequencing reads. Annual Review of Genomics and Human Genetics, 16:133–151, 8 2015. doi: 10.1146/annurev-genom-090413-025358.
Po-Yen Wu, John H. Phan, and May D. Wang. Assessing the impact of human genome annotation choice on RNA-seq expression estimates. BMC Bioinformatics, 14(11):S8, Nov 2013. doi: 10.1186/1471-2105-14-S11-S8.
Shanrong Zhao and Baohong Zhang. A comprehensive evaluation of Ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification. BMC Genomics, 16(1):97, Feb 2015. doi: 10.1186/s12864-015-1308-8.