Aligning reads to a genome
Analysis of Next-Generation Sequencing Data

Luce Skrabanek
Applied Bioinformatics Core (ABC), WCM

Slides at https://bit.ly/2T3sjRg (1)

11 February, 2020

(1) https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/schedule_2020/

1 Why do we align?

2 What do we align to?

3 How do we align?

4 Output files

5 References

Why do we align?

What do we learn?

[Reinert et al., 2015, Pfeifer, 2017]

What do we align to?

What do we need?

Reference sequence: the nucleotide sequence of the chromosomes of a species (2)

Optional annotations: the gene/transcript models for a genome; these include the coordinates of the exons of a transcript on a reference genome and, optionally, the strand, gene name, and coding portion of the transcript.

(2) See the discussion on reference genomes in [Ballouz et al., 2019].

Sources for reference genomes

Ensembl: http://www.ensembl.org
UCSC: https://genome.ucsc.edu/
NCBI: https://www.ncbi.nlm.nih.gov/
Gencode: https://www.gencodegenes.org/
Organism-specific databases (e.g., http://toxodb.org/toxo/)

Always note the source and version of your reference genome.
Look out for chromosome naming conventions.

Annotations

The chromosome names must match those in your reference genome; annotations must correspond to the same reference genome build as your reference genome FASTA file.
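A quick consistency check can catch mismatched naming conventions (for example `chr1` versus `1`) before any alignment is run. The sketch below is illustrative only: the function names are hypothetical, and it assumes a plain-text FASTA and a tab-delimited GTF/GFF-style annotation that have already been read into lists of lines.

```python
def fasta_sequence_names(fasta_lines):
    """Collect sequence names: the text after '>' up to the first whitespace."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            if header:
                names.add(header[0])
    return names


def annotation_sequence_names(annotation_lines):
    """Collect the first column of every non-comment, non-empty annotation line."""
    names = set()
    for line in annotation_lines:
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t", 1)[0])
    return names


def names_missing_from_reference(fasta_lines, annotation_lines):
    """Return annotation sequence names that do not occur in the FASTA file."""
    missing = annotation_sequence_names(annotation_lines) - fasta_sequence_names(fasta_lines)
    return sorted(missing)


# Toy input (made up for illustration): the annotation uses "1", the FASTA uses "chr1".
fasta = [">chr1 example", "ACGT", ">chr2", "GGCC"]
annotation = [
    "#!toy annotation",
    '1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g1";',
    'chr2\tsrc\texon\t1\t4\t.\t-\t.\tgene_id "g2";',
]
missing = names_missing_from_reference(fasta, annotation)
print(missing)
```

An empty result only shows that the names agree; it does not prove that both files come from the same genome build.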

Gene models can vary dramatically

Which annotation should you use?

[Jänes et al., 2015, Zhao and Zhang, 2015, Wu et al., 2013]

Storing annotation information

Represent genome coordinates and gene descriptions/names
Multiple formats: GFF2, GFF3, GTF (3), BED, SAF...

(3) http://genome.ucsc.edu/FAQ/FAQformat#format4
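To make the formats concrete, here is a minimal sketch that parses one GTF line (nine tab-separated columns) and converts its coordinates to the BED convention. The function names and the example record are hypothetical, and the attribute parser only handles quoted values.

```python
import re

GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]

# Matches attribute pairs of the form: key "value"
# Unquoted values (used by some providers for numeric attributes) are skipped.
_ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf_line(line):
    """Split one GTF line into a dict with integer coordinates and parsed attributes."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, found {len(cols)}")
    record = dict(zip(GTF_COLUMNS, cols))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["attribute"] = dict(_ATTRIBUTE.findall(record["attribute"]))
    return record


def gtf_to_bed_interval(record):
    """GTF coordinates are 1-based and inclusive; BED is 0-based and half-open."""
    return record["seqname"], record["start"] - 1, record["end"]


# Made-up record for illustration only.
line = 'chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "DEMO";'
rec = parse_gtf_line(line)
print(rec["feature"], rec["attribute"]["gene_name"], gtf_to_bed_interval(rec))
```

The off-by-one between the 1-based and 0-based conventions is a common source of errors when converting between annotation formats.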

How do we align?

Aligners

Genomic aligners: BWA [Li and Durbin, 2009], Bowtie2
Splice-aware aligners: STAR [Dobin et al., 2013], TopHat, HISAT2
Pseudo-alignment: Salmon, kallisto, RSEM

Challenge

Mapping millions of reads accurately and in a reasonable amount of time, despite complications from sequencing errors, genomic variation and repetitive elements.

Genomic aligner: BWA

BWA uses a canonical seed-and-extend paradigm. BWA is based on the Burrows-Wheeler Transform and uses the FM-index (4) to search for exact string matches.

The index has a very small memory footprint.

(4) Full-text Minute-space, or Ferragina and Manzini [Ferragina and Manzini, 2010]

FM-index backwards search
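The idea of backward search can be sketched in a few lines of Python. This is a toy illustration, not BWA's implementation: it builds the suffix array by naive sorting and stores a full occurrence table, whereas real tools use sampled, compressed structures. All function names are hypothetical.

```python
from collections import Counter


def bwt_and_suffix_array(text):
    """Return the Burrows-Wheeler transform and suffix array of text + '$'."""
    text += "$"  # sentinel that sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each sorted suffix
    return bwt, sa


def build_fm_index(bwt):
    """Return (first_col_start, occ) for the given BWT string."""
    counts = Counter(bwt)
    first_col_start = {}  # number of characters in the text smaller than c
    total = 0
    for c in sorted(counts):
        first_col_start[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return first_col_start, occ


def backward_search(pattern, first_col_start, occ, n):
    """Return the half-open suffix-array interval [lo, hi) of exact matches."""
    lo, hi = 0, n
    for c in reversed(pattern):  # process the pattern from its last character
        if c not in first_col_start:
            return 0, 0
        lo = first_col_start[c] + occ[c][lo]
        hi = first_col_start[c] + occ[c][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def find_all(pattern, text):
    """Return the sorted 0-based start positions of pattern in text."""
    bwt, sa = bwt_and_suffix_array(text)
    first_col_start, occ = build_fm_index(bwt)
    lo, hi = backward_search(pattern, first_col_start, occ, len(bwt))
    return sorted(sa[lo:hi])


print(find_all("TTA", "GATTACATTA"))
```

Each pattern character narrows the interval with two table lookups, so the search cost depends on the pattern length rather than on the genome length.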

BWA-MEM

BWA-MEM [Li, 2013] is the next generation in the BWA family, and is one of the few aligners that works well for both 70 bp reads and long sequences up to a few megabases.

1 allows long gaps
2 the allowable error rate adjusts with sequence length
3 can report multiple non-overlapping local hits

As for BWA, it uses a canonical seed-and-extend paradigm, grouping seeds that are colinear and close to each other as a chain.
Each seed is extended using banded affine-gap-penalty dynamic programming, stopping when the difference between the best and the current extension score is above some threshold, which avoids extension through poorly aligned regions.
It keeps track of the best extension score reaching the end of the query sequence. If the difference between the best score reaching the end and the best local alignment score is below a threshold, the local alignment will be rejected even if it has a higher score.
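The stopping rule for seed extension can be illustrated with a much-simplified sketch. Note the assumptions: this version is ungapped and extends only to the right, whereas BWA-MEM uses banded affine-gap dynamic programming; the function name, scoring values and `max_drop` threshold are hypothetical.

```python
def extend_right(query, ref, q_start, r_start, match=1, mismatch=-4, max_drop=10):
    """Extend an exact seed to the right without gaps.

    Extension stops once the running score falls more than max_drop below
    the best score seen so far. Returns (best_score, best_length).
    """
    score = best = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and r_start + i < len(ref):
        score += match if query[q_start + i] == ref[r_start + i] else mismatch
        if score > best:
            best, best_len = score, i + 1
        elif best - score > max_drop:
            break  # the alignment has degraded too far; give up
        i += 1
    return best, best_len


# Eight matching bases followed by a run of mismatches: extension is abandoned early.
print(extend_right("ACGTACGTTTTTTTTT", "ACGTACGTAAAAAAAA", 0, 0))
# A single mismatch with a mild penalty is bridged.
print(extend_right("ACGTAACGT", "ACGTTACGT", 0, 0, mismatch=-1))
```

The drop-off threshold is what keeps the extension from running through a poorly aligned region just to reach the end of the read.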

Mapping to the transcriptome

1 Alignment of exon-exon spanning reads
2 Multiple isoforms
3 Identification of novel splice junctions

STAR uses an indexed suffix array [generated using both the genomic sequence and the sequence spanning known exon-exon boundaries (transcriptome)] to find MMPs (maximal mappable prefixes, i.e. the longest possible perfect matches), identifies "anchor alignments", and stitches them together.

STAR can also identify novel junctions if it finds enough reads as support. Users can define how many reads must span a novel junction, and how many bases must be covered on either side of the junction.
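The MMP search can be sketched with a naive suffix array. This is a toy illustration under stated assumptions: exact matching only, a suffix array built by plain sorting, and hypothetical function names and sequences. STAR's actual search also handles mismatches, scoring and stitching.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions sorted by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _char_at(text, sa, row, k):
    """Character at offset k of the suffix in the given row ('' past the end)."""
    pos = sa[row] + k
    return text[pos] if pos < len(text) else ""


def _bound(text, sa, lo, hi, k, c, upper):
    """Binary search in rows [lo, hi) for the first row whose k-th character
    is >= c (lower bound) or > c (upper bound)."""
    while lo < hi:
        mid = (lo + hi) // 2
        ch = _char_at(text, sa, mid, k)
        if ch < c or (upper and ch == c):
            lo = mid + 1
        else:
            hi = mid
    return lo


def maximal_mappable_prefix(read, text, sa):
    """Return (length, positions) of the longest read prefix found exactly in text."""
    lo, hi, length = 0, len(sa), 0
    for k, c in enumerate(read):
        new_lo = _bound(text, sa, lo, hi, k, c, upper=False)
        new_hi = _bound(text, sa, new_lo, hi, k, c, upper=True)
        if new_lo == new_hi:
            break  # no suffix extends the match by this character
        lo, hi, length = new_lo, new_hi, k + 1
    return length, (sorted(sa[lo:hi]) if length else [])


# Toy "genome": exon 1, an intron, exon 2 (all sequences made up).
genome = "ACGTTGCA" + "GTAAGTCCCAG" + "TTAGGCAT"
sa = build_suffix_array(genome)
read = "TTGCA" + "TTAGG"  # spans the exon-exon junction

first_len, first_pos = maximal_mappable_prefix(read, genome, sa)
second_len, second_pos = maximal_mappable_prefix(read[first_len:], genome, sa)
print(first_len, first_pos, second_len, second_pos)
```

The first prefix ends where the read leaves exon 1, and the search restarted on the remainder lands in exon 2; the gap between the two hits is the candidate intron.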

Splice-aware aligner: STAR [Spliced Transcripts Alignment to a Reference]

Running STAR

STAR has many parameters (familiarize yourself with the manual)! See [Ballouz et al., 2018] for a discussion of how parameter selection affects mapping (e.g., handling of multi-mapped reads, intron sizes).
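As a rough sketch of what a STAR run involves, the helpers below assemble the argument lists for the two usual steps, index generation and alignment, without executing anything. The helper names, file paths and parameter values are hypothetical; the option names are taken from the STAR manual as far as I know them, so check them against the manual for your version.

```python
def star_index_command(genome_dir, fasta, gtf, read_length, threads=4):
    """Arguments for building a STAR genome index (step 1)."""
    return [
        "STAR", "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,                      # known splice junctions
        "--sjdbOverhang", str(read_length - 1),    # commonly read length minus 1
        "--runThreadN", str(threads),
    ]


def star_align_command(genome_dir, fastq_files, out_prefix, threads=4, extra_options=None):
    """Arguments for aligning reads against an existing index (step 2)."""
    cmd = [
        "STAR", "--runMode", "alignReads",
        "--genomeDir", genome_dir,
        "--readFilesIn", *fastq_files,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", out_prefix,
        "--runThreadN", str(threads),
    ]
    for option, value in (extra_options or {}).items():
        cmd += [option, str(value)]
    return cmd


# Hypothetical paths and illustrative values for multi-mapping and intron size limits.
index_cmd = star_index_command("star_index", "genome.fa", "genes.gtf", read_length=100)
align_cmd = star_align_command(
    "star_index", ["sample_R1.fastq", "sample_R2.fastq"], "sample.",
    extra_options={"--outFilterMultimapNmax": 1, "--alignIntronMax": 100000},
)
print(" ".join(index_cmd))
print(" ".join(align_cmd))
```

Passing such a list to `subprocess.run(cmd, check=True)` would launch STAR; keeping the list in a log records exactly which parameters were used.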

Output files

SAM files

Each line of the optional header section starts with @, and includes information such as chromosome names (SN) and their lengths (LN). The vast majority of lines within a SAM file are compact representations of the read alignments, where each read is described by the 11 mandatory entries and a variable number of optional fields [Li et al., 2009].
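The layout described above can be made concrete with a minimal parser. The field names follow the SAM specification; the function name and the example lines are made up for illustration, and real files are better read with a dedicated library.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam(lines):
    """Return (reference lengths from @SQ header lines, list of alignment records)."""
    ref_lengths = {}
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):  # header section
            if line.startswith("@SQ"):
                tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
                ref_lengths[tags["SN"]] = int(tags["LN"])
            continue
        cols = line.split("\t")
        record = {name: (int(value) if name in INT_FIELDS else value)
                  for name, value in zip(SAM_FIELDS, cols)}
        record["OPT"] = cols[11:]  # everything after the 11 mandatory fields
        records.append(record)
    return ref_lengths, records


# Illustrative input only.
sam_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "read1\t99\tchr1\t10000\t60\t4M\t=\t10200\t204\tACGT\tIIII\tNH:i:1",
]
ref_lengths, records = parse_sam(sam_lines)
print(ref_lengths, records[0]["RNAME"], records[0]["POS"], records[0]["OPT"])
```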

SAM FLAG field

The FLAG field includes information about the mapping of the individual read. It is a bitwise flag, compactly storing answers to multiple binary Yes/No questions as a short series of bits, where each of the single bits can be addressed separately.

See https://broadinstitute.github.io/picard/explain-flags.html to interpret bit flag values.
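Decoding a FLAG value by hand is a matter of testing each bit. The sketch below uses the bit meanings from the SAM specification; the dictionary and function names are hypothetical.

```python
SAM_FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails platform/vendor quality checks",
    0x400: "read is PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """List the meaning of every bit that is set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAG_BITS.items() if flag & bit]


# 99 = 1 + 2 + 32 + 64
print(decode_flag(99))
```

Because each property occupies its own bit, a single bitwise AND answers one Yes/No question, for example `flag & 0x4` for "is this read unmapped?".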

CIGAR [Concise Idiosyncratic Gapped Alignment Report string]
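A CIGAR string is a sequence of length/operation pairs. As a minimal sketch (hypothetical function names, operation semantics as defined in the SAM specification), the code below parses a CIGAR string and computes how many bases it consumes on the read and on the reference.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")  # operations that consume read bases
REF_OPS = set("MDN=X")    # operations that consume reference bases


def parse_cigar(cigar):
    """Return a list of (length, operation) tuples; '*' means no alignment."""
    if cigar == "*":
        return []
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def query_length(cigar):
    """Number of read bases covered (soft-clipped bases included)."""
    return sum(n for n, op in parse_cigar(cigar) if op in QUERY_OPS)


def reference_span(cigar):
    """Number of reference bases spanned, including skipped introns (N)."""
    return sum(n for n, op in parse_cigar(cigar) if op in REF_OPS)


# A spliced read: 3 soft-clipped bases, 20 aligned, a 1000 bp intron, 27 aligned.
print(parse_cigar("3S20M1000N27M"), query_length("3S20M1000N27M"), reference_span("3S20M1000N27M"))
```

For spliced RNA-seq alignments the N operation marks the intron, which is why the reference span can be far longer than the read itself.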

SAM OPT field

The number of optional SAM/BAM fields, their value types and the information stored within them depend on the alignment program and can vary substantially.
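Optional fields share the TAG:TYPE:VALUE layout, so they can be parsed generically whatever the aligner. The sketch below is an illustration with a hypothetical function name; the tag sets in the example are only meant to resemble typical aligner output, and array (B) and hex (H) values are left as strings.

```python
def parse_optional_fields(fields):
    """Convert ['NH:i:1', 'MD:Z:50', ...] into a dict of typed values."""
    tags = {}
    for field in fields:
        tag, value_type, value = field.split(":", 2)  # the value may itself contain ':'
        if value_type == "i":
            value = int(value)
        elif value_type == "f":
            value = float(value)
        # types A, Z, H and B are kept as strings in this sketch
        tags[tag] = value
    return tags


# Two illustrative records with different tag sets.
spliced_aligner_tags = parse_optional_fields(["NH:i:1", "HI:i:1", "AS:i:98", "nM:i:0"])
genomic_aligner_tags = parse_optional_fields(["NM:i:0", "MD:Z:50", "AS:i:50", "XS:i:0"])
print(spliced_aligner_tags)
print(genomic_aligner_tags)
```

Code that relies on a particular tag should check that the tag is present, since another aligner may not write it.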

Exploring SAM/BAM files

The most widely used tool to explore and manipulate SAM/BAM files is samtools.
There are many options to subset reads based on SAM fields such as chromosomal location, FLAG value, or mapping quality.

  samtools view <in.bam>

Use egrep to subset reads based on the optional tags.
Most downstream applications also require the BAM file to be indexed by reference sequence position, to allow the efficient retrieval of all reads aligning to a locus.

  samtools index <in.bam>
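The filtering logic can be written out in plain Python to show what such a subset does to each alignment line. This is a sketch under assumptions: the function name and the example lines are hypothetical, the FLAG and mapping-quality tests follow my understanding of the `-f`, `-F` and `-q` options of `samtools view`, and the tag test stands in for the egrep step.

```python
import re


def keep_alignment(sam_line, require_flags=0, exclude_flags=0, min_mapq=0, tag_pattern=None):
    """Decide whether one SAM alignment line passes the given filters."""
    cols = sam_line.rstrip("\n").split("\t")
    flag, mapq = int(cols[1]), int(cols[4])
    if (flag & require_flags) != require_flags:
        return False  # some required bit is not set
    if flag & exclude_flags:
        return False  # some excluded bit is set
    if mapq < min_mapq:
        return False
    if tag_pattern is not None and not any(re.search(tag_pattern, f) for f in cols[11:]):
        return False  # no optional field matches the pattern
    return True


# Illustrative records: a uniquely mapped read, a secondary alignment, an unmapped read.
alignments = [
    "r1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII\tNH:i:1",
    "r2\t355\tchr1\t500\t3\t4M\t=\t700\t204\tACGT\tIIII\tNH:i:2",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
kept = [
    line.split("\t")[0]
    for line in alignments
    if keep_alignment(line, require_flags=0x1, exclude_flags=0x100 | 0x4,
                      min_mapq=30, tag_pattern=r"^NH:i:1$")
]
print(kept)
```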

References

Sara Ballouz, Alexander Dobin, Thomas R Gingeras, and Jesse Gillis. The fractured landscape of RNA-seq alignment: the default in our STARs. Nucleic Acids Research, 46(10):5125–5138, 05 2018. doi: 10.1093/nar/gky325. URL https://dx.doi.org/10.1093/nar/gky325.

Sara Ballouz, Alexander Dobin, and Jesse Gillis. Is it time to change the reference genome? bioRxiv, 2019. doi: 10.1101/533166. URL https://www.biorxiv.org/content/early/2019/01/29/533166.

Alexander Dobin, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics, 29(1):15–21, 2013. doi: 10.1093/bioinformatics/bts635.

Paolo Ferragina and Giovanni Manzini. Opportunistic Data Structures with Applications. Technical report, 2010.

Jürgen Jänes, Fengyuan Hu, Alexandra Lewin, and Ernest Turro. A comparative study of RNA-seq analysis strategies. Briefings in Bioinformatics, (January):1–9, 2015. ISSN 1467-5463. doi: 10.1093/bib/bbv007.

Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv e-prints, art. arXiv:1303.3997, March 2013.

Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754–1760, 05 2009. doi: 10.1093/bioinformatics/btp324.

Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16):2078–9, August 2009. ISSN 1367-4811. doi: 10.1093/bioinformatics/btp352.

S.P. Pfeifer. From next-generation resequencing reads to a high-quality variant data set. Heredity, 118(2):111–124, 2017. doi: 10.1038/hdy.2016.102.

Knut Reinert, Ben Langmead, David Weese, and Dirk J. Evers. Alignment of next-generation sequencing reads. Annual Review of Genomics and Human Genetics, 16:133–151, 8 2015. doi: 10.1146/annurev-genom-090413-025358.

Po-Yen Wu, John H. Phan, and May D. Wang. Assessing the impact of human genome annotation choice on RNA-seq expression estimates. BMC Bioinformatics, 14(11):S8, Nov 2013. doi: 10.1186/1471-2105-14-S11-S8.

Shanrong Zhao and Baohong Zhang. A comprehensive evaluation of Ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification. BMC Genomics, 16(1):97, Feb 2015. doi: 10.1186/s12864-015-1308-8.
