Data analysis methods for next-generation sequencing technologies Gabor T. Marth Boston College Biology Department Epigenomics & Sequencing Meeting July 14-15, 2008, Boston, MA

T1. Roche / 454 FLX system

• pyrosequencing technology
• variable read-length
• the only new technology with >100bp reads
• tested in many published applications
• supports paired-end read protocols with up to 10kb separation size

T2. Illumina / Solexa Genome Analyzer

• fixed-length short-read sequencer
• read properties are very close to those of traditional capillary sequences
• very low INDEL error rate
• tested in many published applications
• paired-end read protocols support short (<600bp) separation

T3. AB / SOLiD system

2-base color code (rows: 1st base, columns: 2nd base)

            2nd base
1st base    A  C  G  T
A           0  1  2  3
C           1  0  3  2
G           2  3  0  1
T           3  2  1  0

• fixed-length short-read sequencer
• employs a 2-base encoding system that can be used for error reduction and improving SNP calling accuracy
• requires color-space informatics
• published applications underway / in review
• paired-end read protocols support up to 10kb separation size
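
A minimal sketch of why the 2-base encoding helps separate sequencing errors from SNPs. It assumes the commonly described color code in which each base gets a 2-bit value (A=0, C=1, G=2, T=3) and the color of a dinucleotide is the XOR of the two values; the function names are illustrative and not taken from any vendor software.

```python
# Assumed 2-bit base codes; color of two adjacent bases = XOR of their codes.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode(seq):
    """Return the list of colors (0-3) for each pair of adjacent bases."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]


def decode(first_base, colors):
    """Rebuild the base sequence from a known first base and its colors."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


def color_differences(ref, read):
    """Indices at which the color strings of ref and read disagree."""
    pairs = zip(encode(ref), encode(read))
    return [i for i, (x, y) in enumerate(pairs) if x != y]


ref = "ACGTTAGC"
snp = "ACGATAGC"  # one base substituted at position 3
print(color_differences(ref, snp))  # a true substitution changes two adjacent colors
```

Under this encoding an isolated base substitution alters two adjacent colors, whereas a single miscalled color alters only one, which is the property a color-space SNP caller can exploit.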

T4. Helicos / Heliscope system

• experimental short-read sequencer system
• single molecule sequencing
• no amplification
• variable read-length
• error rate reduced with 2-pass template sequencing

A1. Variation discovery: SNPs and short-INDELs

1. sequence alignment

2. dealing with non-unique mapping

3. looking for allelic differences

A2. Structural variation detection

• structural variations (deletions, insertions, inversions and translocations) from paired-end read map locations

• copy number (for amplifications, deletions) from depth of read coverage
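
The two signals listed above can be sketched as follows. This is an illustration under stated assumptions, not a production structural variation caller: the expected library geometry (ends on opposite strands, 100-600 bp fragments), the 35 bp read length, the window depths, and all function names are hypothetical.

```python
from statistics import median


def classify_pair(chrom1, start1, strand1, chrom2, start2, strand2,
                  read_len=35, min_frag=100, max_frag=600):
    """Label one read pair by comparing its mapping to the expected geometry."""
    if chrom1 != chrom2:
        return "translocation candidate"
    if strand1 == strand2:
        return "inversion candidate"
    # Outer distance spanned by the two ends on the reference.
    span = abs(start2 - start1) + read_len
    if span > max_frag:
        return "deletion candidate"   # ends map farther apart than the library allows
    if span < min_frag:
        return "insertion candidate"  # ends map closer together than expected
    return "concordant"


def copy_number_estimates(window_depths, ploidy=2):
    """Scale each window's read depth by the median depth (assumed normal copy number)."""
    baseline = median(window_depths)
    return [round(ploidy * depth / baseline, 2) for depth in window_depths]


print(classify_pair("chr1", 1000, "+", "chr1", 5000, "-"))
print(copy_number_estimates([30, 31, 29, 60, 15]))
```

A real caller would require several independent pairs supporting the same event and would correct depth for the coverage biases discussed later in the talk.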

A3. Identification of protein-bound DNA

Chromatin structure (CHIP-SEQ) (Mikkelsen et al. Nature 2007)

Transcription factor binding sites (Robertson et al. Nature Methods, 2007)

[Figure: aligned reads along the genome sequence]

A4. Novel transcript discovery (genes)

Mortazavi et al. Nature Methods

A5. Novel transcript discovery (miRNAs)

Ruby et al. Cell, 2006

A6. Expression profiling by tag counting

[Figure: aligned reads counted over two genes]

Jones-Rhoades et al. PLoS Genetics, 2007
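
Tag counting can be sketched as assigning each aligned read to the gene interval that contains its start and then normalizing by gene length. The gene coordinates, read positions, and function names below are made up for illustration, and the sketch assumes a single chromosome with non-overlapping genes.

```python
from bisect import bisect_right
from collections import Counter


def count_tags(read_starts, genes):
    """Count reads whose start falls inside each (name, start, end) gene interval."""
    genes = sorted(genes, key=lambda g: g[1])
    starts = [g[1] for g in genes]
    counts = Counter({name: 0 for name, _, _ in genes})
    for pos in read_starts:
        i = bisect_right(starts, pos) - 1  # last gene starting at or before pos
        if i >= 0 and pos <= genes[i][2]:
            counts[genes[i][0]] += 1
    return counts


def reads_per_kb(counts, genes):
    """Length-normalize the counts so long and short genes are comparable."""
    return {name: counts[name] * 1000 / (end - start + 1)
            for name, start, end in genes}


genes = [("geneA", 100, 199), ("geneB", 500, 1499)]
reads = [50, 100, 150, 199, 200, 600, 700, 1499, 1500]
counts = count_tags(reads, genes)
print(counts, reads_per_kb(counts, genes))
```

In this toy input both genes receive three reads, yet the shorter gene has ten times the per-kb density, which is why raw counts alone are not an expression measure.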

A7. De novo organismal genome sequencing

[Figure: short reads, longer reads and read pairs combined into assembled sequence contigs]

Lander et al. Nature 2001

C1. Read length

[Figure: read length by technology; axis: read length [bp], 0-400. Ranges shown: ~200-450 (var), 25-40 (fixed), 25-35 (fixed), 20-35 (var)]

When does read length matter?

• short reads often sufficient where the entire read length can be used for mapping:

SNPs, short-INDELs, SVs
CHIP-SEQ
short RNA discovery
counting (mRNA, miRNA)

• longer reads are needed where one must use parts of reads for mapping:

de novo sequencing

novel transcript discovery

[Figure: a read whose two parts (aacttagacttacagacttacatacgta, accgattactatacta) map to Known exon 1 and Known exon 2]

C2. Read error rate

• error rate dictates the stringency of the read mapper

• error rate typically 0.4 - 1%

• the more errors the aligner must tolerate, the lower the fraction of the reads that can be uniquely aligned

[Figure: fraction of genome (0.00-0.40) vs. number of mismatches allowed (0, 1, 2)]
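
To see why the error rate dictates mapper stringency, one can estimate how many reads would still be mappable at a given mismatch tolerance. The sketch below assumes independent errors at a uniform per-base rate, which is a simplification (error rates vary along the read); the 35 bp read length and 1% rate are illustrative values.

```python
from math import comb


def prob_at_most_k_errors(read_len, error_rate, k):
    """Binomial probability that a read carries k or fewer base errors."""
    return sum(
        comb(read_len, i) * error_rate ** i * (1 - error_rate) ** (read_len - i)
        for i in range(k + 1)
    )


for allowed in (0, 1, 2):
    share = prob_at_most_k_errors(35, 0.01, allowed)
    print(f"mismatches allowed: {allowed}  reads within tolerance: {share:.3f}")
```

Allowing more mismatches recovers more of the error-containing reads, but, as noted above, it also lowers the fraction of reads that can be placed uniquely.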

Error rate grows with each cycle

• this phenomenon limits useful read length

[Figure: error rate (0.00%-10.00%) vs. position on read (0-40)]

Substitutions vs. INDEL errors

C3. Representational biases / library complexity

fragmentation biases

amplification biases (PCR)

sequencing biases

[Figure: sequencing coverage ranging from low/no representation to high representation]

Dispersal of read coverage

• this affects variation discovery (deeper starting read coverage is needed)
• it should have a major impact on counting applications

Amplification errors

many reads from clonal copies of a single fragment

• early PCR errors in “clonal” read copies lead to false positive allele calls

early amplification error gets propagated onto every clonal copy
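
The effect of clonal copies on allele calls can be illustrated by counting support per read versus per distinct fragment. Treating reads that share a start position and strand as copies of one fragment is an assumption made here for illustration; it is not a method described on the slide, and the read tuples are invented.

```python
from collections import defaultdict


def allele_support(reads):
    """Return allele support counted per read and per distinct (start, strand) fragment."""
    per_read = defaultdict(int)
    fragments = defaultdict(set)
    for start, strand, allele in reads:
        per_read[allele] += 1
        fragments[allele].add((start, strand))
    per_fragment = {allele: len(keys) for allele, keys in fragments.items()}
    return dict(per_read), per_fragment


# Five identical reads carry "A" (likely clonal copies); three distinct fragments carry "C".
reads = [(100, "+", "A")] * 5 + [(90, "+", "C"), (95, "-", "C"), (102, "+", "C")]
per_read, per_fragment = allele_support(reads)
print(per_read, per_fragment)
```

Counted per read, the "A" allele looks well supported; counted per fragment, it rests on a single molecule, which is exactly the situation in which an early PCR error produces a false positive call.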

C4. Paired-end reads

• fragment amplification: fragment length 100 - 600 bp
• fragment length limited by amplification efficiency

• circularization: 500bp - 10kb (sweet spot ~3kb)
• fragment length limited by library complexity

Korbel et al. Science 2007

• paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if a fragment length constraint is used to rescue non-uniquely mapping ends)
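
The rescue strategy can be sketched as follows: one end maps uniquely, the other has several candidate positions, and only candidates compatible with the library fragment length are kept. The function name, the 100-600 bp window, and the coordinates are hypothetical, and strand and chromosome checks are left out for brevity.

```python
def rescue_mate(anchor_pos, candidate_positions, min_frag=100, max_frag=600):
    """Pick the one candidate position consistent with the fragment length, if unique."""
    consistent = [
        pos for pos in candidate_positions
        if min_frag <= abs(pos - anchor_pos) <= max_frag
    ]
    # Rescue only when exactly one candidate satisfies the constraint.
    return consistent[0] if len(consistent) == 1 else None


print(rescue_mate(10_000, [250, 10_300, 87_000]))  # one candidate fits the window
print(rescue_mate(10_000, [10_200, 10_300]))       # still ambiguous
```

Returning nothing when several candidates fit keeps the rescue conservative: an ambiguous mate stays ambiguous rather than being placed arbitrarily.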

Technologies / properties / applications

                           Technology
                           Roche/454               Illumina/Solexa   AB/SOLiD

Read properties
Read length                200-450bp               20-50bp           25-50bp
Error rate                 <0.5%                   <1.0%             <0.5%
Dominant error type        INDEL                   SUB               SUB
Quality values available   yes                     yes               not really
Paired-end separation      < 10kb (3kb optimal)    100 - 600bp       500bp - 10kb (3kb optimal)

Applications
SNP discovery              ●                       ●                 ○
short-INDEL discovery                              ●                 ○
SV discovery               ○                       ○                 ●
CHIP-SEQ                   ○                       ●                 ●
small RNA/gene discovery   ○                       ●                 ●
mRNA transcript discovery  ●                       ○                 ○
Expression profiling       ○                       ●                 ●
De novo sequencing         ●                       ?                 ?

Resequencing-based SNP discovery

(i) base calling

(ii) micro-repeat analysis

(iii) read mapping (pair-wise alignment to genome reference)

(iv) read assembly

(v) SNP calling

(vi) SNP validation

(vii) data viewing, hypothesis generation

[Figure: reads from individuals (IND) aligned to the reference (REF)]

The “toolbox”

• base callers

• microrepeat finders

• read mappers

• SNP callers

• structural variation callers

• assembly viewers

Reference guided read mapping

Reference-sequence guided mapping:

…you get the pieces…

…AND they give you the cover on the box

Some pieces are more unique than others

MOSAIK: an anchored aligner / assembler

Step 1. initial short-hash scan for possible read locations

Step 2. evaluation of candidate locations with SW method
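
The two-step idea (a hash scan for candidate locations, then Smith-Waterman evaluation of each candidate) can be sketched in a few functions. This is a generic seed-and-extend illustration, not MOSAIK's actual implementation; the k-mer size, scoring values, padding, and function names are all assumptions.

```python
from collections import Counter, defaultdict


def build_index(reference, k):
    """Step 1a: hash every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_starts(read, index, k):
    """Step 1b: each k-mer hit votes for the implied read start on the reference."""
    votes = Counter()
    for j in range(len(read) - k + 1):
        for i in index.get(read[j:j + k], ()):
            votes[i - j] += 1
    return [start for start, _ in votes.most_common()]


def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Step 2: best local alignment score with a linear gap penalty."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            score = max(0,
                        prev[j - 1] + (match if x == y else mismatch),
                        prev[j] + gap,
                        cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def map_read(read, reference, index, k, pad=5):
    """Score every candidate window and keep the best-scoring start."""
    best_start, best_score = None, 0
    for start in candidate_starts(read, index, k):
        lo = max(0, start - pad)
        hi = min(len(reference), start + len(read) + pad)
        score = smith_waterman(read, reference[lo:hi])
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


reference = "TTGACCATGCAGGCTTAACGGATCCGTAGCTAAGGCTTACG"
read = reference[12:21] + "T" + reference[22:30]  # one substitution in the middle
print(map_read(read, reference, build_index(reference, 6), 6))
```

The hash scan keeps the expensive dynamic programming confined to a handful of short windows, while the Smith-Waterman step is what lets mismatches and gaps be tolerated within each window.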

Michael Stromberg

Non-unique mapping, gapped alignments

1. Non-unique read mapping: optionally either only report uniquely mapped reads or report all map locations for each read (mapping quality values for all mapped reads are being implemented)

2. Gapped alignments: allow for mapping reads with insertion or deletion sequencing errors, and reads with bona fide INDEL alleles

Read types aligned, paired-end read strategy

3. Aligns and co-assembles customary read types:
ABI/capillary
Illumina/Solexa
AB/SOLiD
Roche/454
Helicos/Heliscope

[Figure: co-assembly of ABI/capillary, 454 FLX, 454 GS20 and Illumina reads]

4. Paired-end read alignments

Other mainstream read mappers

• ELAND (Tony Cox, Illumina)-- the “official” read mapper supplied by Illumina, fast

• MAQ (Li Heng + Richard Durbin, Sanger)-- the most widely used read mapper, low RAM footprint

• SOAP (Beijing Genomics Institute)-- a new mapper developed for human next-gen reads

• SHRIMP (Michael Brudno, University of Toronto)-- full Smith-Waterman

Speed

Polymorphism / mutation detection

sequencing error

polymorphism

Determining genotype directly from sequence

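[Figure: aligned reads from three individuals (individual 1, individual 2, individual 3) with the genotype called at the variable site]

AACGTTAGCATA
AACGTTAGCATA
AACGTTCGCATA
AACGTTCGCATA
genotype: A/C

AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
genotype: C/C

AACGTTAGCATA
AACGTTAGCATA
genotype: A/A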
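
A minimal sketch of calling a genotype directly from the bases observed at one site, using the read sequences shown on this slide. It assumes a diploid sample, two known candidate alleles, independent reads, and a uniform 1% base error rate; these choices and the function names are illustrative, not the method used by the authors' software.

```python
from math import log


def genotype_log_likelihoods(bases, allele_a, allele_b, error_rate=0.01):
    """Log-likelihood of the observed bases under A/A, A/B and B/B."""
    def p_base(observed, genotype):
        # Each read samples one of the two chromosomes with probability 1/2.
        return sum(
            0.5 * ((1 - error_rate) if observed == allele else error_rate / 3)
            for allele in genotype
        )

    genotypes = [(allele_a, allele_a), (allele_a, allele_b), (allele_b, allele_b)]
    return {
        "/".join(g): sum(log(p_base(b, g)) for b in bases)
        for g in genotypes
    }


def call_genotype(bases, allele_a="A", allele_b="C"):
    """Return the genotype with the highest likelihood."""
    scores = genotype_log_likelihoods(bases, allele_a, allele_b)
    return max(scores, key=scores.get)


SITE = 6  # 0-based column of the variable base in the reads on the slide
groups = [
    ["AACGTTAGCATA"] * 2 + ["AACGTTCGCATA"] * 2,
    ["AACGTTCGCATA"] * 4,
    ["AACGTTAGCATA"] * 2,
]
for reads in groups:
    print(call_genotype([r[SITE] for r in reads]))
```

With only two reads the homozygous call is weakly supported, since a heterozygote would show the same two bases a quarter of the time; this is one reason deeper coverage is needed for confident genotyping.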

Software

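GigaBayes

$$
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{\dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \; P_{\mathrm{Prior}}(S_1, \ldots, S_N)}
{\displaystyle \sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]} \dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \; P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})}
$$

[Figure: example SNP and INS calls]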
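
The Bayesian calculation on this slide can be evaluated by brute-force enumeration for a small number of reads. The sketch below is an assumption-laden illustration and not GigaBayes itself: per-read posteriors are derived from a called base and a PHRED quality, the per-base prior is uniform (0.25), and the joint prior gives probability `theta` to polymorphic configurations, spread evenly. `theta` and every function name are hypothetical.

```python
from itertools import product

BASES = "ACGT"


def base_posterior(called, quality):
    """P(S | R): the called base is right with probability 1 - 10^(-Q/10)."""
    err = 10 ** (-quality / 10)
    return {b: (1 - err) if b == called else err / 3 for b in BASES}


def joint_prior(config, theta):
    """Assumed P_Prior(S_1..S_N): theta shared by variable configurations."""
    if len(set(config)) == 1:
        return (1 - theta) / 4
    return theta / (4 ** len(config) - 4)


def snp_probability(calls, theta=0.001, base_prior=0.25):
    """Sum the normalized posterior over all configurations that are variable."""
    posteriors = [base_posterior(base, q) for base, q in calls]
    total = variable = 0.0
    for config in product(BASES, repeat=len(calls)):
        term = joint_prior(config, theta)
        for posterior, s in zip(posteriors, config):
            term *= posterior[s] / base_prior
        total += term
        if len(set(config)) > 1:
            variable += term
    return variable / total


print(snp_probability([("A", 30)] * 3 + [("C", 30)] * 3))  # mixed calls
print(snp_probability([("A", 30)] * 6))                    # unanimous calls
```

Enumeration costs 4^N terms, so this form is only practical for a handful of reads; it is meant to make the structure of the numerator and denominator concrete rather than to be efficient.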

Data visualization

1. aid software development: integration of trace data viewing, fast navigation, zooming/panning

2. facilitate data validation (e.g. SNP validation): co-viewing of multiple read types, quality value displays

3. promote hypothesis generation: integration of annotation tracks

Weichun Huang

Applications

1. SNP discovery in shallow, single-read 454 coverage (Drosophila melanogaster)

2. SNP and INDEL discovery in deep Illumina short-read coverage (Caenorhabditis elegans)

3. Mutational profiling in deep 454 and Illumina read data (Pichia stipitis)

(image from Nature Biotech.)

Our software is available for testing

http://bioinformatics.bc.edu/marthlab/Beta_Release

Credits

http://bioinformatics.bc.edu/marthlab

Elaine Mardis (Washington University)
Andy Clark (Cornell University)
Doug Smith (Agencourt)

Research supported by: NHGRI (G.T.M.), BC Presidential Scholarship (A.R.Q.)

Derek Barnett
Eric Tsung

Aaron Quinlan
Damien Croteau-Chonka

Weichun Huang

Michael Stromberg

Chip Stewart

Michele Busby

Accuracy

• As is the case for all heuristic alignment algorithms, accuracy and speed are option- and parameter-dependent

C3. Quality values are important for allele calling

• PHRED base quality values represent the estimated likelihood of sequencing error and help us pick out true alternate alleles

• inaccurate or not well calibrated base quality values hinder allele calling

Q-values should be accurate … and high!
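
For reference, a PHRED quality value Q corresponds to an error probability of 10^(-Q/10). The short sketch below converts between the two and shows how quality values enter an allele call; the helper names and the independence assumption are illustrative.

```python
from math import log10, prod


def phred_to_error_prob(q):
    """Error probability implied by a PHRED quality value."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """PHRED quality value implied by an error probability."""
    return -10 * log10(p)


def prob_all_errors(qualities):
    """Chance that every base supporting an allele is a sequencing error (independent reads)."""
    return prod(phred_to_error_prob(q) for q in qualities)


print(phred_to_error_prob(20), phred_to_error_prob(30))
print(prob_all_errors([20, 20, 20]))
```

If the reported qualities overstate the true accuracy, the product above is too small and errors get called as alleles; if they understate it, true alleles are missed, which is why calibration matters as much as the values being high.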

Software tools for next-gen sequence analysis

Next-generation sequencing technologies and applications