

Next-Generation Sequencing (NGS): An Overview

Francesca D. Ciccarelli

BITS 2009, Mar 20th 2009

Keywords: Next generation sequencing; Massive parallel sequencing; Ultra-deep sequencing; Pyro-sequencing

NGS in the Literature

[Figure labels: NIH Grant, 454, Solexa, SOLiD, Next-Next]

The $1000 Genome Project

• “$100,000 Genome”: Raw measure of human genetic variation
• “$10,000 Genome”:
  - Sequencing of tumor genome collections
  - SNP and disease-associated mutations
  - Signs of natural selection within a population
• “$1,000 Genome”: Personal genome

Feb 2004: NHGRI launched a call for grant applications to develop next-generation sequencing technologies

NIH Genome Centers spend > $120 million/year on genome sequences

Nat Rev Genet 5 (2004), pp. 335–344.
Curr Opin Genet Dev. 2006 16(6):545-52.

Traditional Genome Sequencing

I. Library Preparation
  1. Shearing of DNA
  2. Insertion of Fragments into a Plasmid
  3. Transformation
  4. Subcloning of Sheared Fragment
  5. Colony Picking

II. Sequencing
  6. Cell Lysing
  7. Rolling-Circle Amplification
  8. Capillary Sequencing

III. Assembly and QA
  9. Assembly
  10. Quality Assessment

Based on the protocol used at JGI (http://www.jgi.doe.gov/)

Limitations of Traditional Sequencing

PROBLEM #1: in vivo cloning
Clonal Bias and Unclonable DNA
• Hard Stops (hairpins, triple helices, stem loops, high GC content)
• Polymerase (long stretches of polyA)

PROBLEM #2: timing and workload
• 10,000 instrument days/human genome
• Only affordable by genome centers

PROBLEM #3: costs
• Human Genome Reference Sequence (2001; 2003): $1 billion (99.995% accurate; 99% complete)
• Estimated Cost for Individual Genome: $10 million (1 year using >30 instruments)

Sequencing Approaches

[Figure; source: Journal of Experimental Biology 210, 1518-1525 (2007)]

NGS on the Market

Resolve inherent biases of in vivo cloning

454 (Roche)
- emulsion PCR
- pyrosequencing
- read lengths ca. 400 bp
- first instrument in Oct 2005

Solexa (Illumina)
- PCR on solid support
- reversible terminator sequencing
- read lengths ca. 35 bp
- first instrument in July 2006

SOLiD (ABI)
- emulsion PCR
- sequencing by ligation
- read lengths ca. 35 bp
- first instrument in June 2007

Library Preparation

[Figure panels: 454, Solexa, SOLiD]

http://solid.appliedbiosystems.com
Shendure et al. (2005) Science 309 (1728-1732)
Margulies et al. (2005) Nature 437 (376-380)

Clonal Amplification

Solexa
- DNA attached to the surface
- Bridge Amplification
- Double Strand Denaturation
- Repeat Cycles

454
- Anneal sstDNA
- Emulsion in water-in-oil microreactors
- Clonal Amplification
- Enrichment for DNA-positive beads

SOLiD
- Emulsion PCR
- Enrichment for DNA-positive beads
- Transfer on solid array

Sequencing by Synthesis

454 (Pyrosequencing)
Reaction with:
• DNA polymerase
• ATP sulfurylase
• luciferase
• apyrase
• APS
• luciferin
Addition of dNTPs one at a time

Solexa (Reversible Terminators)
Reaction with:
• DNA polymerase
• primers
• 4 labelled reversible terminators
Determine first base using laser light
Wash off and repeat for the whole sequence to obtain the sequence read
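The pyrosequencing step above (dNTPs added one at a time) can be illustrated with a toy flowgram. This is a minimal sketch, not the 454 base caller: it assumes an idealized signal equal to the number of bases incorporated in each flow, and the flow order `TACG`, the function names and the example read are illustrative assumptions.

```python
def flowgram(read, flow_order="TACG", n_flows=8):
    """Idealized pyrosequencing signals for the strand being synthesized.

    Each flow adds a single nucleotide type; the signal is the number of
    consecutive bases of that type incorporated in the flow (0 if none).
    """
    signals = []
    pos = 0
    for flow in range(n_flows):
        nucleotide = flow_order[flow % len(flow_order)]
        run_length = 0
        # A homopolymer stretch is incorporated within one flow.
        while pos < len(read) and read[pos] == nucleotide:
            run_length += 1
            pos += 1
        signals.append((nucleotide, run_length))
    return signals


def call_bases(signals):
    """Rebuild the read by repeating each nucleotide by its signal."""
    return "".join(nucleotide * count for nucleotide, count in signals)


signals = flowgram("TTACGGA")
print(signals)
print(call_bases(signals))
```

A homopolymer such as TT or GG is read in a single flow, so its length has to be inferred from signal intensity alone.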

Sequencing by Ligation

SOLiD
Reaction with:
• Universal Primers
• Ligase
• Probes

1st Cycle Sequencing
- Probe Annealing
- Ligation
- Washing off
- Visualization
- Cleavage
[x 5 times]: 5 base read

Following Cycles
- annealing; ligation; washing; visualization; cleavage [x 5 times]
- Reset [x 5 times]

Two Bases Encoding

- Capillary Electrophoresis: 1be (one-base encoding)
- Sequencing by Ligation: 2be (two-base encoding)

Deconvolution matrix
• A single color does not indicate one single base
• Each read contains information for 2 bases
• To decode the bases you have to know one of them

SNP Detection
[Figure panels: Real SNP, Miscall]
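The decoding rule above (each color covers two bases, and one base must be known) can be sketched as follows. This is a minimal illustration under stated assumptions: colors are modelled as the XOR of 2-bit base codes (A=0, C=1, G=2, T=3), which reproduces the usual four-color dinucleotide scheme, and the function names and example sequences are hypothetical.

```python
BASES = "ACGT"  # assumed 2-bit codes: A=0, C=1, G=2, T=3


def encode(sequence):
    """Return (first base, colors); each color covers two adjacent bases."""
    colors = [BASES.index(a) ^ BASES.index(b)
              for a, b in zip(sequence, sequence[1:])]
    return sequence[0], colors


def decode(first_base, colors):
    """Decode colors into bases, starting from the one known base."""
    bases = [first_base]
    for color in colors:
        bases.append(BASES[BASES.index(bases[-1]) ^ color])
    return "".join(bases)


def changed_positions(colors_a, colors_b):
    """Indices at which two color reads of equal length differ."""
    return [i for i, (a, b) in enumerate(zip(colors_a, colors_b)) if a != b]


first, reference = encode("ATGGC")
_, with_snp = encode("ATCGC")  # one substituted base: G -> C
print(reference, with_snp, changed_positions(reference, with_snp))
print(decode(first, reference))
```

In this model a real SNP changes two adjacent colors, whereas a single changed color is more likely a miscall; decoding a miscalled color directly would alter every base after it.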

Massive Parallelization

Values: present (6 months/1 year)

454
Sequencing Reaction within the PicoTiterPlate Device
• 1.6 million wells/plate
• ~420 kread/run (1.2 Mread/run)

Solexa
Sequencing Reaction on planar, optically transparent surface
• > 10 million clusters
• ~50 Mread/run (220 Mread/run)

SOLiD
• ~20 million beads (1 µm diameter)
• ~95 Mread/run (220 Mread/run)


Timing and Throughput

Throughput values: Today (6 Months-1 Year)

Platform     Timing                      Throughput
Sanger       10 runs/day                 96 cap: 76.8 kbp/run; 384 cap: 0.3 Mbp/run
454 [1]      21 h                        100 Mbp/run (450 Mbp-1 Gbp)
Solexa [2]   4 days (single fragment)    1.5 Gbp/run (5-10 Gbp)
             4.5 days (paired-end)       3 Gbp/run (10-20 Gbp)
SOLiD [3]    3 days (single fragment)    6 Gbp/run (17 Gbp)
             6 days (paired-end)         10 Gbp/run (26 Gbp)

[1] Roche-Italy, pers. comm.
[2] Illumina, pers. comm.
[3] AppliedBS, pers. comm.
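To compare the platforms on a common scale, the run times and per-run yields on the slide above can be converted to bases per day. This is a rough sketch only: it assumes runs follow one another without idle time, uses the present single-fragment values, and the dictionary and function names are illustrative.

```python
from fractions import Fraction

# (bases per run, runs per day), present values from the slide above.
PLATFORMS = {
    "Sanger (96 cap)": (76_800, Fraction(10)),
    "454": (100_000_000, Fraction(24, 21)),             # one 21 h run
    "Solexa (single fragment)": (1_500_000_000, Fraction(1, 4)),
    "SOLiD (single fragment)": (6_000_000_000, Fraction(1, 3)),
}


def bases_per_day(bases_per_run, runs_per_day):
    """Yield per day, assuming back-to-back runs (an idealization)."""
    return bases_per_run * runs_per_day


per_day = {name: bases_per_day(*values) for name, values in PLATFORMS.items()}

# 76.8 kbp from 96 capillaries corresponds to 800 bp per capillary.
bases_per_capillary = 76_800 // 96

for name, value in per_day.items():
    print(f"{name}: {float(value) / 1e6:.1f} Mbp/day")
print(f"Sanger read per capillary: {bases_per_capillary} bp")
```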


Sequencing Costs

Platform             Costs/run           Costs/kbp
Sanger (x96, x384)   -                   $1 (raw kbp); $7 (consensus kbp, error = 4x10^-6) [2]
454 [1]              €7,195 (100 Mb)     ~€0.07/kbp; ~$9 (consensus kbp, error = 4x10^-5) [2]
Solexa [3]           €3,000 (1.5 Gbp)    ~€0.002/kbp
SOLiD [4]            €2,478 (6 Gbp)      ~€0.0004/kbp

[1] Roche-Italy, pers. comm.
[2] G. Church, Nat Biotechnol 24, 139 (2006)
[3] Illumina, pers. comm.
[4] AppliedBS, pers. comm.
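The per-kbp figures on the slide above follow from the cost per run divided by the yield per run. A short check, using only the run costs and yields quoted there; the variable and function names are illustrative.

```python
# (cost per run in EUR, bases per run), as quoted on the slide above.
RUNS = {
    "454": (7195, 100_000_000),
    "Solexa": (3000, 1_500_000_000),
    "SOLiD": (2478, 6_000_000_000),
}


def cost_per_kbp(cost_per_run_eur, bases_per_run):
    """Raw cost in EUR for 1,000 sequenced bases."""
    return cost_per_run_eur / (bases_per_run / 1000)


costs = {name: cost_per_kbp(*values) for name, values in RUNS.items()}
for name, value in costs.items():
    print(f"{name}: ~EUR {value:.4f}/kbp")
```

These are raw costs per sequenced kbp; a consensus kbp costs more because each position is covered by several reads.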

Limitations of NGS

PROBLEM #1: length of sequencing reads
• Much shorter than Sanger

PROBLEM #2: (huge) amount of data production
• Difficult data handling and analysis

PROBLEM #3: sequencing accuracy
• No reliable standard available yet
• Difficult to compare different methods

Length of Sequencing Reads

Read Length (bp)
Sanger   450-850
454      250 (450)
Solexa   35 (75)
SOLiD    25-35 (50)

- Resequencing (454): overlapping amplicons needed
- De novo Sequencing: difficult assembly
- Metagenomics: difficult assignment

Paired-End Sequencing

[Figure labels: tag1, tag2, insert]

- Increase the Read Length
- Help in Assembly Reconstruction
- Find Structural Variation (CNV, Rearrangements, etc.)

INSERT LENGTH
454      1.5-3 kbp (16 kbp)
Solexa   200-300 bp (2 kbp)
SOLiD    600-10,000 bp
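One way paired ends can point to structural variation is by comparing the distance between the two mapped tags with the expected insert length. The sketch below is a simplified illustration, not a published caller: it assumes both tags map to the same chromosome in the expected orientation, uses the 200-300 bp Solexa insert range above as default thresholds, and the function name and labels are hypothetical.

```python
def classify_pair(tag1_start, tag2_end, min_insert=200, max_insert=300):
    """Classify a mapped read pair by its span on the reference.

    tag1_start and tag2_end are reference coordinates of the outer ends
    of the two tags. A span longer than expected suggests sequence missing
    from the sample (deletion); a shorter span suggests extra sequence in
    the sample (insertion).
    """
    span = tag2_end - tag1_start
    if span > max_insert:
        return "possible deletion"
    if span < min_insert:
        return "possible insertion"
    return "concordant"


pairs = [(1_000, 1_250), (5_000, 6_400), (9_000, 9_120)]
for tag1_start, tag2_end in pairs:
    print(tag1_start, tag2_end, classify_pair(tag1_start, tag2_end))
```

In practice a call would rest on several independent pairs supporting the same event rather than on a single discordant pair.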

(Huge) Amount of Data Production

454      1 Gb (10-15 Gb after sequencing)
Solexa   10-20 Gb (0.5-2 Tb after sequencing)
SOLiD    40 Gb (7-8 Tb after sequencing)

Need ad hoc tool development for data analysis

Sequencing Accuracy

Difficult to compare because based on different technologies

• Sanger: raw 99.5%; consensus 99.995% (10x)
• 454 [1]: 99.99%; 99.96%; 99.5% (no homopolymers); 97.0% (with homopolymers; n>7; 0.7% human genome) [2]
• Solexa [3]: 99.8%-98.5%
• SOLiD [4]: raw 99.94%; consensus 99.999% (15x) (in principle, outstanding accuracy for SNP detection)

[1] Roche-Italy, pers. comm.
[2] Nat Med 12, 852-855 (2006)
[3] Illumina, pers. comm.
[4] AppliedBS, pers. comm.

Comparison of Sequencing Accuracy

- Generation of a mutant strain of Pichia stipitis (haploid yeast, genome size = 15.4 Mb, 14 mutations compared to reference)
- Whole-genome mutational profiling using 454, SOLiD, Illumina
- Comparative assessment of the sequence coverage needed to optimize sensitivity and specificity

Genome Res. (2008) 18:1638-1642

Human Genome Resequencing

• MAY 2007: James Watson's genome (454)
  - less than $1 million
  - few months
  - Nature (2008) 452, 872-876
• SEP 2007: Diploid Genome of Craig Venter (Sanger)
  - ~$1 million
  - few months
  - PLoS Biology (2007) 5(10): e254
• NOV 2008: Genome of an Asian Individual (Illumina)
  - ~$1 million
  - few weeks
  - Nature (2008) 456: 60-65
• NOV 2008: Genome of a male Yoruba (Illumina)
  - US$100,000
  - few weeks
  - Nature (2008) 456: 53-59

Next-Next Generation Sequencing

Single-molecule Sequencing

• LIBRARY PREPARATION: DNA fragmentation and addition of labeled polyA
• SURFACE BINDING: through hybridization with complementary polyT
• IMAGING: to establish starting sites of sequencing
• SEQUENCING: after adding labeled nucleotide and polymerase, followed by washing and imaging
• CHEMICAL CLEAVAGE: after washing, to remove the dye
• SECOND CYCLE: with another labeled nucleotide

Harris et al. Science 320 (2008)

Next-Next Generation Sequencing

[Figure caption: individual single molecules that incorporated a fluorescent "G" nucleotide in this cycle]

- At each cycle, the incorporated nucleotide emits light upon illumination
- The sequencer captures images for each strand up to 25 nucleotides

Tracking nucleotide incorporation on each strand determines the exact sequence of each individual DNA molecule

Helicos Web Site

Next-Next Generation Sequencing

- NO PCR
  - PCR introduces an uncontrolled bias in template representation, since amplification efficiency varies as a function of template properties
  - Errors of polymerase
- LOWER RUNNING COSTS
- HIGHER THROUGHPUT

Helicos Web Site

Single Molecule Sequencing

Visigen

- Engineered DNA polymerases
- Fluorescent nucleotide

During the chain extension, when a nucleotide is incorporated, there is energy transfer from the polymerase to the nucleotide, which emits light

http://visigenbio.com/technology_movie_streaming.html