Next-Generation Sequencing (NGS): An Overview
Francesca D. Ciccarelli
BITS 2009, Mar 20th 2009
Keywords: Next generation sequencing; Massive parallel sequencing; Ultra-deep sequencing; Pyro-sequencing
NGS in the Literature
[Figure: NGS in the literature; labels: NIH Grant, 454, Solexa, SOLiD, Next-Next]
The $1000 Genome Project
Feb 2004: NHGRI launched a grant application to develop next-generation sequencing technologies
NIH Genome Centers spend > $120 million/year on genome sequences
• “$100,000 Genome”: Raw measure of human genetic variation
• “$10,000 Genome”:
- Sequencing of tumor genome collections
- SNP and disease-associated mutations
- Signs of natural selection within a population
• “$1,000 Genome”: Personal genome
Nat Rev Genet 5 (2004), pp. 335–344. Curr Opin Genet Dev. 2006 16(6):545-52.
Traditional Genome Sequencing
Based on the protocol used at JGI (http://www.jgi.doe.gov/)
I. Library Preparation
1. Shearing of DNA
2. Insertion of Fragments into a Plasmid
3. Transformation
4. Subcloning of Sheared Fragment
5. Colony Picking
II. Sequencing
6. Cell Lysing
7. Rolling-Circle Amplification
8. Capillary Sequencing
III. Assembly and QA
9. Assembly
10. Quality Assessment
Limitations of Traditional Sequencing
PROBLEM #1: in vivo cloning
Clonal Bias and Unclonable DNA
• Hard Stops (hairpins, triple helices, stem loops, high GC content)
• Polymerase (long stretches of PolyA)
PROBLEM #2: timing and workload
• 10,000 instrument days/human genome
• Only affordable by genome centers
PROBLEM #3: costs
• Human Genome Reference Sequence (2001; 2003): $1 billion (99.995% accurate; 99% complete)
• Estimated Cost for Individual Genome: $10 million (1 year using >30 instruments)
Sequencing Approaches
Journal of Experimental Biology 210, 1518-1525 (2007)
NGS on the Market
Resolve inherent biases of in vivo cloning
• 454 (Roche): first instrument in Oct 2005
- emulsion PCR
- pyrosequencing
- read lengths ca. 400 bp
• Solexa (Illumina): first instrument in July 2006
- PCR on solid support
- reversible terminator sequencing
- read lengths ca. 35 bp
• SOLiD (ABI): first instrument in June 2007
- emulsion PCR
- sequencing by ligation
- read lengths ca. 35 bp
Library Preparation
[Figure: library preparation for 454, Solexa and SOLiD]
http://solid.appliedbiosystems.com
Shendure et al. (2005) Science 309 (1728-1732)
Margulies et al. (2005) Nature 437 (376-380)
Clonal Amplification
Solexa
- DNA attached to the surface
- Bridge Amplification
- Double Strand Denaturation
- Repeat Cycles
454
- Anneal sstDNA
- Emulsion in water-in-oil microreactors
- Clonal Amplification
- Enrichment for DNA-positive beads
Clonal Amplification
SOLiD
- Emulsion PCR
- Clonal Amplification
- Enrichment for DNA-positive beads
- Transfer on solid array
Sequencing by Synthesis
454 (Pyrosequencing)
Reaction with:
• DNA polymerase
• ATP sulfurylase
• luciferase
• apyrase
• APS
• luciferin
Addition of dNTPs one at a time
Solexa (Reversible Terminators)
Reaction with:
• DNA polymerase
• primers
• 4 labelled reversible terminators
Determine first base using laser light
Wash off and repeat for the whole sequence
Sequence Read
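The pyrosequencing readout listed above can be sketched in a few lines. This is a minimal illustration, not the 454 base-calling software: the flow order `TACG`, the signal values and the function name `call_bases` are assumptions made for the example. Because dNTPs are added one at a time, the light intensity measured in each flow is treated as proportional to the number of identical bases incorporated in that flow.

```python
def call_bases(flow_order, signals):
    """Turn per-flow light intensities into a base sequence.

    Each flow adds a single dNTP. The rounded signal is taken as the
    number of identical bases incorporated (0 means no incorporation).
    """
    read = []
    for i, signal in enumerate(signals):
        base = flow_order[i % len(flow_order)]  # dNTP added in this flow
        read.append(base * round(signal))       # homopolymer of that length
    return "".join(read)


flow_order = "TACG"  # assumption: a cyclic nucleotide flow order
signals = [1.02, 0.05, 0.97, 2.1, 0.0, 1.1, 0.1, 0.9]  # made-up intensities

print(call_bases(flow_order, signals))
```

Rounding becomes less reliable as the signal grows, which is one way to picture why long homopolymers are harder to call than isolated bases.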
Sequencing by Ligation
SOLiD
Reaction with:
• Universal Primers
• Ligase
• Probes
1st Cycle Sequencing
- Probe Annealing
- Ligation
- Washing off
- Visualization
- Cleavage
[x 5 times]: 5 base read
Sequencing by Ligation
SOLiD
Following Cycles
- annealing; ligation; washing; visualization; cleavage [x 5 times]
- Reset [x 5 times]
Two Bases Encoding
Capillary Electrophoresis: 1be (one-base encoding)
Sequencing by Ligation: 2be (two-base encoding)
Deconvolution matrix
• A single color does not indicate one single base
• Each read contains information for 2 bases
• To decode the bases you have to know one of them
SNP Detection: Real SNP vs. Miscall
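The decoding rule above (a color encodes a pair of bases, and one base must be known to decode the rest) can be illustrated with a short sketch. The slide's own deconvolution matrix is not reproduced in the text, so the color scheme used here is an assumption: the commonly described four-color code in which the color of a base pair is the XOR of the two-bit base codes A=0, C=1, G=2, T=3. The sequences and function names are hypothetical.

```python
BASES = "ACGT"  # index is the 2-bit code: A=0, C=1, G=2, T=3 (assumed scheme)


def encode(seq):
    """Color of each adjacent base pair = XOR of the two base codes."""
    codes = [BASES.index(b) for b in seq]
    return [a ^ b for a, b in zip(codes, codes[1:])]


def decode(first_base, colors):
    """Rebuild the bases from a known first base and the color calls."""
    code = BASES.index(first_base)
    seq = [first_base]
    for color in colors:
        code ^= color            # next base follows from previous base + color
        seq.append(BASES[code])
    return "".join(seq)


ref = "ATGGCA"
snp = "ATGTCA"                   # one substitution (G>T) relative to ref

ref_colors = encode(ref)
snp_colors = encode(snp)
changed = [i for i, (r, s) in enumerate(zip(ref_colors, snp_colors)) if r != s]

miscall = list(ref_colors)
miscall[2] ^= 1                  # a single wrong color call

print("colors changed by the SNP:", changed)
print("decoded read with one miscalled color:", decode("A", miscall))
```

In this scheme a true single-base substitution changes two adjacent colors, whereas an isolated color change cannot be explained by one substitution and alters every base decoded after it, which is the basis for telling a real SNP from a miscall.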
Massive Parallelization
Read counts: present (expected in 6 months/1 year)
454: Sequencing Reaction within the PicoTiterPlate Device
• 1.6 million wells/plate
• ~420 kread/run (1.2 Mread/run)
Solexa: Sequencing Reaction on planar, optically transparent surface
• > 10 million clusters
• ~50 Mread/run (220 Mread/run)
SOLiD
• ~20 million beads (1 µm diameter)
• ~95 Mread/run (220 Mread/run)
Timing and Throughput
Throughput: today (expected in 6 months-1 year)
• Sanger
- Timing: 10 runs/day
- Throughput: 76.8 kbp/run (96 capillaries); 0.3 Mbp/run (384 capillaries)
• 454 [1]
- Timing: 21 h
- Throughput: 100 Mbp/run (450 Mbp-1 Gbp)
• Solexa [2]
- Timing: 4 days (single fragment); 4.5 days (paired-end)
- Throughput: 1.5 Gbp/run (5-10 Gbp); 3 Gbp/run (10-20 Gbp)
• SOLiD [3]
- Timing: 3 days (single fragment); 6 days (paired-end)
- Throughput: 6 Gbp/run (17 Gbp); 10 Gbp/run (26 Gbp)
[1] Roche-Italy, pers. comm.
[2] Illumina, pers. comm.
[3] AppliedBS, pers. comm.
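As a rough cross-check of these figures, the instrument time needed for one human genome can be estimated from throughput per run and runs per day. The sketch below is an approximation under stated assumptions: a genome size of about 3 Gbp and 10x coverage are not given on this slide, and the function name `instrument_days` is hypothetical.

```python
HUMAN_GENOME_BP = 3.0e9  # assumption: approximate human genome size


def instrument_days(bp_per_run, runs_per_day, coverage, genome_bp=HUMAN_GENOME_BP):
    """Days a single instrument needs to produce `coverage`-fold raw sequence."""
    bp_per_day = bp_per_run * runs_per_day
    return coverage * genome_bp / bp_per_day


COVERAGE = 10  # assumption: 10x depth

# Sanger, 384 capillaries: 0.3 Mbp/run at 10 runs/day (slide values)
sanger_days = instrument_days(0.3e6, 10, COVERAGE)

# SOLiD: 6 Gbp/run, one run per 3 days (slide values; pairing the first
# timing figure with the first throughput figure is an assumption)
solid_days = instrument_days(6e9, 1 / 3, COVERAGE)

print(f"Sanger: {sanger_days:,.0f} instrument days")
print(f"SOLiD:  {solid_days:,.0f} instrument days")
```

Under these assumptions the Sanger estimate comes out at roughly 10,000 instrument days, in line with the figure quoted for traditional sequencing, while the SOLiD estimate is on the order of two weeks.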
Sequencing Costs
• Sanger
- Costs/run: x96; x384
- Costs/kbp: $1 (raw kbp); $7 (consensus kbp, error = 4x10^-6) [2]
• 454 [1]
- Costs/run: €7,195 (100 Mb)
- Costs/kbp: ~€0.07/kbp; ~$9 (consensus kbp, error = 4x10^-5) [2]
• Solexa [3]
- Costs/run: €3,000 (1.5 Gbp)
- Costs/kbp: ~€0.002/kbp
• SOLiD [4]
- Costs/run: €2,478 (6 Gbp)
- Costs/kbp: ~€0.0004/kbp
[1] Roche-Italy, pers. comm.
[2] G. Church, Nat Biotec 24, 139 (2006)
[3] Illumina, pers. comm.
[4] AppliedBS, pers. comm.
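The per-kbp costs follow from dividing the cost of a run by its yield. A minimal sketch of that arithmetic, using the run costs and yields quoted on the slide (the function name `cost_per_kbp` is an assumption made for the example):

```python
def cost_per_kbp(run_cost_eur, bp_per_run):
    """Cost in EUR per kilobase pair of raw sequence."""
    return run_cost_eur / (bp_per_run / 1_000)


# (cost per run in EUR, yield per run in bp), values as quoted on the slide
runs = {
    "454": (7195, 100e6),
    "Solexa": (3000, 1.5e9),
    "SOLiD": (2478, 6e9),
}

for platform, (cost, bp) in runs.items():
    print(f"{platform}: ~EUR {cost_per_kbp(cost, bp):.4f}/kbp")
```

The results reproduce the rounded per-kbp figures listed for each platform.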
Limitations of NGS
PROBLEM #1: length of sequencing reads
• Much shorter than Sanger
PROBLEM #2: (huge) amount of data production
• Difficult data handling and analysis
PROBLEM #3: sequencing accuracy
• No reliable standard available yet
• Difficult to compare different methods
Length of Sequencing Reads
Read Length (bp)
• Sanger: 450-850
• 454: 250 (450)
• Solexa: 35 (75)
• SOLiD: 25-35 (50)
- Resequencing (454): overlapping amplicons needed
- De novo Sequencing: difficult assembly
- Metagenomics: difficult assignment
Paired-End Sequencing
[Figure: paired-end fragment; labels: tag1, tag2, insert]
- Increase the Read Length
- Help in Assembly Reconstruction
- Find Structural Variation (CNV, Rearrangements, etc.)
INSERT LENGTH
• 454: 1.5-3 kbp (16 kbp)
• Solexa: 200-300 bp (2 kbp)
• SOLiD: 600-10,000 bp
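One way paired-end data can point to structural variation is by comparing the distance between the two mapped tags with the expected insert length. The sketch below is a simplified illustration, not a tool from the talk: the `MappedPair` record, the coordinates and the function name are hypothetical, and the 200-300 bp range is the Solexa insert length quoted above.

```python
from dataclasses import dataclass


@dataclass
class MappedPair:
    name: str
    start_tag1: int   # reference position where tag1 starts
    end_tag2: int     # reference position where tag2 ends


def discordant_pairs(pairs, min_insert, max_insert):
    """Return (name, span) for pairs whose mapped span is outside the insert range."""
    flagged = []
    for pair in pairs:
        span = pair.end_tag2 - pair.start_tag1
        if not (min_insert <= span <= max_insert):
            flagged.append((pair.name, span))
    return flagged


pairs = [
    MappedPair("pair1", 1_000, 1_250),   # span 250: as expected
    MappedPair("pair2", 5_000, 5_900),   # span 900: tags map too far apart
    MappedPair("pair3", 9_000, 9_120),   # span 120: tags map too close
]

print(discordant_pairs(pairs, min_insert=200, max_insert=300))
```

Tags that map further apart than the library allows are consistent with a deletion in the sample relative to the reference, and tags that map closer together with an insertion; real analyses also have to account for mapping errors and the spread of insert sizes.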
(Huge) Amount of Data Production
• 454: 1 Gb (10-15 Gb after sequencing)
• Solexa: 10-20 Gb (0.5-2 Tb after sequencing)
• SOLiD: 40 Gb (7-8 Tb after sequencing)
Need ad hoc tool development for data analysis
Sequencing Accuracy
Difficult to compare because based on different technologies
• Sanger
- Raw: 99.5%
- Consensus: 99.995% (10x)
• 454 [1]
- Raw: 99.5% (no homopolymers); 97.0% (with homopolymers; n>7; 0.7% of human genome) [2]
• Solexa [3]
- Raw: 99.8%-98.5%
• 454 and Solexa, consensus: 99.99%; 99.96% (the slide text does not preserve which value belongs to which platform)
• SOLiD [4]
- Raw: 99.94%
- Consensus: 99.999% (15x) (in principle, outstanding accuracy for SNP detection)
[1] Roche-Italy, pers. comm.
[2] Nat Med 12, 852-855 (2006)
[3] Illumina, pers. comm.
[4] AppliedBS, pers. comm.
Comparison of Sequencing Accuracy
Genome Res. (2008) 18:1638-1642
- Generation of a mutant strain of Pichia stipitis (haploid yeast, genome size (GS) = 15.4 Mb, 14 mutations compared to reference)
- Whole-genome mutational profiling using 454, SOLiD, Illumina
- Comparative assessment of the sequence coverage needed to optimize sensitivity and specificity
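A back-of-the-envelope calculation shows why coverage and accuracy matter for this kind of mutational profiling. The sketch below is not taken from the cited study: it assumes errors are spread uniformly over the genome, applies consensus accuracies quoted elsewhere in this talk (99.995% and 99.999%) and the 454 yield of 100 Mbp/run, and uses hypothetical function names.

```python
GENOME_BP = 15.4e6     # Pichia stipitis genome size, from the slide
TRUE_MUTATIONS = 14    # mutations relative to the reference, from the slide


def fold_coverage(bp_sequenced, genome_bp=GENOME_BP):
    """Average depth obtained from a given amount of sequence."""
    return bp_sequenced / genome_bp


def expected_false_calls(consensus_accuracy, genome_bp=GENOME_BP):
    """Expected wrongly called positions, assuming uniformly spread errors."""
    return genome_bp * (1.0 - consensus_accuracy)


print(f"One 100 Mbp run gives ~{fold_coverage(100e6):.1f}x coverage")

for accuracy in (0.99995, 0.99999):  # consensus accuracies quoted in this talk
    errors = expected_false_calls(accuracy)
    print(f"accuracy {accuracy:.3%}: ~{errors:.0f} expected errors "
          f"vs {TRUE_MUTATIONS} true mutations")
```

Even at these accuracies the expected number of erroneous positions exceeds the 14 true mutations, which illustrates why the trade-off between coverage, sensitivity and specificity has to be assessed.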
Human Genome Resequencing
• MAY 2007: James Watson's genome (454)
- less than $1 million
- few months
- Nature (2008) 452, 872-876
• SEP 2007: Diploid Genome of Craig Venter (Sanger)
- ~$1 million
- few months
- PLoS Biology (2007) 5(10): e254
• NOV 2008: Genome of an Asian Individual (Illumina)
- ~$1 million
- few weeks
- Nature (2008) 456: 60-65
• NOV 2008: Genome of a male Yoruba (Illumina)
- US$100,000
- few weeks
- Nature (2008) 456: 53-59
Next-Next Generation Sequencing
Single-molecule Sequencing
Harris et al. Science 320 (2008)
LIBRARY PREPARATION: DNA fragmentation and addition of labeled polyA
SURFACE BINDING: through hybridization with complementary polyT
IMAGING: to establish starting sites of sequencing
SEQUENCING: after adding labeled nucleotide and polymerase, followed by washing and imaging
CHEMICAL CLEAVAGE: after washing, to remove the dye
SECOND CYCLE: with another labeled nucleotide
Next-Next Generation Sequencing
Helicos Web Site
[Figure: individual single molecules that incorporated a fluorescent "G" nucleotide in this cycle]
- At each cycle, the incorporated nucleotides emit light upon illumination
- The sequencer captures images for each strand up to 25 nucleotides
Tracking nucleotide incorporation on each strand determines the exact sequence of each individual DNA molecule
Next-Next Generation Sequencing
Helicos Web Site
• NO PCR
- PCR introduces an uncontrolled bias in template representation, since amplification efficiency varies as a function of template properties
- Errors of polymerase
• LOWER RUNNING COSTS
• HIGHER THROUGHPUT
Single Molecule Sequencing
Visigen
Engineered DNA polymerases; fluorescent nucleotides
During chain extension, when a nucleotide is incorporated, there is energy transfer from the polymerase to the nucleotide, which emits light
http://visigenbio.com/technology_movie_streaming.html