Evolution of DNA Sequencing - talk by Jonathan Eisen for the Bodega Workshop in Applied Phylogenetics

Slides for Jonathan Eisen talk at UC Davis Bodega Bay Workshop in Applied Phylogenetics

!Evolution of Sequencing

!Workshop in Applied Phylogenetics

March 9, 2014 Bodega Bay Marine Lab

!Jonathan A. Eisen

UC Davis Genome Center

http://phylogenomics.wordpress.com


Review Papers

Mardis ER. Next-generation sequencing platforms. Annu Rev Anal Chem 2013;6:287-303. doi: 10.1146/annurev-anchem-062012-092628.


http://www.ncbi.nlm.nih.gov/pubmed?term=Mardis%20ER%5BAuthor%5D&cauthor=true&cauthor_uid=23560931


Review Papers

ANRV353-GG09-20 ARI 25 July 2008 14:57

Next-Generation DNASequencing MethodsElaine R. MardisDepartments of Genetics and Molecular Microbiology and Genome Sequencing Center,Washington University School of Medicine, St. Louis MO 63108; email: [email protected]

Annu. Rev. Genomics Hum. Genet. 2008.9:387–402

First published online as a Review in Advance onJune 24, 2008

The Annual Review of Genomics and Human Geneticsis online at genom.annualreviews.org

This article’s doi:10.1146/annurev.genom.9.081307.164359

Copyright c⃝ 2008 by Annual Reviews.All rights reserved

1527-8204/08/0922-0387$20.00

Key Wordsmassively parallel sequencing, sequencing-by-synthesis, resequencing

AbstractRecent scientific discoveries that resulted from the application of next-generation DNA sequencing technologies highlight the striking impactof these massively parallel platforms on genetics. These new meth-ods have expanded previously focused readouts from a variety of DNApreparation protocols to a genome-wide scale and have fine-tuned theirresolution to single base precision. The sequencing of RNA also hastransitioned and now includes full-length cDNA analyses, serial analysisof gene expression (SAGE)-based methods, and noncoding RNA dis-covery. Next-generation sequencing has also enabled novel applicationssuch as the sequencing of ancient DNA samples, and has substantiallywidened the scope of metagenomic analysis of environmentally derivedsamples. Taken together, an astounding potential exists for these tech-nologies to bring enormous change in genetic and biological researchand to enhance our fundamental biological knowledge.

387

Click here for quick links to Annual Reviews content online, including:

• Other articles in this volume• Top cited articles• Top downloaded articles• Our comprehensive search

FurtherANNUALREVIEWS

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

8.9:

387-

402.

Dow

nloa

ded

from

arjo

urna

ls.a

nnua

lrevi

ews.o

rgby

Uni

vers

idad

Nac

iona

l Aut

onom

a de

Mex

ico

on 1

1/17

/09.

For

per

sona

l use

onl

y.

ANRV353-GG09-20 ARI 25 July 2008 14:57

Next-Generation DNASequencing MethodsElaine R. MardisDepartments of Genetics and Molecular Microbiology and Genome Sequencing Center,Washington University School of Medicine, St. Louis MO 63108; email: [email protected]

Annu. Rev. Genomics Hum. Genet. 2008.9:387–402

First published online as a Review in Advance onJune 24, 2008

The Annual Review of Genomics and Human Geneticsis online at genom.annualreviews.org

This article’s doi:10.1146/annurev.genom.9.081307.164359

Copyright c⃝ 2008 by Annual Reviews.All rights reserved

1527-8204/08/0922-0387$20.00

Key Wordsmassively parallel sequencing, sequencing-by-synthesis, resequencing

AbstractRecent scientific discoveries that resulted from the application of next-generation DNA sequencing technologies highlight the striking impactof these massively parallel platforms on genetics. These new meth-ods have expanded previously focused readouts from a variety of DNApreparation protocols to a genome-wide scale and have fine-tuned theirresolution to single base precision. The sequencing of RNA also hastransitioned and now includes full-length cDNA analyses, serial analysisof gene expression (SAGE)-based methods, and noncoding RNA dis-covery. Next-generation sequencing has also enabled novel applicationssuch as the sequencing of ancient DNA samples, and has substantiallywidened the scope of metagenomic analysis of environmentally derivedsamples. Taken together, an astounding potential exists for these tech-nologies to bring enormous change in genetic and biological researchand to enhance our fundamental biological knowledge.

387

Click here for quick links to Annual Reviews content online, including:

• Other articles in this volume• Top cited articles• Top downloaded articles• Our comprehensive search

FurtherANNUALREVIEWS

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

8.9:

387-

402.

Dow

nloa

ded

from

arjo

urna

ls.a

nnua

lrevi

ews.o

rgby

Uni

vers

idad

Nac

iona

l Aut

onom

a de

Mex

ico

on 1

1/17

/09.

For

per

sona

l use

onl

y.



Open Access Papers of Interest

• http://www.microbialinformaticsj.com/content/2/1/3/ • http://www.hindawi.com/journals/bmri/2012/251364/abs/ • http://m.cancerpreventionresearch.aacrjournals.org/

content/5/7/887.full


http://m.cancerpreventionresearch.aacrjournals.org/content/5/7/887.full


Approaching to NGS

Discovery of DNA structure(Cold Spring Harb. Symp. Quant. Biol. 1953;18:123-31)

1953

Sanger sequencing method by F. Sanger(PNAS ,1977, 74: 560-564)

1977

PCR by K. Mullis(Cold Spring Harb Symp Quant Biol. 1986;51 Pt 1:263-73)

1983

Development of pyrosequencing(Anal. Biochem., 1993, 208: 171-175; Science ,1998, 281: 363-365)

1993

1980

1990

2000

2010

Single molecule emulsion PCR 1998

Human Genome Project(Nature , 2001, 409: 860–92; Science, 2001, 291: 1304–1351)

Founded 454 Life Science 2000

454 GS20 sequencer(First NGS sequencer) 2005

Founded Solexa 1998

Solexa Genome Analyzer(First short-read NGS sequencer) 2006

GS FLX sequencer(NGS with 400-500 bp read lenght) 2008

Hi-Seq2000(200Gbp per Flow Cell) 2010

Illumina acquires Solexa(Illumina enters the NGS business) 2006

ABI SOLiD(Short-read sequencer based upon ligation) 2007

Roche acquires 454 Life Sciences(Roche enters the NGS business) 2007

NGS Human Genome sequencing(First Human Genome sequencing based upon NGS technology) 2008

Miseq Roche Jr Ion Torrent PacBio Oxford

From Slideshare presentation of Cosentino Cristian http://www.slideshare.net/cosentia/high-throughput-equencing

Sequencing Technology Timeline


http://www.slideshare.net/cosentia/high-throughput-equencing


Generation I: Manual Sequencing



Maxam-Gilbert Sequencing

http://www.pnas.org/content/74/2/560.full.pdf


http://www.pnas.org/content/74/2/560.full.pdf


Sanger Sequencing of PhiX174

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC431765/


http://www.ncbi.nlm.nih.gov/pmc/articles/PMC431765/


Sanger Sequencing



Sanger Sequencing



Nobel Prize 1980: Berg, Gilbert, Sanger



Generation II: Automated Sanger



Automation of Sanger Part ISanger method with labeled dNTPs

The Sanger mehtods is based on the idea that inhibitors can terminate elongation of DNA at specific points

-

Roche 454

ABi SOLiD

HeliScope

Nanopore

Sanger method

Illumina GAII



Automation of Sanger Part II



Automated Sanger Highlights

• 1991: ESTs by Venter • 1995: Haemophilus influenzae genome • 1996: Yeast, archaeal genomes • 1999: Drosophila genome • 2000: Arabidopsis genome • 2000: Human genome • 2004: Shotgun metagenomics



Generation III: Clusters not Clones



Generation III = “NextGen”



NextGen Sequencing OutlineNext-generation sequencing platforms

Isolation and purification of target DNA

Sample preparation

Library validation

Cluster generationon solid-phase Emulsion PCR

Sequencing by synthesis with 3’-blocked reversible

terminatorsPyrosequencing Sequencing by ligation

Single colour imaging

Sequencing by synthesis with 3’-unblocked reversible

terminators

Am

plifi

catio

nSe

quen

cing

Imag

ing

Four colour imaging

Data analysis

Roche 454Illumina GAII ABi SOLiD Helicos HeliScope





PyrosequencingSanger method

-

ABi SOLiD

HeliScope

Nanopore

Roche 454

Illumina GAII


NextGen #1: 454





-

ABi SOLiD

HeliScope

Nanopore

Roche 454

Illumina GAII

NextGen #1: Roche 454From Slideshare presentation of Cosentino Cristian http://www.slideshare.net/cosentia/high-throughput-equencing





-

ABi SOLiD

HeliScope

Nanopore

Roche 454

Illumina GAII

NextGen #1: Roche 454From Slideshare presentation of Cosentino Cristian http://www.slideshare.net/cosentia/high-throughput-equencing




Roche 454 Wokflow

From http://acb.qfab.org/acb/ws09/presentations/Day1_DMiller.pdfhttp://www.slideshare.net/AGRF_Ltd/ngs-technologies-platforms-and-applications


http://acb.qfab.org/acb/ws09/presentations/Day1_DMiller.pdf


ANRV353-GG09-20 ARI 25 July 2008 14:57

Anneal sstDNA to an excess ofDNA capture beads

Emulsify beads and PCRreagents in water-in-oilmicroreactors

Clonal amplification occursinside microreactors

Break microreactors andenrich for DNA-positivebeads

Amplified sstDNA library beads Quality filtered bases

a

b

c

DNA library preparation

Emulsion PCR

Sequencing

A

A

A B

B

B

4.5 hours

8 hours

7.5 hours

Ligation

Selection(isolate ABfragmentsonly)

•Genome fragmented by nebulization•No cloning; no colony picking•sstDNA library created with adaptors•A/B fragments selected using avidin-biotin purification

gDNA sstDNA library

sstDNA library Bead-amplified sstDNA library

•Well diameter: average of 44 µm•400,000 reads obtained in parallel•A single cloned amplified sstDNA bead is deposited per well

390 Mardis

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

8.9:

387-

402.

Dow

nloa

ded

from

arjo

urna

ls.a

nnua

lrevi

ews.o

rgby

Uni

vers

idad

Nac

iona

l Aut

onom

a de

Mex

ico

on 1

1/17

/09.

For

per

sona

l use

onl

y.

gDNA fragmented by nebulization or sonication

Fragments are end- repaired and ligated to adaptors containing universal priming sites

Fragments are denatured and AB ssDNA are selected by avidin/biotin purification (ssDNA library)

From Mardis 2008. Annual Rev. Genetics 9: 387.

Roche 454 Step 1: Libraries



ANRV353-GG09-20 ARI 25 July 2008 14:57






a

b

c


Emulsion PCR

Sequencing

A

A

A B

B

B

4.5 hours

8 hours

7.5 hours

Ligation



gDNA sstDNA library



390 Mardis

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

8.9:

387-

402.

Dow

nloa

ded

from

arjo

urna

ls.a

nnua

lrevi

ews.o

rgby

Uni

vers

idad

Nac

iona

l Aut

onom

a de

Mex

ico

on 1

1/17

/09.

For

per

sona

l use

onl

y.


Roche 454 Step 2: Emulsion PCR



ANRV353-GG09-20 ARI 25 July 2008 14:57






a

b

c


Emulsion PCR

Sequencing

A

A

A B

B

B

4.5 hours

8 hours

7.5 hours

Ligation



gDNA sstDNA library



390 Mardis

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

8.9:

387-

402.

Dow

nloa

ded

from

arjo

urna

ls.a

nnua

lrevi

ews.o

rgby

Uni

vers

idad

Nac

iona

l Aut

onom

a de

Mex

ico

on 1

1/17

/09.

For

per

sona

l use

onl

y.


Roche 454 Step 3: Pyrosequencing



Pyrosequencing

44 µm

Pyrosequecning

Reads are recorded as flowgrams

Annu. Rev. Genomics Hum. Genet., 2008, 9: 387-402Nature Reviews genetics, 2010, 11: 31-46

Sanger method

-

ABi SOLiD

HeliScope

Nanopore

Roche 454

Illumina GAII


Roche 454 Step 3: Pyrosequencing




Roche 454 Key Issues

• Number of repeated nucleotides estimated by amount of light ... many errors

• Reasonable number of failures in EM-PCR and other steps



Roche 454 Evolution

http://www.slideshare.net/AGRF_Ltd/ngs-technologies-platforms-and-applications



NextGen #2: SolexaSequecning by synthesis with reversible terminatorSanger method

Roche 454

ABi SOLiD

HeliScope

Nanopore

Illumina GAII

-





Sequecning by synthesis with reversible terminatorSanger method

Roche 454

ABi SOLiD

HeliScope

Nanopore

Illumina GAII

-


NextGen #2: Solexa Illumina




NextGen #2: Illumina AccessoriesCluster station

Genome Analyzer IIxPaired-end module Linux server

Bioanalyzer 2100

Instrumentation

Sample preparation

Clusters amplification

Sequencing by synthesis

Analysis pipeline

Introduction

Illumina GAII

High throughput





Illumina Outline


Clu

ster

stat

ion

Wash cluster station

Clu

ster

gen

erat

ion

Linearization, Blocking and

primer Hybridization

Read 1

Prepare read 2

Read 2GA

IIx

& P

E

SBS

sequ

enci

ng

Pipeline base call

Data analysis

Sample preparation and

library validation

Ana

lysi

s

Sequencing workflow

Sample preparation



Analysis pipeline

Introduction

Illumina GAII

High throughput





Illumina Step 1: Prep & Attach DNA

ANRV353-GG09-20 ARI 25 July 2008 14:57

to prepare each strand for the next incorpora-tion by DNA polymerase. This series of stepscontinues for a specific number of cycles, as de-termined by user-defined instrument settings,which permits discrete read lengths of 25–35

bases. A base-calling algorithm assigns se-quences and associated quality values to eachread and a quality checking pipeline evaluatesthe Illumina data from each run, removingpoor-quality sequences.

Adapter

DNA fragment

Dense lawnof primers

Adapter

Attached

DNA

Adapters

Prepare genomic DNA sampleRandomly fragment genomic DNAand ligate adapters to both ends ofthe fragments.

Attach DNA to surfaceBind single-stranded fragmentsrandomly to the inside surfaceof the flow cell channels.

Bridge amplificationAdd unlabeled nucleotidesand enzyme to initiate solid-phase bridge amplification.

Denature the doublestranded molecules

Nucleotides

a

Figure 2The Illumina sequencing-by-synthesis approach. Cluster strands created by bridge amplification are primed and all four fluorescentlylabeled, 3′-OH blocked nucleotides are added to the flow cell with DNA polymerase. The cluster strands are extended by onenucleotide. Following the incorporation step, the unused nucleotides and DNA polymerase molecules are washed away, a scan buffer isadded to the flow cell, and the optics system scans each lane of the flow cell by imaging units called tiles. Once imaging is completed,chemicals that effect cleavage of the fluorescent labels and the 3′-OH blocking groups are added to the flow cell, which prepares thecluster strands for another round of fluorescent nucleotide incorporation.

392 Mardis

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

8.9:

387-

402.

Dow

nloa

ded

from

arjo

urna

ls.a

nnua

lrevi

ews.o

rgby

Uni

vers

idad

Nac

iona

l Aut

onom

a de

Mex

ico

on 1

1/17

/09.

For

per

sona

l use

onl

y.


Step 1: Sample Preparation The DNA sample of interest is sheared to appropriate size (average ~800bp) using a compressed air device known as a nebulizer. The ends of the DNA are polished, and two unique adapters are ligated to the fragments. Ligated fragments of the size range of 150-200bp are isolated via gel extraction and amplified using limited cycles of PCR



Illumina Step 2: Clusters by Bridge PCR

ANRV353-GG09-20 ARI 25 July 2008 14:57

to prepare each strand for the next incorpora-tion by DNA polymerase. This series of stepscontinues for a specific number of cycles, as de-termined by user-defined instrument settings,which permits discrete read lengths of 25–35

bases. A base-calling algorithm assigns se-quences and associated quality values to eachread and a quality checking pipeline evaluatesthe Illumina data from each run, removingpoor-quality sequences.

Adapter

DNA fragment

Dense lawnof primers

Adapter

Attached

DNA

Adapters

Prepare genomic DNA sampleRandomly fragment genomic DNAand ligate adapters to both ends ofthe fragments.

Attach DNA to surfaceBind single-stranded fragmentsrandomly to the inside surfaceof the flow cell channels.

Bridge amplificationAdd unlabeled nucleotidesand enzyme to initiate solid-phase bridge amplification.

Denature the doublestranded molecules

Nucleotides

a

Figure 2The Illumina sequencing-by-synthesis approach. Cluster strands created by bridge amplification are primed and all four fluorescentlylabeled, 3′-OH blocked nucleotides are added to the flow cell with DNA polymerase. The cluster strands are extended by onenucleotide. Following the incorporation step, the unused nucleotides and DNA polymerase molecules are washed away, a scan buffer isadded to the flow cell, and the optics system scans each lane of the flow cell by imaging units called tiles. Once imaging is completed,chemicals that effect cleavage of the fluorescent labels and the 3′-OH blocking groups are added to the flow cell, which prepares thecluster strands for another round of fluorescent nucleotide incorporation.

392 Mardis

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

8.9:

387-

402.

Dow

nloa

ded

from

arjo

urna

ls.a

nnua

lrevi

ews.o

rgby

Uni

vers

idad

Nac

iona

l Aut

onom

a de

Mex

ico

on 1

1/17

/09.

For

per

sona

l use

onl

y.


• From : http://seqanswers.com/forums/showthread.php?t=21. Steps 2-6: Cluster Generation by Bridge Amplification. In contrast to the 454 and ABI methods which use a bead-based emulsion PCR to generate "polonies", Illumina utilizes a unique "bridged" amplification reaction that occurs on the surface of the flow cell. The flow cell surface is coated with single stranded oligonucleotides that correspond to the sequences of the adapters ligated during the sample preparation stage. Single-stranded, adapter-ligated fragments are bound to the surface of the flow cell exposed to reagents for polyermase-based extension. Priming occurs as the free/distal end of a ligated fragment "bridges" to a complementary oligo on the surface. Repeated denaturation and extension results in localized amplification of single molecules in millions of unique locations across the flow cell surface. This process occurs in what is referred to as Illumina's "cluster station", an automated flow cell processor.


http://seqanswers.com/forums/showthread.php?t=21


Clusters



ANRV353-GG09-20 ARI 25 July 2008 14:57

Applied Biosystems SOLiDTM

SequencerThe SOLiD platform uses an adapter-ligatedfragment library similar to those of the othernext-generation platforms, and uses an emul-sion PCR approach with small magnetic beadsto amplify the fragments for sequencing. Un-like the other platforms, SOLiD uses DNA lig-ase and a unique approach to sequence the am-plified fragments, as illustrated in Figure 3a.Two flow cells are processed per instrumentrun, each of which can be divided to containdifferent libraries in up to four quadrants. Readlengths for SOLiD are user defined between25–35 bp, and each sequencing run yields be-tween 2–4 Gb of DNA sequence data. Once

the reads are base called, have quality values,and low-quality sequences have been removed,the reads are aligned to a reference genome toenable a second tier of quality evaluation calledtwo-base encoding. The principle of two-baseencoding is shown in Figure 3b, which illus-trates how this approach works to differenti-ate true single base variants from base-callingerrors.

Two key differences that speak to the utilityof next-generation sequence reads are (a) thelength of a sequence read from all current next-generation platforms is much shorter than thatfrom a capillary sequencer and (b) each next-generation read type has a unique error modeldifferent from that already established for

b

Laser

First chemistry cycle:determine first baseTo initiate the firstsequencing cycle, addall four labeled reversibleterminators, primers, andDNA polymerase enzymeto the flow cell.

Image of first chemistry cycleAfter laser excitation, capture the imageof emitted fluorescence from eachcluster on the flow cell. Record theidentity of the first base for each cluster.

Sequence read over multiple chemistry cyclesRepeat cycles of sequencing to determine the sequenceof bases in a given fragment a single base at a time.

Before initiating thenext chemistry cycleThe blocked 3' terminusand the fluorophorefrom each incorporatedbase are removed.

GCTGA...

Figure 2(Continued )

www.annualreviews.org • Next-Generation DNA Sequencing Methods 393

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

8.9:

387-

402.

Dow

nloa

ded

from

arjo

urna

ls.a

nnua

lrevi

ews.o

rgby

Uni

vers

idad

Nac

iona

l Aut

onom

a de

Mex

ico

on 1

1/17

/09.

For

per

sona

l use

onl

y.


Illumina Step 3: Sequencing by Synthesis

From : http://seqanswers.com/forums/showthread.php?t=21. Steps 7-12: Sequencing by Synthesis. A flow cell containing millions of unique clusters is now loaded into the 1G sequencer for automated cycles of extension and imaging. The first cycle of sequencing consists first of the incorporation of a single fluorescent nucleotide, followed by high resolution imaging of the entire flow cell. These images represent the data collected for the first base. Any signal above background identifies the physical location of a cluster (or polony), and the fluorescent emission identifies which of the four bases was incorporated at that position. This cycle is repeated, one base at a time, generating a series of images each representing a single base extension at a specific cluster. Base calls are derived with an algorithm that identifies the emission color over time. At this time reports of useful Illumina reads range from 26-50 bases.


http://seqanswers.com/forums/showthread.php?t=21


SBS technology

Sample preparation



Analysis pipeline

Introduction

Illumina GAII

High throughput


Illumina Step 3: Sequencing by Synthesis




ANRV353-GG09-20 ARI 25 July 2008 14:57

Applied Biosystems SOLiDTM

SequencerThe SOLiD platform uses an adapter-ligatedfragment library similar to those of the othernext-generation platforms, and uses an emul-sion PCR approach with small magnetic beadsto amplify the fragments for sequencing. Un-like the other platforms, SOLiD uses DNA lig-ase and a unique approach to sequence the am-plified fragments, as illustrated in Figure 3a.Two flow cells are processed per instrumentrun, each of which can be divided to containdifferent libraries in up to four quadrants. Readlengths for SOLiD are user defined between25–35 bp, and each sequencing run yields be-tween 2–4 Gb of DNA sequence data. Once

the reads are base called, have quality values,and low-quality sequences have been removed,the reads are aligned to a reference genome toenable a second tier of quality evaluation calledtwo-base encoding. The principle of two-baseencoding is shown in Figure 3b, which illus-trates how this approach works to differenti-ate true single base variants from base-callingerrors.

Two key differences that speak to the utilityof next-generation sequence reads are (a) thelength of a sequence read from all current next-generation platforms is much shorter than thatfrom a capillary sequencer and (b) each next-generation read type has a unique error modeldifferent from that already established for

b

Laser

First chemistry cycle:determine first baseTo initiate the firstsequencing cycle, addall four labeled reversibleterminators, primers, andDNA polymerase enzymeto the flow cell.

Image of first chemistry cycleAfter laser excitation, capture the imageof emitted fluorescence from eachcluster on the flow cell. Record theidentity of the first base for each cluster.

Sequence read over multiple chemistry cyclesRepeat cycles of sequencing to determine the sequenceof bases in a given fragment a single base at a time.

Before initiating thenext chemistry cycleThe blocked 3' terminusand the fluorophorefrom each incorporatedbase are removed.

GCTGA...

Figure 2(Continued )

www.annualreviews.org • Next-Generation DNA Sequencing Methods 393

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

8.9:

387-

402.

Dow

nloa

ded

from

arjo

urna

ls.a

nnua

lrevi

ews.o

rgby

Uni

vers

idad

Nac

iona

l Aut

onom

a de

Mex

ico

on 1

1/17

/09.

For

per

sona

l use

onl

y.


Illumina Step 3: Cycling



Illumina Evolution




MiSeq Dx



HiSeq x Ten



HiSeq x Ten



NextGen #3: 454: ABI SolidSequecning by ligationSanger method

Roche 454

-

HeliScope

Nanopore

ABi SOLiD

Illumina GAII





ABI Solid DetailsANRV353-GG09-20 ARI 25 July 2008 14:57

A C G T

1st base

2nd base

ACGT

3'TAnnnzzz5'

3'TCnnnzzz5'

3'TGnnnzzz5'

3'TTnnnzzz5'

Cleavage site

Di base probesSOLiD™ substrate

3'TA

AT

Universal seq primer (n)

3'P1 adapter Template sequence

P OH

Universal seq primer (n–1)

Ligase

Phosphatase

+

1. Prime and ligate

2. Image

4. Cleave off fluor

5. Repeat steps 1–4 to extend sequence

3'

Universal seq primer (n–1)1. Melt off extended sequence

2. Primer reset3'

AA AC G

G GG

C C

C

T AA

A GG

CC

T T TT

6. Primer reset

7. Repeat steps 1–5 with new primer

8. Repeat Reset with , n–2, n–3, n–4 primers

TA

AT

AT3'

TA

AT3'

Excite Fluorescence

Cleavage agent

P

HO

TAAA AG AC AAATTT TC TG TT AC

TGCGGC 3'

3. Cap unextended strands

3'

PO4

1 2 3 4 5 6 7 ... (n cycles)Ligation cycle

3'

3'1 μmbead

1 μmbead

1 μmbead

–1


Universal seq primer (n)




3'

3'

3'

3'

3'

1

2

3

4

5

Primer round 1

Template

Primer round 2 1 base shift

Glass slide

3'5' Template sequence1 μmbead P1 adapter

Bridge probe

Bridge probe

Bridge probe

Read position

Indicates positions of interrogation

35 34 3332313029282726252423222120191817161514131211109876543210

Ligation cycle

Prim

er ro

und

1 2 3 4 5 6 7

a

394 Mardis

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

8.9:

387-

402.

Dow

nloa

ded

from

arjo

urna

ls.a

nnua

lrevi

ews.o

rgby

Uni

vers

idad

Nac

iona

l Aut

onom

a de

Mex

ico

on 1

1/17

/09.

For

per

sona

l use

onl

y.

The ligase-mediated sequencing approach of the Applied Biosystems SOLiD sequencer. In a manner similar to Roche/454 emulsion PCR amplification, DNA fragments for SOLiD sequencing are amplified on the surfaces of 1-μm magnetic beads to provide sufficient signal during the sequencing reactions, and are then deposited onto a flow cell slide. Ligase-mediated sequencing begins by annealing a primer to the shared adapter sequences on each amplified fragment, and then DNA ligase is provided along with specific fluorescent- labeled 8mers, whose 4th and 5th bases are encoded by the attached fluorescent group. Each ligation step is followed by fluorescence detection, after which a regeneration step removes bases from the ligated 8mer (including the fluorescent group) and concomitantly prepares the extended primer for another round of ligation. (b) Principles of two-base encoding. Because each fluorescent group on a ligated 8mer identifies a two-base combination, the resulting sequence reads can be screened for base-calling errors versus true polymorphisms versus single base deletions by aligning the individual reads to a known high-quality reference sequence.




ABI Solid Evolution




Complete Genomics

4333 dx.doi.org/10.1021/ac2010857 |Anal. Chem. 2011, 83, 4327–4341

Analytical Chemistry REVIEW

dyskinesia greatly decreased the number of false positive genecandidates which ultimately reduced the number gene candidatesfrom 34 to just four.Just one month later, the second externally published applica-

tion of Complete Genomics’s sequencing technology was re-leased by a research group at Genetech.57 The study comparedthe genome of primary lung tumor cells to that of adjacentnormal tissue obtained from a 51-year-old Caucasian male whoreported a heavy 15-year smoking history. When the completegenomes of different tissue samples were compared, over 50,000single-base variations were identified and 530 previously re-ported single-base mutations were confirmed. Consequently,the importance of complete cancer genome analysis in under-standing cancer evolution and treatment was brought to light dueto the large number of single nucleotide mutations locatedoutside of oncogenes as well as chromosomal structural varia-tions found in the primary lung tumor.A third application of the high-throughput cPAL method

developed by Complete Genomics was published by a researchgroup from the University of Texas Southwestern MedicalCenter in Dallas, Texas (USA).58 This group used whole genomesequencing to diagnose a hypercholesterolemic 11-month-oldgirl with sitosterolemia after a series of blood tests and selectivegenetic sequencing were unable to confer a reasonable diagnosis.The gene and the subsequent mutations responsible for the

sitosterolemia phenotype were determined after comparison ofthe patient’s genome to a collection of reference genomes.Ultimately, it was determined that the patient failed the standardblood test due to low levels of plant sterols that were the result ofa heavy diet of breast milk. This study illustrated the importanceof whole genome sequencing in effectively diagnosing a disease inthe presence of complex environmental factors that can influencestandard assays.

’SEQUENCING BY SYNTHESIS

The idea of sequencing by synthesis has been around for sometime and is the basis for several second-generation sequencingtechnologies including Roche’s 454 sequencing platform andIllumina’s line of sequencing systems. 454’s pyrosequencingmethod, which uses an enzyme cascade to produce light from apyrophosphate released during nucleotide incorporation, wasfirst piloted in the late 1980s and developed for DNA sequencingin themid-1990s.26,59!63 Illumina’s fluorescently labeled sequen-cing by synthesis technique employs fluorescently labeled nu-cleotides with reversible termination chemistry and modifiedpolymerases for improved incorporation of nucleotideanalogues.19 These sequencing by synthesis methods increasedthroughput compared to first-generation sequencing methods;however, optical imaging is needed to detect each sequencing

Figure 3. Schematic of Complete Genomics’ DNB array generation and cPAL technology. (A) Design of sequencing fragments, subsequent DNBsynthesis, and dimensions of the patterned nanoarray used to localize DNBs illustrate the DNB array formation. (B) Illustration of sequencing with a setof common probes corresponding to 5 bases from the distinct adapter site. Both standard and extended anchor schemes are shown. Reprinted withpermission from ref 50. Copyright XXXX American Association for the Advancement of Science.



dyskinesia greatly decreased the number of false positive genecandidates which ultimately reduced the number gene candidatesfrom 34 to just four.Just one month later, the second externally published applica-

tion of Complete Genomics’s sequencing technology was re-leased by a research group at Genetech.57 The study comparedthe genome of primary lung tumor cells to that of adjacentnormal tissue obtained from a 51-year-old Caucasian male whoreported a heavy 15-year smoking history. When the completegenomes of different tissue samples were compared, over 50,000single-base variations were identified and 530 previously re-ported single-base mutations were confirmed. Consequently,the importance of complete cancer genome analysis in under-standing cancer evolution and treatment was brought to light dueto the large number of single nucleotide mutations locatedoutside of oncogenes as well as chromosomal structural varia-tions found in the primary lung tumor.A third application of the high-throughput cPAL method

developed by Complete Genomics was published by a researchgroup from the University of Texas Southwestern MedicalCenter in Dallas, Texas (USA).58 This group used whole genomesequencing to diagnose a hypercholesterolemic 11-month-oldgirl with sitosterolemia after a series of blood tests and selectivegenetic sequencing were unable to confer a reasonable diagnosis.The gene and the subsequent mutations responsible for the

sitosterolemia phenotype were determined after comparison ofthe patient’s genome to a collection of reference genomes.Ultimately, it was determined that the patient failed the standardblood test due to low levels of plant sterols that were the result ofa heavy diet of breast milk. This study illustrated the importanceof whole genome sequencing in effectively diagnosing a disease inthe presence of complex environmental factors that can influencestandard assays.

’SEQUENCING BY SYNTHESIS

The idea of sequencing by synthesis has been around for sometime and is the basis for several second-generation sequencingtechnologies including Roche’s 454 sequencing platform andIllumina’s line of sequencing systems. 454’s pyrosequencingmethod, which uses an enzyme cascade to produce light from apyrophosphate released during nucleotide incorporation, wasfirst piloted in the late 1980s and developed for DNA sequencingin themid-1990s.26,59!63 Illumina’s fluorescently labeled sequen-cing by synthesis technique employs fluorescently labeled nu-cleotides with reversible termination chemistry and modifiedpolymerases for improved incorporation of nucleotideanalogues.19 These sequencing by synthesis methods increasedthroughput compared to first-generation sequencing methods;however, optical imaging is needed to detect each sequencing

Figure 3. Schematic of Complete Genomics’ DNB array generation and cPAL technology. (A) Design of sequencing fragments, subsequent DNBsynthesis, and dimensions of the patterned nanoarray used to localize DNBs illustrate the DNB array formation. (B) Illustration of sequencing with a setof common probes corresponding to 5 bases from the distinct adapter site. Both standard and extended anchor schemes are shown. Reprinted withpermission from ref 50. Copyright XXXX American Association for the Advancement of Science.

Figure 3. Schematic of Complete Genomics’ DNB array generation and cPAL technology. (A) Design of sequencing fragments, subsequent DNB synthesis, and dimensions of the patterned nanoarray used to localize DNBs illustrate the DNB array formation. (B) Illustration of sequencing with a set of common probes corresponding to 5 bases from the distinct adapter site. Both standard and extended anchor schemes are shown.

From Niedringhaus et al. Analytical Chemistry 83: 4327. 2011.



Comparison in 2008

Roche (454) Illumina SOLiD

Chemistry Pyrosequencing Polymerase-based Ligation-based

Amplification Emulsion PCR Bridge Amp Emulsion PCR

Paired ends/sep Yes/3kb Yes/200 bp Yes/3 kb

Mb/run 100 Mb 1300 Mb 3000 Mb

Time/run 7 h 4 days 5 days

Read length 250 bp 32-40 bp 35 bp

Cost per run (total)

$8439 $8950 $17447

Cost per Mb $84.39 $5.97 $5.81From “Introduction to Next Generation Sequencing” by Stefan Bekiranov prometheus.cshl.org/twiki/pub/Main/CdAtA08/CSHL_nextgen.ppt


http://prometheus.cshl.org/twiki/pub/Main/CdAtA08/CSHL_nextgen.ppt


Comparison in 2012

Roche (454) Illumina SOLiD

Chemistry Pyrosequencing Polymerase-based Ligation-based

Amplification Emulsion PCR Bridge Amp Emulsion PCR

Paired ends/sep Yes/3kb Yes/200 bp Yes/3 kb

Mb/run 100 Mb 1300 Mb 3000 Mb

Time/run 7 h 4 days 5 days

Read length 250 bp 32-40 bp 35 bp

Cost per run (total)

$8439 $8950 $17447

Cost per Mb $84.39 $5.97 $5.81From “Introduction to Next Generation Sequencing” by Stefan Bekiranov prometheus.cshl.org/twiki/pub/Main/CdAtA08/CSHL_nextgen.ppt


http://prometheus.cshl.org/twiki/pub/Main/CdAtA08/CSHL_nextgen.ppt


Bells and Whistles



Multiplexing

From http://www.illumina.com/technology/multiplexing_sequencing_assay.ilmn


http://www.illumina.com/technology/multiplexing_sequencing_assay.ilmn


Multiplexing

http://res.illumina.com/documents/products/datasheets/datasheet_sequencing_multiplex.pdf



Small Amounts of DNA

http://www.epibio.com/docs/default-source/protocols/nextera-dna-sample-prep-kit-(illumina--compatible).pdf?sfvrsn=4



Capture MethodsHigh throughput sample preparation

Sample preparation



Analysis pipeline

Introduction

Illumina GAII

High throughput

Nature Methods, 2010, 7: 111-118

RainDanceMicrodroplet PCR

Roche NimblegenSalid-phase capture with custom-

designed oligonucleotide microarray

Reported 84% of capture efficiency

Reported 65-90% of capture efficiency





High throughput sample preparation

Sample preparation



Analysis pipeline

Introduction

Illumina GAII

High throughput

Agilent SureSelectSolution-phase capture with

streptavidin-coated magnetic beads

Reported 60-80% of capture efficiency


Capture Methods




Illumina Paired EndsPaired-end technology

Paired-end sequencing works into GA and uses chemicals from the PE module to perform cluster amplification of the reverse strandSample

preparation



Analysis pipeline

Introduction

Illumina GAII

High throughput





Moleculo

Large fragmentsDNA

Isolate and amplify

CACC GGAA TCTC ACGT AAGG GATC AAAA

Sublibrary w/ unique barcodes

Sequence w/ Illumina

Assemble seqs w/ same codes



Generation III+: Faster w/ Clusters



Ion Torrent PGM



Applied Biosystems Ion Torrent PGM



Applied Biosystems Ion Torrent PGM

Workflow similar to that for Roche/454 systems. !Not surprising, since invented by people from 454.



Ion Torrent pH Based Sequencing

Mardis ER. Next-generation sequencing platforms. Annu Rev Anal Chem 2013;6:287-303.




Ion Torrent Evolution



Generation IV: Single Molecule



Single Molecule I: Helicos

Page 8 Barbara Hutter 3rd Generation Sequencing

Helicos Genetic Analysis System

● Close to 3rd Gen boundary

● First DNA-sequencing instrument to operate by

imaging individual DNA molecules

● Individual DNA molecules fixed to a surface

● Proprietary Virtual Terminator nucleotides allow for

step-wise sequencing

● ~ 1 billion molecules sequenced in 8 days ∼

● High raw error rate (over 5%) improved by

consensus sequencing

● Reads only ~ 32 nucleotides

● Higher costs than 2nd Gen sequencing

● Direct RNA sequencing possible

● Helicos BioSciences have re-focused on molecular

diagnostics

http://en.wikipedia.org/wiki/Single_molecule_fluorescent_sequencing



Single Molecule II: Pacific Biosciences




Mardis ER. Next-generation sequencing platforms. Annu Rev Anal Chem 2013;6:287-303.






Φ29 polymerase. Each amplified product of a circularizedfragment is called a DNA nanoball (DNB). DNBs are selectivelyattached to a hexamethyldisilizane (HMDS) coated silicon chipthat is photolithographically patterned with aminosilane activesites. Figure 3A illustrates the DNB array design.The use of the DNBs coupled with the highly patterned array

offers several advantages. The production of DNBs increasessignal intensity by simply increasing the number of hybridizationsites available for probing. Also, the size of the DNB is on thesame length scale as the active site or “sticky” spot patterned onthe chip, which results in attachment of one DNB per site. Sincethe active sites are spaced roughly 1 μm apart, up to three billionDNB can be fixed to a 1 in. by 3 in. silicon chip.51 In addition toincreasing the number of sequencing fragments per chip, thelength scales of the size and spacing of the DNBs maximizes thepixel use in the detector. This highly efficient approach togenerating a hybridization array results in decreased reagentcosts and increased throughput compared to other secondgeneration DNA sequencing arrays that have been used tosequence complete human genomes.19,22,52

Once the DNB array chip is generated, a library of fortycommon probes is used in combination with standard anchorsand extended anchors to perform an unchained hybridizationand ligation assay. The forty common probes consist of twosubsets: probes that interrogate 50 of the distinct adapter site inthe DNB and probes that interrogate 30 of the distinct adaptersite in the DNB. In each subset, there are five sets of fourcommon probes; each probe is 9 bases in length. Each setcorresponds to positions 1 to 5 bases away from the distinctadapter sites in the sequencing substrate, and within each set,there are four distinct markers corresponding to each base. Thestandard anchors bind directly to the 50 or 30 end of the adaptersite on the DNB and allow for hybridization and ligation of thecommon probes. The extended anchor scheme consists ofligation of a pair of oligo anchors (degenerate and standard) toexpand the hybridized anchor region 5 bases beyond the adaptersites in the DNB and into the sequencing region. This combi-natorial probe-anchor ligation (cPAL) method developed byComplete Genomics extends read lengths from 5 bases to 10bases and results in a total of 62 to 70 bases sequenced per DNB.A schematic demonstrating both the standard anchor schemeand the extended anchor scheme is shown in Figure 3B.

Each hybridization and ligation cycle is followed by fluorescentimaging of the DNB spotted chip and subsequently regenerationof the DNBs with a formamide solution. This cycle is repeateduntil the entire combinatorial library of probes and anchors isexamined. This formula of the use of unchained reads andregeneration of the sequencing fragment reduces reagent con-sumption and eliminates potential accumulation errors that canarise in other sequencing technologies that require close tocompletion of each sequencing reaction.19,52,53

Complete Genomics showcased their DNB array and cPALtechnology by resequencing three human genomes and reportedan average reagent cost of $4,400 per genome.50 The threegenome samples sequenced in this study (NA07022, NA19240,and NA20431) were then compared to previous sequencedata.54,55 The average coverage of these samples ranged from45X to 87X, and the percent of genome identified ranged from86% to 95%. While this technology greatly increases throughputcompared to Sanger/CE and second-generation sequencingtechnologies, there are several drawbacks to CompleteGenomics’approach. First, the construction of circular sequencing fragmentsresults in an underrepresentation of certain genome regions,which leads to partial genome assembly downstream. Also, thesize of the circular sequencing fragments (∼400 bases) as well asthe very short read lengths (∼10 bases) prevents complete andaccurate genome assembly, given that these fragments are shorterthan a number of the long repetitive regions.Just five months after Complete Genomics’ proof-of-concept

study was published, the first externally published application ofComplete Genomic’s sequencing technology was released. Agroup at the Institute for Systems Biology in Seattle, Washington(USA), studied the genetic differences in a human family offour.56 In the study, whole-genome sequencing was usedto determine four candidate genes responsible for two rareMendelian disorders: Miller syndrome and primary ciliary dys-kinesia. The subjects were a set of parents and their two childrenwho both suffer from the disorders. This study highlighted thebenefits of whole genome sequencing within a family whendetermining Mendelian traits. The ability to recognize inheri-tance patterns greatly reduced the genetic search space forrecessive disorders and increased the sequencing accuracy. Inthe end, sequencing the entire family instead of just the twosiblings affected by Miller syndrome and primary ciliary

Figure 2. Schematic of PacBio’s real-time single molecule sequencing. (A) The side view of a single ZMW nanostructure containing a single DNApolymerase (Φ29) bound to the bottom glass surface. The ZMW and the confocal imaging system allow fluorescence detection only at the bottomsurface of each ZMW. (B) Representation of fluorescently labeled nucleotide substrate incorporation on to a sequencing template. The correspondingtemporal fluorescence detection with respect to each of the five incorporation steps is shown below. Reprinted with permission from ref 39. Copyright2009 American Association for the Advancement of Science.

Figure 2. Schematic of PacBio’s real-time single molecule sequencing. (A) The side view of a single ZMW nanostructure containing a single DNA polymerase (Φ29) bound to the bottom glass surface. The ZMW and the confocal imaging system allow fluorescence detection only at the bottom surface of each ZMW. (B) Representation of fluorescently labeled nucleotide substrate incorporation on to a sequencing template. The corresponding temporal fluorescence detection with respect to each of the five incorporation steps is shown below.





Why Finish Genomes?

The Value of Finished Bacterial Genomes

Why Are Finished Genomes So Important?When Sanger sequencing was the only available sequencing technique, it was expensive — but not unusual — to improve genome drafts until they were good enough to be considered finished. With the availability of short-read sequencing technologies, draft genomes became cheap and easy to produce, and the majority of researchers skipped the more labor- and time-intensive task of finishing genomes, with the realization that critical data may be missing (Figure 3). Finished genomes are crucial for understanding microbes and advancing the field of microbiology3 because:

• Functional genomic studies demand a high-quality, complete genome sequence as a starting point

• Comparative genomics is meaningful only in terms of complete genome sequences

• Understanding genome organization provides biological insights

• Microbial forensics requires at least one complete reference genome sequence

• Finished genomes aid in microbial outbreak source identification and phylogenetic analysis

• A complete genome is a permanent scientific resource

Figure 3: History of drafted vs. finished genomes (adapted from ref. 2).

Microbial Genetics Using SMRT Sequencing

0

2000

4000

6000

8000

10000

12000

1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012

Num

ber o

f gen

omes

Drafted Bacterial Genomes

Finished Bacterial Genomes



Why Finish Genomes

JOURNAL OF BACTERIOLOGY, Dec. 2002, p. 6403–6405 Vol. 184, No. 230021-9193/02/$04.00!0 DOI: 10.1128/JB.184.23.6403–6405.2002Copyright © 2002, American Society for Microbiology. All Rights Reserved.

DIALOG

The Value of Complete Microbial Genome Sequencing(You Get What You Pay For)

Claire M. Fraser,* Jonathan A. Eisen, Karen E. Nelson, Ian T. Paulsen,and Steven L. Salzberg

The Institute for Genomic Research, Rockville, Maryland 20850

Since the publication of the complete Haemophilus influen-zae genome sequence in July 1995 (4), the field of microbiologyhas been one of the largest beneficiaries of the breakthroughsin genomics and computational biology that made this accom-plishment possible. When the 1.8-Mbp H. influenzae projectbegan in 1994, it was not certain that the whole-genome shot-gun sequencing strategy would succeed because it had neverbeen attempted on any piece of DNA larger than an averagelambda clone ("40 kbp) (9).

During the past 7 years, progress in DNA sequencing tech-nology, the design of new vectors for library construction foruse in shotgun sequencing projects, significant improvementsin closure and finishing strategies, and more sophisticated androbust methods for gene finding and annotation have dramat-ically reduced the time required for each stage of a genomeproject and the cost per base pair while at the same timeproducing a finished product of higher quality than was possi-ble just a few years ago. Today, the random sequencing phaseof a genome project, representing more than 99% of the ge-nome sequence, can easily be completed in just a few days at acost of approximately 3 to 4¢ per bp. The early release of suchdraft data is of benefit to the scientific community, and thereare numerous examples of how access to incomplete data hashad a significant impact in many areas of microbial research.

Comparable breakthroughs have also been achieved in clo-sure strategies in centers such as The Institute for GenomicResearch (TIGR) and the Pathogen Sequencing Unit at theSanger Centre, which routinely produce complete microbialgenome sequencing data, and closure and annotation can usu-ally be accomplished in a matter of a few months. The cost forgenerating a closed microbial genome sequence with the shot-gun approach has been reduced by an order of magnitude fromwhat it was in 1995 to approximately 8 to 9¢/bp today in largecenters that routinely handle large numbers of projects.

Given the advances in sequencing technologies, we weredismayed when the Department of Energy changed the strat-egy for its microbial genomic sequencing program in 1998 toone in which only high-coverage draft sequences for organismsof interest would be generated by the Joint Genome Institute.The rationale for such a change was that this would allow more

organisms to be sampled because of the cost savings that wouldcome from not taking each project to completion. While thisstrategy does achieve a cost savings, today it is only approxi-mately 50%, and this comes at a cost in terms of the quality andutility of the finished product.

A complete genome sequence represents a finished productin which the order and accuracy of every base pair have beenverified. In contrast, a draft sequence, even one of high cov-erage, represents a collection of contigs of various sizes, withunknown order and orientation, that contain sequencing errorsand possible misassemblies. As stated by Selkov et al. in a 2000paper on a draft sequence of Thiobacillus ferrooxidans, “It isclear that such sequencing. . .produces more errors than com-plete genome sequencing. . . . The current error rate is esti-mated to be 1 per 1,000 to 2,000 base pairs vs. 1 in 10,000 basepairs for complete sequencing” (10). In fact, the difference ismuch greater; recent studies show the error rate for completedmicrobial genomes to be closer to 1 in 100,000 (3). Anotherproblem associated with draft sequence data is library contam-ination with DNA from foreign sources that can represent 5 to10% of the total number of sequence reads for libraries pre-pared from DNA isolated from endosymbiotic and parasiticmicrobes that must be grown in animal cells. Until a genomeproject has been closed, it is often difficult to identify contam-inating sequences, and these can confound subsequent com-parative and functional genomics studies.

A retrospective analysis of 17 microbial genome sequencescompleted at TIGR during the past few years also revealedthat when these genome projects entered closure, the extent ofgenome completion and the accuracy of assembly varied sig-nificantly (I. Paulsen, unpublished data). For example, at eight-fold sequence coverage, the Thermotoga maritima genome wasrepresented by 98 contigs (#1 kb in size) and was missing only26 genes ("1.5% of the total) in the final annotation (7). Thiscontrasts with the Streptococcus pneumoniae genome of similarsize, whose initial assembly contained 265 contigs and wasmissing 115 genes ("6% of the total) (11). This differencelikely reflects the fact that the genomes of some microbes(gram-positive organisms, for example) are not well repre-sented in random DNA libraries. Some of the most interestingbiology may be encoded in the missing genes of each organism.Due to the larger percentage of repetitive DNA in the S.pneumoniae genome, many of the initial contigs containedmisassemblies that were revealed only during genome closure.Currently there is no method for assigning quality values to

* Corresponding author. Mailing address: The Institute forGenomic Research, 9712 Medical Center Dr., Rockville, MD 20850.Phone: (301) 838-3504. Fax: (301) 838-0209. E-mail: [email protected].

6403

on March 9, 2014 by guest

http://jb.asm.org/

Dow

nloaded from

http://jb.asm.org/content/184/23/6403.full



HGAP Assembly from PacBio

PacBio assembly CDC assembly

Illumina assemblySanger validation

HGAP Assembler for PacBio Data

http://www.pacificbiosciences.com/pdf/microbial_primer.pdf



Detecting Modified Bases

Page 2 www.pacb.com/basemod

White Paper Base Modifications

Methylation has gathered interest from researchers in a variety of disciplines including early developmental biology, cancer biology, and neurological disorders. Methylation and de-methylation help regulate gene expression and have been linked to several human diseases through mechanisms such as deactivating tumor suppressors or activating oncogenes5.

Other modifications, such as 6-mA in bacteria, have been studied with lower resolution methods – such as chromatography or through methylation’s protective effect against restriction endonucleases – because they are not easily accessible with standard sequencing techniques. This modification is associated with basic functions such as DNA replication and repair6. It is also common in protists and plants, and some studies suggest that it may also be present in mammalian DNA7.

SMRT sequencing is capable of detecting 6-mA as well as other common bacterial base modifications. As a result, the technology is expected to increase our understanding of a broad array of biological processes.

The potential benefits of detecting base modification, using SMRT sequencing, include:

x Single-base resolution detection of a wide

variety of base modifications (including those in Figure 1 and more).

x Single-molecule resolution over long-read distances.

x Unamplified double-stranded input DNA, which means that strand-specific modifications, such as hemimethylation, are detectable.

x Hypothesis-free base detection which allows discovery of unknown or unexpected modifications through the effects on sequencing kinetics (as described below).

Studying Polymerase Kinetics with SMRT® Sequencing

SMRT Sequencing allows the observation of single DNA polymerases reading individual molecules of DNA in real time. Therefore, the kinetic characteristics of DNA polymerization are observable on a single-molecule basis. The kinetic characteristics, such as the time duration between two successive base incorporations, are altered by the presence of a modified base in the DNA template3. This is observable as an increased space between fluorescence pulses, which is called the interpulse duration (IPD), as shown in Figure 2.

Figure 2. Principle of detecting modified DNA bases during SMRT sequencing. The presence of the modified base in the DNA template (top), shown here for 6-mA, results in a delayed incorporation of the corresponding T nucleotide, i.e. longer interpulse duration (IPD), compared to a control DNA template lacking the modification (bottom).3

http://www.pacificbiosciences.com/pdf/microbial_primer.pdf



This diagram shows a protein nanopore set in an electrically resistant membrane bilayer. An ionic current is passed through the nanopore by setting a voltage across this membrane. If an analyte passes through the pore or near its aperture, this event creates a characteristic disruption in current. By measuring that current it is possible to identify the molecule in question. For example, this system can be used to distinguish the four standard DNA bases and G, A, T and C, and also modified bases. It can be used to identify target proteins, small molecules, or to gain rich molecular information for example to distinguish the enantiomers of ibuprofen or molecular binding dynamics.

From Oxford Nanopores Web Site

Single Molecule III: Oxford Nanopores



• Figure6. BiologicalnanoporeschemeemployedbyOxfordNanopore.(A)SchematicofRHLproteinnanoporemutantdepictingthepositionsofthe cyclodextrin (at residue 135) and glutamines (at residue 139). (B) A detailed view of the β barrel of the mutant nanopore shows the locations of the arginines (at residue 113) and the cysteines. (C) Exonuclease sequencing: A processive enzyme is attached to the top of the nanopore to cleave single nucleotides from the target DNA strand and pass them through the nanopore. (D) A residual current-vs-time signal trace from an RHL protein nanopore that shows a clear discrimination between single bases (dGMP, dTMP, dAMP, and dCMP). (E) Strand sequencing: ssDNA is threaded through a protein nanopore and individual bases are identified, as the strand remains intact. Panels A, B, and D reprinted with permission from ref 91. Copyright 2009 Nature Publishing Group. Panels C and E reprinted with permission from Oxford Nanopore Technologies (Zoe McDougall).



detected across a metal oxide-silicon layered structure. Thevoltage signal is induced across the capacitor by the passage ofcharged nucleotides in longitudinal direction.79 A different read-out approach is optical detection (Figure 5B). A typical opticalrecognition of nucleotides is essentially executed in two steps.First, each base (A, C, G, or T) in the target sequence isconverted into a sequence of oligonucleotides, which are thenhybridized to two-color molecular beacons (with fluorophoresattached).80 Because the four nucleotides (A, C, G, or T) have tobe determined, the two fluorescent probes are coupled in pairs touniquely define each base. For example, if the two probes are Aand B, the four unique permutations will be AA, AB, BA, and BB.As the hybridized DNA strand is threaded through the nanopore,the fluorescent tag is stripped off from its quencher and an opticalsignal is detected. Both protein81 and solid-state nanopores canbe used.80,82 Detailed electronic measurement schemes andoptical readout methods have been reviewed in more detail inpreviously published papers.71,83,84

In a 2008 review article,71 Daniel Branton and colleaguesdiscussed the nanopore sequencing development and the pros-pect of low sample preparation cost at high throughput. Theyestimated that purified genomic DNA sufficient for sequencing(∼108 copies or 700 μg) could be extracted and purified fromblood at a cost of less than $40/sample using commercial kits. Allexisting sequencing techniques require breaking the DNA intosmall fragments of ∼100 bps and sequencing those chunksmultiple times to find overlapping regions, so that they can bereassembled together. Because one of most appealing advantagesof nanopores is achieving long read lengths, the genomic assemblyprocess should be considerably simplified. In practice, the readlength may be limited only by the DNA shearing that occursduring pipetting in the sample preparation step. For example,Meller and Branton85 demonstrated that 25 kb ssDNA could bethreaded through a biological nanopore and 5.4 kb ssDNA

translocated through a solid-state nanopore. In addition, severalgroups have shown very high throughput of small oligonucleo-tides (∼5.8 oligomers (s μM)!1)85 and native ssDNA anddsDNA (∼3!10 kb at 10!20 nM concentration).86,87

Protein Nanopore Sequencing.Oxford Nanopore Technol-ogies, formerly Oxford Nanolabs, together with leading academiccollaborators, has addressed some of the aforementioned tech-nological challenges and implemented the nanopore technologyin a commercial product (GridION system). Oxford Nanopore,founded by Prof. Hagan Bayley at University of Oxford, aimed atcommercializing the research work on biological nanoporescoming out of his laboratory. The company works in collabora-tion with Professors Daniel Branton, George Church, and JeneGolovchenko at Harvard; David Deamer and Mark Akeson atUCSC; and John Kasianowicz at NIST.Recently, Gordon Sanghera, CEO of Oxford Nanopore,

announced that the company is preparing to launch the GridIONsystem88 for direct single-molecule analysis, which would adoptexonuclease sequencing. The system is based on “lab on a chip”technology and integrates multiple electronic cartridges into arack-like device. A single protein nanopore is integrated in a lipidbilayer across the top of a microwell, equipped with electrodes.Multiple microwells are incorporated onto an array chip, andeach cartridge holds a single chip with integrated fluidics andelectronics for sample preparation, detection, and analysis. Thesample is introduced into the cartridge, which is then inserted inan instrument called a GridION node. Each node can be usedseparately or in a cluster, and all nodes communicate with eachother and with the user’s network and storage system in real time.Although the main application of the platform is sequencing ofDNA, it can be adapted (by proper modification of the RHLnanopore) for the detection of proteins and small molecules.Oxford Nanopore’s first-generation systems utilize the hepta-

meric protein R-hemolysin (RHL) (Figure 6A).72,89!91 RHL is

Figure 6. Biological nanopore scheme employed by Oxford Nanopore. (A) Schematic of RHL protein nanopore mutant depicting the positions of thecyclodextrin (at residue 135) and glutamines (at residue 139). (B) A detailed view of the β barrel of the mutant nanopore shows the locations ofthe arginines (at residue 113) and the cysteines. (C) Exonuclease sequencing: A processive enzyme is attached to the top of the nanopore to cleave singlenucleotides from the target DNA strand and pass them through the nanopore. (D) A residual current-vs-time signal trace from anRHLprotein nanoporethat shows a clear discrimination between single bases (dGMP, dTMP, dAMP, and dCMP). (E) Strand sequencing: ssDNA is threaded through aprotein nanopore and individual bases are identified, as the strand remains intact. Panels A, B, and D reprinted with permission from ref 91. Copyright2009 Nature Publishing Group. Panels C and E reprinted with permission from Oxford Nanopore Technologies (Zoe McDougall).

Figure6. BiologicalnanoporeschemeemployedbyOxfordNanopore.(A)SchematicofRHLproteinnanoporemutantdepictingthepositionsofthe cyclodextrin (at residue 135) and glutamines (at residue 139). (B) A detailed view of the β barrel of the mutant nanopore shows the locations of the arginines (at residue 113) and the cysteines. (C) Exonuclease sequencing: A processive enzyme is attached to the top of the nanopore to cleave single nucleotides from the target DNA strand and pass them through the nanopore. (D) A residual current-vs-time signal trace from an RHL protein nanopore that shows a clear discrimination between single bases (dGMP, dTMP, dAMP, and dCMP). (E) Strand sequencing: ssDNA is threaded through a protein nanopore and individual bases are identified, as the strand remains intact. Panels A, B, and D reprinted with permission from ref 91. Copyright 2009 Nature Publishing Group. Panels C and E reprinted with permission from Oxford Nanopore Technologies (Zoe McDougall).







would result, in theory, in detectably altered current flow throughthe pore. Theoretically, nanopores could also be designed tomeasure tunneling current across the pore as bases, each with adistinct tunneling potential, could be read. The nanopore ap-proach, while still in development, remains an interesting potentialfourth-generation technology. This “fourth-generation”moniker issuggested, since optical detection is eliminated along with therequirement for synchronous reagent wash steps.

Nanopore technologies may be broadly categorized into twotypes, biological and solid-state. The protein alpha hemolysin,which natively bridges cellular membranes causing lysis, was firstused as a model biological nanopore. The protein was insertedinto a biolayer membrane separating two chambers while sensi-tive electronics measured the blockade current, which changed asDNA molecules moved through the pore. However, chemicaland physical similarities between the four nucleotides made thesequence much more difficult to read than envisioned. Further,sufficient reduction of electronic noise remains a constantchallenge, which is achievable in part by slowing the rate ofDNA translocation. Recently, Oxford Nanopore and severalother academic groups working on this concept have madeprogress toward addressing these challenges.

The second class is based on the use of nanopores fabricatedmechanically in silicon or other derivative. The use of thesesynthetic nanopores alleviates the difficulties ofmembrane stabilityand protein positioning that accompany the biological nanoporesystem Oxford Nanopore has established. For example, Nabsyscreated a system using a silicon wafer drilled with nanopores usinga focused ion beam (FIB), which detects differences in blockadecurrent as single-strandedDNAboundwith specific primers passesthrough the pore. IBM created a more complex device that aimsto actively pause DNA translocation and interrogate each base fortunneling current during the pause step. The technology of bothof these nanopore types are presented in detail below.

John Kasianowicz and colleagues73 were the first to show thetranslocation of polynucleotides (poly[U]) through a Staphylo-coccal R-hemolysin (RHL) biological nanopore suspended in a

lipid bilayer, using ionic current blockage method. The authorspredicted that single nucleotides could be discriminated as longas: (1) each nucleotide produces a unique signal signature; (2)the nanopore possesses proper aperture geometry to accommo-date one nucleotide at a time; (3) the current measurements havesufficient resolution to detect the rate of strand translocation; (4)the fragment should translocate in a single direction whenpotential is applied; and (5) the nanopore/supporting mem-brane assembly should be sufficiently robust. All of the biologicaland synthetic nanopores have barrels of ∼5 nm (which isconsiderably longer than the base-to-base distance of 3.4 Å) inthickness and accommodate∼10!15 nucleotides at a time. It is,therefore, impossible to achieve single-base resolution usingblockage current measurements. In addition, the average rateat which a polymer typically translocates through a nanopore ison the order of 1 nucleotide/μs (i.e., on the order of MHzdetection), which is too fast to resolve. The nucleotide strandshould be slowed down to ∼1 nucleotide/ms to allow for a pA-current signal at 120!150 mV applied potential.74 Furthermore,the translocation of a polymer strand should be uniform betweentwo events. The time distribution of two processes (capture,entry, and translocation) is nonPoisson and often differs by anorder of magnitude. This means that two molecules pass througha nanopore at considerably different rates and the slower onecould be missed or misinterpreted. Andre Marziali and co-workers at UBC75,76 used force spectroscopy to study theseevents through single-molecule bond characterization. The non-uniform kinetics of DNA passage through an RHL nanopore isattributed to weak binding of DNA to amino acid residues of theprotein nanopore.77

Because of these challenges with ionic current measurements(the current created by the flow of ions through the nanopore),researchers have looked at other measurement schemes such asthe detection of tunneling current and capacitance changes(Figure 5A). In transverse tunneling current scheme, electrodesare positioned at the pore opening and the signal is detected fromsubnanometer probes.78 In capacitance measurements, voltage is

Figure 5. Nanopore DNA sequencing using electronic measurements and optical readout as detection methods. (A) In electronic nanopore schemes,signal is obtained through ionic current,73 tunneling current,78 and voltage difference79 measurements. Eachmethodmust produce a characteristic signalto differentiate the four DNAbases. Reprinted with permission from ref 83. Copyright 2008 Annual Reviews. (B) In the optical readout nanopore design,each nucleotide is converted to a preset oligonucleotide sequence and hybridized with labeledmarkers that are detected during translocation of the DNAfragment through the nanopore. Reprinted from ref 82. Copyright 2010 American Chemical Society.

Nanopore DNA sequencing using electronic measurements and optical readout as detection methods.(A)In electronic nanopore schemes, signal is obtained through ionic current,73 tunneling current, and voltage difference measurements. Each method must produce a characteristic signal to differentiate the four DNA bases. (B) In the optical readout nanopore design, each nucleotide is converted to a preset oligonucleotide sequence and hybridized with labeled markers that are detected during translocation of the DNA fragment through the nanopore.





Oxford Nanopores MinIon

“It’s kind of a cute device,” says Jaffe of the MinION, which is roughly the size and shape of a packet of chewing gum. “It has pretty lights and a fan that hums pleasantly, and plugs into a USB drive.” But his technical review is mixed. From http://www.nature.com/news/data-from-pocket-sized-genome-sequencer-unveiled-1.14724


http://www.nature.com/news/data-from-pocket-sized-genome-sequencer-unveiled-1.14724



Education

Evolution of DNA Sequencing - talk by Jonathan Eisen for the Bodega Workshop in Applied Phylogenetics