Genome Sequencing (Part 1) - Colorado State University, cs680/Slides/lecture4.pdf

Genome Sequencing (Part 1)

Lecture 4: August 30, 2012

Review from Last Lecture

De novo vs. Re-sequencing

•  De novo assembly (“from the beginning”) implies that you have no prior knowledge of the genome: no reference, no contigs, only reads.

•  Re-sequencing assembly assumes you have a copy of the reference genome (that has been verified to a certain degree).

•  The programs that work for re-sequencing will not work for de novo assembly, and vice versa. However, both can create copies of the genome.
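The difference between the two approaches can be sketched with a toy example. This is only an illustration under strong simplifying assumptions (exact matches, no sequencing errors, tiny inputs); the function names `map_read`, `overlap`, and `merge` are hypothetical, and real re-sequencing and de novo tools use indexed, error-tolerant algorithms rather than anything this simple.

```python
from typing import Optional

def map_read(reference: str, read: str) -> int:
    """Re-sequencing view: locate a read on a known reference (exact match only).

    Returns the 0-based position of the read, or -1 if it does not occur.
    """
    return reference.find(read)

def overlap(a: str, b: str, min_length: int = 3) -> int:
    """De novo view: length of the longest suffix of a that is a prefix of b."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def merge(a: str, b: str, min_length: int = 3) -> Optional[str]:
    """Join two reads into one longer sequence if they overlap, else None."""
    n = overlap(a, b, min_length)
    return a + b[n:] if n else None

# Illustrative sequences only.
reference = "ACGTAGAATACGTAGAAACAGATTAGAGAG"
print(map_read(reference, "GAATACGTAG"))          # needs the reference
print(merge("ACGTAGAATACG", "ATACGTAGAAACAG"))    # needs only the reads
```

The first call cannot work without a reference, and the second never looks at one, which is the sense in which the two kinds of program are not interchangeable.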

De novo vs. Re-sequencing

Sample Preparation

Fragments

Re-sequencing (LOCAS, SHRiMP) requires 15x to 30x coverage. With less, re-sequencing programs either produce no results or produce questionable ones.

Sample Preparation

Fragments

De novo assembly requires higher coverage: at least 30x, and up to 100x. Most de novo assemblers require paired-end data.
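To make the coverage figures concrete, here is a minimal sketch assuming the usual definition of average coverage as total sequenced bases divided by genome size. The function names and the example genome size and read length are illustrative assumptions, not values from the lecture.

```python
import math

def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size

def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)

# Hypothetical example: a 5 Mbp genome sequenced with 100 bp reads.
genome_size = 5_000_000
read_length = 100

for label, target in [("re-sequencing, low end", 15),
                      ("re-sequencing high end / de novo minimum", 30),
                      ("de novo, high end", 100)]:
    n = reads_needed(target, read_length, genome_size)
    print(f"{label}: {target}x needs {n:,} reads")
```

Note that this is an average: individual positions can be covered far more or far less often than the target depth.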

Introduction and History

Next Generation Sequencing (NGS)

[Figure, built up over several slides: the NGS pipeline. Sample Preparation produces Fragments, Sequencing produces Reads, Assembly produces Contigs, and Analysis follows.]

Reads:

ACGTAGAATCGACCATG
GGGACGTAGAATACGAC
ACGTAGAATACGTAGAA

Contigs:

ACGTAGAATACGTAGAAACAGATTAGAGAG…

Our focus for today’s lecture:
1.  Comparison of sequencing platforms
2.  Details of sample preparation
3.  Definitions and terminologies concerning data and sequencing platforms

Landmarks in Sequencing

| Efficiency (bp/person/year) | Year | Event |
|---|---|---|
| | 1870 | Miescher: Discovers DNA |
| | 1940 | Avery: Proposes DNA as “Genetic Material” |
| | 1953 | Watson & Crick: Double Helix Structure of DNA |
| 1 | 1965 | Holley: Sequenced transfer RNA from Yeast |
| 1,500 | 1977 | Maxam & Gilbert: “DNA sequencing by chemical degradation”; Sanger: “DNA sequencing with chain-terminating inhibitors” |
| | 1980 | Messing: DNA cloning |
| 15,000 | 1981 | Messing and his colleagues developed the “shotgun sequencing” method |
| 25,000 | 1986 | Hood et al.: Partial Automation |
| | 1987 | ABI markets the first sequencing platform, ABI 370 |

Landmarks in Sequencing

| Efficiency (bp/person/year) | Year | Event |
|---|---|---|
| 50,000 | 1990 | NIH begins large-scale sequencing trials of bacterial genomes. |
| 200,000 | 1995 | Craig Venter and Hamilton Smith at The Institute for Genomic Research (TIGR) published the first complete genome of a free-living organism in Science. This marks the first use of whole-genome shotgun sequencing, eliminating the need for initial mapping efforts. |
| | 2001 | A draft of the human genome was published in Science. |
| | 2001 | A draft of the human genome was published in Nature. |
| 50,000,000 | 2002 | 454 Life Sciences comes out with a pyrosequencing machine. |
| 100,000,000 | 2008 | Next generation sequencing machines arrive. |
| Huge | 2011 | Oxford Nanopore: 600 million base pairs per hour. |

Robert Holley and team in 1965

Watson and Crick

Messing: World’s most-cited scientist

Francis and Collins: Private Human Genome project.

Next-Gen Sequencing Platforms

454/Roche GS-20/FLX (2005)

PacBio RS (2009-2010) 3rd generation?

Illumina HiSeq (2007)

Comparison of NGS Platforms

| Technology | Reads per run | Average read length | bp per run | Types of errors |
|---|---|---|---|---|
| 454 (Roche) | 400,000 | 250-1000 bp | 70 million | Substitution |
| SOLiD (ABI) | 88-132 million | 35 bp | 1 billion | |
| Illumina HiSeq | 150 million | 100-200 bp | 15 billion | Substitution with exponential increase |
| PacBio | 45,000 | 1000-2000 bp | 45 million | Insertions and deletions |

Sequencing Methods and Terminology

Sanger Sequencing

•  The key principle of the Sanger method was the use of dideoxynucleotide triphosphates (ddNTPs) as DNA chain terminators.

•  These ddNTPs are also labeled, radioactively or with fluorescent dyes, for detection in automated sequencing machines.

•  Positives: longer reads (600 to 1000 bp).

•  Negatives: poor coverage (6x), expensive, inaccurate.

•  Still commonly used for small-scale sequencing.

Sanger Sequencing Video

Sanger Sequencing

[Figure sequence, built up over several slides: SHEAR DNA target sample. Clone each fragment many times. A primer and DNA polymerase then copy the template A C G T A, and strands of different lengths accumulate.]

Continue until all strands of DNA have undergone this reaction. If you choose the reagents correctly, you should have all possible A-terminated strands, resulting in sequences of varying lengths.

Sanger Sequencing

Sanger Sequencing

In the radioactive gel, the shorter DNA fragments migrate farther, toward the bottom, and the longer ones stay near the top. Afterward the sequence can be read off by going from bottom to top, that is, from the shortest fragment to the longest.
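The read-off step can be sketched in a few lines. This is a simplified model, assuming one termination reaction per base and perfect separation of bands by length; the function names `terminated_fragments` and `read_gel` are hypothetical. Each reaction yields one fragment for every position where its base occurs, and ordering all bands from shortest to longest fragment recovers the sequence.

```python
from typing import Dict, List

def terminated_fragments(strand: str) -> Dict[str, List[int]]:
    """For each base, the lengths of all fragments ending in that base's ddNTP.

    A fragment of length n appears in the reaction for base b whenever
    position n of the synthesized strand is b.
    """
    lanes: Dict[str, List[int]] = {base: [] for base in "ACGT"}
    for position, base in enumerate(strand, start=1):
        lanes[base].append(position)
    return lanes

def read_gel(lanes: Dict[str, List[int]]) -> str:
    """Read the sequence by visiting bands from shortest to longest fragment."""
    bands = [(length, base) for base, lengths in lanes.items() for length in lengths]
    return "".join(base for _, base in sorted(bands))

strand = "ACGTAGAATCGACCATG"   # one of the example reads from the slides
lanes = terminated_fragments(strand)
print(lanes["A"])              # lengths of all A-terminated fragments
print(read_gel(lanes))         # reconstructs the synthesized strand
```

Because every position produces exactly one band in exactly one lane, no two bands share a length, so the ordering is unambiguous in this idealized model.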

To recap…

•  Sanger Sequencing:

–  Run a PCR reaction in the presence of a bunch of ddNTPs, with each of the four bases dyed a different color.

–  Measure the length and color of the resulting fragments of DNA, and use that to work out the sequence.

•  Requires a lot of space and time: you need a place to run the reaction, and then you need a capillary tube or a gel to determine the length of the DNA.

–  You could only run perhaps a hundred of these reactions at any one time.

–  There are 3 billion base pairs of DNA in the human genome, meaning about 6 million 500-base pair fragments of DNA.
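The arithmetic behind these numbers is worth spelling out. The sketch below assumes a single pass over the genome with non-overlapping fragments (1x coverage) and takes "perhaps a hundred" reactions to mean exactly 100; both are simplifications.

```python
GENOME_SIZE_BP = 3_000_000_000   # human genome size quoted above
FRAGMENT_LENGTH_BP = 500         # fragment length quoted above
REACTIONS_AT_ONCE = 100          # assumption: "perhaps a hundred" taken as 100

fragments = GENOME_SIZE_BP // FRAGMENT_LENGTH_BP
batches = -(-fragments // REACTIONS_AT_ONCE)  # ceiling division

print(f"{fragments:,} fragments at 1x coverage")
print(f"{batches:,} batches of {REACTIONS_AT_ONCE} reactions")
```

Any real project needs several-fold coverage, so the number of batches scales up by the same factor.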


Celera Sequencing (2001)

•  300 ABI DNA sequencing platforms
•  50 production staff
•  20,000 square feet of wet lab space
•  1 million dollars / year for electrical service
•  10 million dollars in reagents

Total cost of human genome: 2.7 Billion dollars


Current cost of human genome: < $10,000

•  Second Generation sequencing techniques overcome the restrictions by finding ways to sequence the DNA without having to move it around.

•  You stick the bit of DNA you want to sequence in a little dot, called a cluster, and you do the sequencing there; as a result, you can pack many millions of clusters into one machine.

Second/Next Generation Sequencing

Sequencing a strand of DNA while keeping it held in place is tricky, and requires a lot of cleverness.

Illumina Sequencing Pipeline

1. Sample preparation (1-5 days)

ligate adapters

2. Cluster generation on flow cell

3. Sequencing and Imaging (1 week)

(1.5 days)

4. Analysis (days, months, years…)

Illumina Sequencing: Video

We multiply up the template strand, i.e. the bit of DNA that we are sequencing, and stick on a few bases of ‘adaptor sequence’; this sequence sticks to complementary bits of DNA stuck to a surface, which holds the DNA in place while we sequence it:

We then flood the DNA with RT-bases (reversible-terminator bases, each carrying a colored dye). We also add a polymerase enzyme, which incorporates the RT-base into the new strand that is complementary to the template strand:

We then wash away all the RT-bases, leaving just those that were incorporated into the new strand; we can read off what base this is by looking at the color of the dye:

•  Finally, we send in the cleavage enzyme, which cuts off the terminator region and the dye, leaving a normal base pair. We can then start again to sequence the next base pair.

•  In a single Illumina machine we have hundreds of millions of these clusters; cameras look at all of these dots and record how they change color over time, allowing you to determine the sequence of bases of millions of bits of DNA at once.

•  The sequencing method is actually pretty inefficient; however, the machine is capable of sequencing millions of fragments of DNA at once.
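The cycle described above (incorporate one RT-base, wash, image, cleave, repeat) can be sketched as a toy simulation. This is a deliberately idealized model: every cluster incorporates exactly one base per cycle with no errors, strand orientation is ignored, and the dye colors in `DYE_COLOR` are hypothetical placeholders rather than the real chemistry's assignment.

```python
from typing import List

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Hypothetical dye colors; the real assignment depends on the chemistry used.
DYE_COLOR = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_FOR_COLOR = {color: base for base, color in DYE_COLOR.items()}

def sequence_cluster(template: str, cycles: int) -> str:
    """Simulate one cluster: one incorporated base and one color call per cycle."""
    calls = []
    for cycle in range(min(cycles, len(template))):
        # Flood with RT-bases: the polymerase incorporates the complement of
        # the next template base, and the terminator blocks further extension.
        incorporated = COMPLEMENT[template[cycle]]
        # Wash, then image: the camera sees only the dye color.
        observed_color = DYE_COLOR[incorporated]
        calls.append(BASE_FOR_COLOR[observed_color])
        # Cleaving the terminator and dye lets the next cycle proceed.
    return "".join(calls)

def sequence_flow_cell(templates: List[str], cycles: int) -> List[str]:
    """All clusters advance in lockstep, one base per cycle."""
    return [sequence_cluster(t, cycles) for t in templates]

templates = ["ACGTAGAATCGACCATG", "GGGACGTAGAATACGAC", "ACGTAGAATACGTAGAA"]
for read in sequence_flow_cell(templates, cycles=10):
    print(read)
```

The sketch also shows why the per-base chemistry can be slow without hurting throughput: the number of cycles is fixed by the read length, while the number of clusters processed per cycle is limited only by how many fit on the surface.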

Inside the Illumina Machine


Flow Cell Imaging

A flow cell contains 8 lanes

Each lane contains three columns of tiles

Each column contains 100 tiles

20K to 30K clusters

Each tile is imaged four times per cycle, which is one image per base
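Multiplying these figures out gives a feel for the imaging workload. The sketch below uses the lane, column, and tile counts as stated; reading the "20K to 30K clusters" figure as a per-tile count is an assumption, as is the 100-cycle run used in the last line.

```python
LANES = 8
COLUMNS_PER_LANE = 3
TILES_PER_COLUMN = 100
IMAGES_PER_TILE_PER_CYCLE = 4          # one image per base
CLUSTERS_PER_TILE = (20_000, 30_000)   # assumption: the quoted range is per tile

tiles = LANES * COLUMNS_PER_LANE * TILES_PER_COLUMN
images_per_cycle = tiles * IMAGES_PER_TILE_PER_CYCLE
clusters_low, clusters_high = (tiles * c for c in CLUSTERS_PER_TILE)

def images_for_run(cycles: int) -> int:
    """Total images for a run with the given number of cycles (read length)."""
    return images_per_cycle * cycles

print(f"tiles per flow cell: {tiles:,}")
print(f"images per cycle: {images_per_cycle:,}")
print(f"clusters per flow cell: {clusters_low:,} to {clusters_high:,}")
print(f"images for a 100-cycle run: {images_for_run(100):,}")
```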

Conclusions

Sample Preparation

Sequencing

Assembly

Analysis

Fragments

Reads

Contigs

“…the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data.”

- New York Times, November 30, 2011

What challenges are left?

•  The amount of data is starting to overwhelm biologists, and data analysis people (aka bioinformatics people) are in more demand

•  Personal health care is changing (already); e.g. 23andMe, Sequenom

•  Data acquisition is still difficult in some cases, and advancements are needed in this area

•  Still cannot sequence some genomes

What will happen in your lifetime?

“We used to think that our fate was in our stars. Now we know that, in large measure, our fate is in our genes.” - James Watson

•  You’ll be able to sequence your genome and know the implications of your genotype

•  Medical diagnosis will change

•  Plant and crop production will be affected

•  We’ll have an improved knowledge of ancestry of ourselves and other species
