Upload
tranhanh
View
214
Download
1
Embed Size (px)
Citation preview
Genome Sequencing (Part 1)
Lecture 4: August 30, 2012
Review from Last Lecture
De novo vs. Re-‐sequencing • De novo assembly (“from the beginning”) implies that you have no prior knowledge of the genome. No reference, no conNgs, only reads.
• Re-‐sequencing assembly assumes you have a copy of the reference genome (that has been verified to a certain degree).
• The programs that work for re-‐sequencing will not work for de novo and vice versa. However, both can create copies of the genome.
De novo vs. Re-‐sequencing
Sample PreparaNon
Fragments
Re-sequencing (LOCAS, Shrimp) requires 15x to 30x coverage. Anything less and re-sequencing programs will not produce results or produce questionable results.
Sample PreparaNon
Fragments
De-novo assembly requires higher coverage. At least 30x but upwards to 100x’s coverage. Most de novo assemblers require paired-end data.
IntroducNon and History
Sample PreparaNon
Sample PreparaNon
Fragments
Sample PreparaNon
Sequencing
ACGTAGAATCGACCATG
GGGACGTAGAATACGAC
ACGTAGAATACGTAGAA
Reads
Fragments
Next GeneraNon Sequencing (NGS)
Sample PreparaNon
Sequencing
Assembly
ACGTAGAATACGTAGAAACAGATTAGAGAG…
ConNgs
Fragments
Reads
ACGTAGAATCGACCATG GGGACGTAGAATACGAC
ACGTAGAATACGTAGAA
Sample PreparaNon
Sequencing
Assembly
Analysis
Fragments
Reads
ConNgs
Sample PreparaNon
Sequencing
Assembly
Analysis
Fragments
Reads
ConNgs
Our focus for today’s lecture: 1. Comparison of sequencing
plaXorms 2. Details of sample preparaNon 3. DefiniNons and terminologies
concerning data and sequencing plaXorms
Landmarks in Sequencing Efficiency (bp/person/year)
Year Event
1870 Miescher: Discovers DNA
1940 Avery: Proposes DNA as “GeneNc Material”
1953 Watson & Crick: Double Helix Structure of DNA
1 1965 Holley: Sequenced transfer RNA from Yeast
1,500 1977 Maxam & Gilbert: "DNA sequencing by chemical degradaNon” Sanger: “DNA sequencing with chain-‐terminaNng inhibitors”
1980 Messing: DNA cloning
15,000 1981 Messing: Messing and his colleagues developed “shotgun sequencing” method
25,000 1986 Hood et al.: ParNal AutomaNon
1987 ABI markets the first sequencing plaXorm, ABI 370
Landmarks in Sequencing Efficiency (bp/person/year)
Year Event
50,000 1990 NIH begins large-‐scale sequencing trials of bacteria genomes.
200,000 1995 Craig Venture and Hamilton Smith at the InsNtute for Genomic Research (TIGR) published the first complete genome of a free-‐living organism in Science. This marks the first use of whole-‐genome shotgun sequencing, eliminaNng the need for iniNal mapping efforts.
2001 A drai of the human genome was published in Science.
2001 A drai of the human genome was published in Nature.
50,000,000 2002 454 Life Sciences comes out with a pyrosequencing machine.
100,000,000 2008 Next generaNon sequencing machines arrive.
Huge 2011 Oxford Nanopore: 600 Million base pairs per hour.
Robert Holley and team in 1965
Watson and Crick
Messing: World’s most-‐cited scienNst
Francis and Collins: Private Human Genome project.
Next-‐Gen Sequencing PlaXorms
454/Roche GS-‐20/FLX (2005)
PacBio RS (2009-‐2010) 3rd generaNon?
Illumina HISeq (2007)
Comparison of NGS PlaXorms
Technology Reads per run Average Read Length
bp per run Types of errors
454 (Roche) 400,000 250-‐1000bp 70 Million SubsNtuNon
SoLID (ABI) 88-‐132 Million 35bp 1 Billion
Illumina HISeq 150 Million 100 – 200bp 15 Billion SubsNtuNon with exponenNal increase
PacBio 45,000 1000-‐2000bp 45 Million InserNons and deleNons
\
Sequencing Methods and Terminology
Sanger Sequencing
• The key principle of the Sanger method was the dideoxynucleoNde triphosphates (ddNTPs) as DNA chain terminators.
• These ddNTPs will also be radioacNvely for detecNon in automated sequencing machines.
• PosiNves: longer reads (600 to 1000 bp). • NegaNves: poor coverage (6x), expensive, inaccurate.
• SNll commonly used for small scale sequencing.
Sanger Sequencing Video
Sanger Sequencing SHEAR DNA target sample
Sanger Sequencing SHEAR DNA target sample
A A A A
C G T
C G T
C G T
C G T
Close each fragment many times.
Sanger Sequencing
28
SHEAR DNA target sample
A A A A
C G T
C G T
C G T
C G T A C
G
T
Sanger Sequencing
A C G T A
DNA polymerase Primer
Sanger Sequencing
A C G T A
DNA polymerase Primer
Primer
DNA polymerase
A C G T A
Sanger Sequencing
A C G T A
Primer
DNA polymerase
A
A
A
G
G
C
C
C T
C T A
C T
G
Sanger Sequencing
A C G T A
Primer
DNA polymerase
A
A
A
G
G
C
C
C T
C T A
C T
G
G
Sanger Sequencing
A C G T A
Primer A
A
A
G
G
C
C
C T
C T A
C T
C G
G
Sanger Sequencing
A C G T A
Primer A
A
A
G
G
C
C
C T
C T A
C T
G A
C G
G
Sanger Sequencing
A C G T A
Primer A
A
A
G
G
C
C
C T
C T A
C T
T
Sanger Sequencing
Primer A
A
A
G
G
C
C
C T
C T A
C T
Sanger Sequencing
Primer A
A
A
G
G
C
C
C T
C T A
C T
ConNnue unNl all strands of DNA have undergone this reacNon. If you choose the reagents correctly then you should have all possible A-‐terminated strands; resulNng in sequences of varying lengths.
Sanger Sequencing
Sanger Sequencing
In the radioacNve gel, the longer DNA fragments move to the bopom and the shorter ones move to the top. Aierward the sequence can be read off by going from top to bopom.
To recap… • Sanger Sequencing:
– Run a PCR reacNon in the presence of a bunch of ddNTPs, with each different base pair dyed a different color.
– Measure the length and color of the resulNng fragments of DNA, and use that to work out the sequence.
• Requires a lot of space and Nme: you need a place to run the reacNon, and then you need a capillary tube or a gel to determine the length of the DNA. – You could only run perhaps a hundred of these reacNons at any one Nme.
– There are 3 billion base pairs of DNA in the human genome, meaning about 6 million 500-‐base pair fragments of DNA.
40
Celera Sequencing (2001)
• 300 ABI DNA sequencing plaXorms • 50 producNon staff • 20,000 square feet of wet lab space • 1 million dollars / year for electrical service • 10 million dollars in reagents
Total cost of human genome: 2.7 Billion dollars
Celera Sequencing (2001)
• 300 ABI DNA sequencing plaXorms • 50 producNon staff • 20,000 square feet of wet lab space • 1 million dollars / year for electrical service • 10 million dollars in reagents
Current cost of human genome: < 10,000 $
• Second GeneraNon sequencing techniques overcome the restricNons by finding ways to sequence the DNA without having to move it around.
• You sNck the bit of DNA you want to sequence in a liple dot, called a cluster, and you do the sequencing there; as a result, you can pack many millions of clusters into one machine.
Second/Next GeneraNon Sequencing
Sequencing a strand of DNA while keeping it held in place is tricky, and requires a lot of cleverness.
Illumina Sequencing Pipeline 1. Sample preparaNon (1-‐5 days)
ligate adapters
2. Cluster generaNon on flow cell
3. Sequencing and Imaging (1 week)
(1.5 days)
4. Analysis (days, months, years…)
Illumina Sequencing: Video
We mulNply up the template stand, i.e. the bit of DNA that we are sequencing, and sNck on a few bases of ‘adaptor sequence’; this sequence sNcks on to complementary bits of DNA stuck to a surface, which holds the DNA in place while we sequence it:
We then flood the DNA with RT-‐bases. We also add a polymerase enzyme, which incorporates the RT-‐base into the new strand that is complementary to the template strand:
We then wash away all the RT-‐bases, leaving just those that were incorporated into the new strand; we can read off what base this is by looking at the color of the dye:
• Finally, we send in the cleavage enzyme, which cuts off the terminator region and the dye, leaving a normal base pair. We can then start again to sequence the next base pair.
• In a single Illumina machine we have hundreds of millions of these clusters; cameras look at all of these dots and record how they change color over Nme, allowing you to determine the sequence of bases of millions of bits of DNA at once.
• Sequencing method is actually prepy inefficient, however, the machine is capable of sequencing millions of fragments of DNA at once.
Inside the Illumina Machine
51
Flow Cell Imaging
A flow cell contains 8 lanes
Each lane contains three columns of Nles
Each column contains 100 Nles
20K to 30K clusters Each Nle is imaged four Nmes per cycle, which is one image per base
Conclusions
Sample PreparaNon
Sequencing
Assembly
Analysis
Fragments
Reads
ConNgs
“…the ability to determine DNA sequences is starNng to outrun the ability of researchers to store, transmit and especially to analyze the data.”
-‐ New York Times, November 30, 2011
What challenges are lei?
• Amount of data is starNng to be overwhelm biologists and data analysis people (aka bioinformaNcs people) are in more demand
• Personal health care is changing (already); i.e. 23andme, sequenom
• Data acquisiNon is sNll difficult in some case and advancements are needed in this area
• SNll cannot sequence some genomes
What will happen in your lifeNme?
“We used to think that our fate was in our stars. Now we know that, in large measure, our fate is in
our genes.” -‐Francis Crick
• You’ll be able to sequence your genome and know the implicaNons of your genotype
• Medical diagnosis will change • Plant and crop producNon will be affected • We’ll have an improved knowledge of ancestry of ourselves and other species