Chapter 7
DNA Sequencing and the Evolution of the "Omics"
Sequencing
Important component of many projects: there are several "generations" of sequencing methods
Data produced –> extensive databases
Two methods were developed in 1977 - First generation methods: 1) Sanger 2) Maxam and Gilbert
Both use same basic approaches
Sanger still used, but relatively expensive compared to second- and third-generation methods
Each generation type has pros and cons
First-Generation Sequencing
Computer tools needed to analyze data
DNA and RNA can be sequenced
cDNA –> less information: WHY?
Length that can be sequenced in a single first-generation method (Sanger) reaction varies from 200 to ~ 1000 bases
Vectors can hold different insert sizes
First-Generation Sequencing
Yeast artificial chromosomes = up to 1 million bp inserts
Cosmids can contain 30,000 to 45,000 bp inserts (30-45 kb)
Cloned DNA thus must usually be subcloned for Sanger sequencing
Dideoxy or Chain-terminating Method
Developed by Sanger
Involves de novo synthesis of labeled DNA fragments from a ss DNA template
Denature ds DNA by heating, or clone it into a vector that produces ss DNA
SS DNA serves as template for synthesis of a new labeled DNA strand
Dideoxy or Chain-terminating Method
Labeling originally by 32P or 35S
Because DNA synthesis is involved, DNA polymerase, labeled dNTPs, a primer and ddNTPs are needed in Sanger sequencing
Structure of labeled dNTP with 35S
Dideoxy or Chain-terminating Method: Sanger Sequencing
Dideoxy or Chain-terminating Method
Fluorescent labeling is now used, and fragments are separated by capillary electrophoresis rather than on slab gels.
Acrylamide gel photo of radiolabeled DNA
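The chain-termination idea can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: every possible termination product is assumed to be present, the primer is ignored, and the function names are hypothetical. Each ddNTP "lane" collects the lengths of fragments ending in that base, and reading fragments from shortest to longest recovers the newly synthesized strand.

```python
# Toy model of dideoxy (chain-termination) sequencing.
# Assumptions: all termination products exist; the primer is ignored.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def synthesized_strand(template):
    """New strand (5'->3') copied from a single-stranded template (given 5'->3')."""
    return "".join(COMPLEMENT[base] for base in reversed(template))

def termination_lengths(template):
    """For each ddNTP reaction, list the lengths of fragments ending in that base."""
    lanes = {"A": [], "C": [], "G": [], "T": []}
    for length, base in enumerate(synthesized_strand(template), start=1):
        lanes[base].append(length)  # a ddNTP of this base stops synthesis here
    return lanes

def read_ladder(lanes):
    """Read the sequence by ordering fragments from shortest to longest."""
    base_at_length = {n: base for base, lengths in lanes.items() for n in lengths}
    return "".join(base_at_length[n] for n in sorted(base_at_length))

lanes = termination_lengths("GATTACA")
print(lanes)
print(read_ladder(lanes))
```

The sequence read from the ladder is that of the synthesized strand, which is the reverse complement of the template.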
Maxam-Gilbert Method: Another First-generation method
Uses chemical reagents to generate base-specific cleavages of DNA
Less used today because chemicals are toxic and method is labor intensive
Advantage: sequences are obtained from original DNA molecule (not synthesized copy)
Maxam-Gilbert Method
Need: pure DNA cut by restriction endonucleases –> fragments of a specific length and with known sequences at one end
Each fragment is then labeled at one end with a 32P-phosphate group so that 4 reactions can be carried out
Maxam-Gilbert Method
Result is a set of end-labeled fragments of different lengths that show up as a ladder of bands on a gel
Each reaction is limited so not all Gs (or Ts, Cs, or As) are modified in a reaction
The 4 different reaction samples were run side by side on a sequencing gel and visualized by autoradiography
Maxam and Gilbert Chemical Cleavage Method
Shotgun Strategy
DNA is digested with a restriction enzyme and subfragments are cloned and sequenced
A computer is used to determine how fragments overlap to form contiguous sequences (CONTIGS) –> original sequence
Shotgun method may underrepresent some fragments, so sequencing must be redundant
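How a computer might merge overlapping fragments can be sketched with a greedy overlap assembler. This is a minimal illustration, not the method used by any real assembler: it assumes error-free reads from one strand, no repeats, and hypothetical function names, and it repeatedly merges the pair of reads with the longest suffix-prefix overlap.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise by longest overlap until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing overlaps: the remaining reads stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

contigs = greedy_assemble(["ACGTTAGC", "ATGGCGTA", "GCGTACGT"])
print(contigs)
```

If a region is missing from the reads, the loop stops with more than one contig, which is why redundant sequencing is needed.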
Sequencing Strategies
Directed
Random
Sequencing by the PCR
PCR allows a specific segment of DNA to be amplified a million-fold or more
Fragments can be amplified from genomic DNA or RNA (cDNA)
REQUIRES primers to flank region to be amplified
Several methods developed to sequence PCR products
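The "million-fold" figure follows from the doubling arithmetic of PCR. The sketch below assumes an idealized reaction in which each cycle multiplies the product by (1 + efficiency); the function names and the efficiency parameter are illustrative assumptions, and real reactions plateau in later cycles.

```python
import math

def fold_amplification(cycles, efficiency=1.0):
    """Idealized fold increase after a number of cycles (efficiency 1.0 = perfect doubling)."""
    return (1.0 + efficiency) ** cycles

def cycles_needed(target_fold, efficiency=1.0):
    """Smallest whole number of cycles that reaches the target fold increase."""
    return math.ceil(math.log(target_fold) / math.log(1.0 + efficiency))

print(fold_amplification(20))      # perfect doubling for 20 cycles
print(cycles_needed(1e6))          # cycles for a million-fold increase
print(cycles_needed(1e6, 0.9))     # same target at 90% efficiency per cycle
```

With perfect doubling, about 20 cycles give a million-fold amplification; lower efficiency requires a few more cycles.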
Automated DNA Sequencers
Fluorescent labeling made early large-scale genome projects feasible (slab gels and radiolabeling no longer needed)
Steps: DNA isolation, cloning, PCR, prepare sequencing reactions, purify DNA, separate and detect labeled DNA fragments
Many large-scale facilities used a random shotgun phase and a directed finishing phase to complete genome analysis
Automated DNA Sequencers
Automation has reduced costs of Sanger sequencing: $0.20 to $0.30 per base if accuracy is less than one error in 10,000 bases
During 2000, the Drosophila genome sequence was published and a working draft of the human genome was announced; both relied on Sanger sequencing
Sequencing Data
Huge amounts of data
Computers are essential for assembly of sequences and analysis
Sequences can be searched to discover
tandem repeats and inverted repeats
open reading frames (ORFs)
similarities with other DNA sequences in databases
introns and exons
transposable elements
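One of these searches, finding open reading frames, can be sketched in a few lines. This is a simplified illustration with hypothetical names: it scans only the three forward reading frames, treats ATG as the only start codon, and ignores introns, so it is closer to a prokaryotic ORF scan than to eukaryotic gene prediction.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based; end is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

sequence = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(sequence):
    print(frame, sequence[start:end])
```

A fuller scan would also search the reverse complement to cover all six reading frames.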
Homology
Controversial term: multiple definitions
Usually assumes organisms have descended from a common ancestor
Percent homology NOT correct unless descent is involved
Better to use percentage SIMILARITY
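Percentage similarity is a purely descriptive number that can be computed without any claim about descent. A minimal sketch for two already-aligned nucleotide sequences is shown below; it reports percent identity (the simplest similarity measure), assumes gaps are written as "-", and uses a hypothetical function name. Protein similarity would normally also use a substitution matrix, which is not shown.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned columns with identical bases; gap-only columns are skipped."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap and y == gap:
            continue  # column carries no information
        compared += 1
        if x == y:
            matches += 1  # a gap opposite a base counts as a mismatch
    return 100.0 * matches / compared if compared else 0.0

print(round(percent_identity("ATGC-TA", "ATGCGTT"), 1))
```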
DNA Sequence Data Banks
DNA Data Bank of Japan (DDBJ)
European Molecular Biology Laboratory Nucleotide Sequence Data Library (EMBL)
GenBank Genetic Sequence Data Bank at NCBI
Database subsets: mt, promoters, proteins, genomes, introns, restriction endonucleases etc.
Drosophila Genome Project
Part of the Human Genome Project
Both controversial: 1st "big" biology project
Time
Resources (individuals vs. teams)
Necessary? (substantial amount already known about D. melanogaster's genome)
Drosophila Genome Project
Drosophila genome: 180 Mb, 4 pairs of chromosomes (X/Y and 3 pairs of autosomes)
One-third is heterochromatin
Cytogenetic map available (5000 bands)
Approximately 3800 genes already mapped
In situ hybridization of salivary gland chromosomes –> 3000 transcription units
1300 genes already cloned & sequenced
Drosophila Polytene Chromosomes
Light micrograph of stained salivary gland chromosomes: X, 2, 3 and 4 are joined at centromeres (within circle)
Cytogenetic Map on Polytene X
Fine Map of Drosophila salivary gland X chromosome
Original Drosophila Genome Project
Several steps
Physical map as a basis for sequencing and detailed functional studies: overlapping clones for which info is available on sequences at ends and location on chromosomes
Feasibility studies for large-scale sequencing, focusing on regions of great biological interest (3 megabases of contiguous sequences within 3 yrs)
Develop bioinformatic techniques to identify coding sequences and analyze data
ACTUAL Project
Completed more quickly and by a different strategy
D. melanogaster second multicellular organism after C. elegans to have complete genome sequenced
Initial project initiated in 1990, and only partly completed in 1996 when Venter et al. proposed using the "shotgun" method
Shotgun cloning had never been attempted with such a complex genome: Venter et al. used MASSIVE Sanger sequencing methods
Celera Sequencing and Analysis
ACTUAL Drosophila Genome Project
Began in May 1999 at Celera: completed by late fall of 1999 !!!!!
This created controversy and discord between Venter and the Drosophila genome consortium
Published in Science in March 2000
Major milestone for insect molecular genetics
Joint endeavor with Berkeley Drosophila project
Drosophila Genome Project
Sequencing only first step: what are the sequences (coding, noncoding, introns, exons, centromeres, telomeres) and what do the "genes" do?
Genome analysis called "annotating", uses different methods
Accuracy of different methods assessed by GASP, Genome Annotation Assessment Project, using a well studied region
GASP
Coding regions better identified (95% success)
Correct intron / exon structures (40%)
Half of genes recognized and assigned functions by homology with known genes
Promoter sequences highly inaccurate < 1/3 correctly identified
Annotation methods have improved since then
Surprises
13,600 genes identified initially
Fewer than in C. elegans
Expected number : 30,000
Overlapping genes may lead to higher count
Many genes left to be studied (despite previous efforts)
Surprises
Drosophila surprisingly relevant to study of genes and pathways in tumor formation and development in humans
At least 76 Drosophila genes homologous to mammalian cancer genes
Furthermore, 178 (62%) of known human disease genes appear conserved, including genes causing Alzheimer's disease, Huntington's, Duchenne muscular dystrophy, Parkinson's
Bioinformatics
Current analysis methods are not completely accurate in identifying structural and functional genes
What is a gene? Sequences encoding a protein, encoding RNA, producing a phenotype?
Computer skills are improving
Isochores –> 300-kb segments that are homogeneous on basis of GC frequencies usually are rich in genes
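GC-based segmentation of this kind starts from windowed GC frequencies. The sketch below computes the GC fraction in fixed windows along a sequence; the function names are hypothetical, the toy example uses a 4-base window purely for readability, and an isochore-scale analysis would use windows on the order of the 300 kb mentioned above.

```python
def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def windowed_gc(seq, window, step=None):
    """List of (window start, GC fraction) for full-length windows along seq."""
    step = step or window  # non-overlapping windows by default
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]

for start, gc in windowed_gc("GGCCAATTGCGC", window=4):
    print(start, gc)
```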
Bioinformatics
Splice sites and junctions (introns and exons) can be difficult to identify
Start and stop codons can be useful in predicting exons but reading frame must be known
As more genes are identified in other organisms, homologous genes can be found in Drosophila and other insects
It is getting easier
However, it assumes that similar sequences = similar function, which MAY NOT be true
Next-Generation Methods
Newer sequencing methods have revolutionized genetics
Second-generation or Next-generation methods were developed
Allow high-throughput
Less expensive than Sanger sequencing
Limitations: sequences produced are shorter than those from Sanger sequencing, so ASSEMBLY can be more difficult
Next-Generation Methods
Platform            Sequencing by synthesis    Read length
Roche 454           Pyrosequencing             500-100 bp
Illumina (Solexa)   Reversible terminators     20-40 bp
SOLiD               Ligase                     35 bp
Polonator           Ligase                     13 bp
HeliScope           Polymerase                 30 bp
Next-Generation Methods
Roche (454) was first NextGen platform
Called pyrosequencing because it involves incubating DNA-bearing beads with Bacillus stearothermophilus DNA polymerase, ss binding protein, ATP sulfurylase and luciferase
When incorporation of a nucleotide occurs pyrophosphate is released; ATP sulfurylase converts it to ATP, which drives luciferase to produce a burst of light detectable by a recording device
This method cannot accurately sequence long runs of the same base (homopolymers)
Cost: 1000 bp for $0.05
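Why homopolymers are a problem becomes clearer from an idealized flowgram. In the sketch below, nucleotides are flowed one at a time in a fixed cycle and the light signal in each flow equals the number of bases incorporated. The flow order "TACG", the function name, and the noise-free signal are assumptions made for illustration only.

```python
from itertools import groupby

def flowgram(read, flow_order="TACG", max_flows=400):
    """Ideal signal per nucleotide flow: the number of bases incorporated in that flow."""
    runs = [(base, len(list(group))) for base, group in groupby(read.upper())]
    signals = []
    run_index = 0
    flow = 0
    while run_index < len(runs) and flow < max_flows:
        nucleotide = flow_order[flow % len(flow_order)]
        base, run_length = runs[run_index]
        if base == nucleotide:
            signals.append((nucleotide, run_length))  # whole run extends in one flow
            run_index += 1
        else:
            signals.append((nucleotide, 0))  # no incorporation, no light
        flow += 1
    return signals

print(flowgram("TTACCCG"))
```

A run of identical bases is incorporated in a single flow, so its length must be inferred from signal intensity alone; with real, noisy signals, telling a run of 7 from a run of 8 is error-prone.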
Next-Generation Methods
Illumina (Solexa) uses bridge PCR to amplify sequences
Fluorescent labels identify each nucleotide
Read lengths ca. 20-100 nt
Cost: 1000 bp for $0.002
Next-Generation Methods
Applied Biosystems SOLiD sequencer
Results in reads of ca. 35 bp
Read lengths ca. 20-100 nt
Cost: 1000 bp for $0.002
[If you want to learn details of the biochemistry of these methods, go to the company websites]
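The per-base costs quoted in these slides can be put side by side with a little arithmetic. The sketch below applies them to a 180 Mb genome (the Drosophila genome size given earlier) at a chosen coverage. It is only a rough comparison: it assumes all quoted prices are per raw base sequenced, which the slides do not state, and it ignores library preparation, assembly, and annotation costs. The names are hypothetical.

```python
# Cost per base in US dollars, taken from the figures quoted in the slides.
COST_PER_BASE = {
    "Sanger (low)": 0.20,
    "Sanger (high)": 0.30,
    "Roche 454": 0.05 / 1000,
    "Illumina": 0.002 / 1000,
    "SOLiD": 0.002 / 1000,
}

def sequencing_cost(genome_size_bp, coverage, cost_per_base):
    """Raw sequencing cost for a genome at a given fold coverage."""
    return genome_size_bp * coverage * cost_per_base

GENOME_SIZE_BP = 180_000_000  # 180 Mb
for platform, cost in COST_PER_BASE.items():
    total = sequencing_cost(GENOME_SIZE_BP, coverage=1, cost_per_base=cost)
    print(f"{platform}: ${total:,.0f} per 1x coverage")
```

Because short reads need higher coverage for assembly, the coverage argument should be raised for the next-generation platforms when making a fairer comparison.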
Third-Generation Methods
Sequencing methods continue to be developed so that sequencing is faster, cheaper and provides longer reads
The ultimate goal is to produce the $1000 genome for humans (which will provide less expensive insect genomes, as well)
Several methods: Ion Torrent, Single Molecule Real Time Sequencer, Nanopore Sequencing
Bioinformatics
Methods of analyzing sequenced genomes continue to improve
Genome annotation is challenging: The $1000 genome can result in the $100,000 analysis (or more)
With shorter read lengths, assembly (ordering reads into scaffolds of longer and longer length) is more difficult
Many people sequencing arthropod genomes have limited bioinformatics training, but a "point-and-click" process is not yet available for such novices
Bioinformatics
The amount of data being produced is stressing the system
A single sequencing run can produce as much data as did entire genome centers a few years ago
Gene ontology (GO) is a bioinformatics project with the goal of standardizing the representation of gene and gene-product attributes across species and databases
The goal is to provide a specific vocabulary of terms for genes and gene products: Organized into Molecular function, Biological process, Cellular component
Bioinformatics
Gene Ontology: the model organisms (human, yeast and Drosophila) have been annotated using GO terms
This has become a standard for new genomes
HOWEVER, until FUNCTION of a "gene" is documented with functional analysis, an annotated gene is SIMILAR but this does not necessarily equate to the same function
Evolution can modify gene functions
Other Insect Genomes
Anopheles gambiae, vector of malaria, had its genome sequenced (2002)
Cost ca. $10 million using Sanger sequencing
Other genomes completed include Acromyrmex echinatior, Acyrthosiphon pisum, Aedes aegypti, Apis mellifera, Bombyx mori, Culex quinquefasciatus, Danaus plexippus, Glossina species, Heliconius melpomene, Ixodes scapularis, Metaseiulus occidentalis, Nasonia species, Pediculus humanus, Pogonomyrmex barbatus, Rhodnius prolixus, Tetranychus urticae, Tribolium castaneum, Varroa
Other Insect Genomes
This list is incomplete and will continue to increase
Most were sequenced using Next-Gen sequencing methods
Project to sequence 5000 arthropod genomes is underway (i5K)
Should transform our understanding of insect biology and evolution, as well as our ability to manage pests
What have we learned?
Three mosquitoes sequenced: genome sizes vary, details found in VectorBase
Culex quinquefasciatus has expansion of olfactory and gustatory receptors, salivary gland genes and detoxification genes
Bombyx mori has 1874 genes related to silk production
Apis mellifera has few TEs and fewer genes for innate immunity and detoxification enzymes: related to social behavior?
What have we learned?
Tetranychus urticae feeds on > 250 plants and produces silk. It has a small genome (90 Mbp), but many detoxification gene families, perhaps associated with its broad host range
Carotenoid biosynthesis genes in the genome were horizontally transferred from fungi to the mite, as they were in the pea aphid (below)
Acyrthosiphon pisum has many gene duplications and lost some genes; duplications include sugar-transporter proteins, amino-acid transport, antiapoptosis genes, perception of smell, and olfactory behavior
Gene losses include defense response, immune response, taste perception, antimicrobial responses
What have we learned?
Danaus plexippus has a full repertoire of circadian clock genes, expanded chemoreceptors, and genes for defense against cardenolide glycosides
Analysis of Heliconius species (3) indicates this genus of butterflies hybridizes and exchanges genes that provide protective color patterns
7 complete ant genomes (Atta cephalotes, Acromyrmex echinatior, Linepithema humile, Pogonomyrmex barbatus, Harpegnathos saltator, Camponotus floridanus, Solenopsis invicta) and more on the way
Of special interest: origin of eusociality
What have we learned?
Rhodnius prolixus, a vector of Chagas' disease, may provide new methods of control
Ixodes scapularis is a vector of Lyme disease and has a huge and complex genome (in process of being published)
Drosophila species (12) also sequenced in order to compare evolution within the genus
What do you need to do to sequence your insect?
Ideally, you will know the size of the genome
You will be able to develop an inbred line (or have sufficient DNA to sequence the genome from a single haploid individual) so that assembly is easier
Develop an effective DNA extraction protocol to obtain clean DNA with little fragmentation
Annotate the DNA, using automated and manual methods
What do you need to do to sequence your insect?
Ideally, you will have a transcriptome to aid in delimiting exons, introns, and splice variants
Not all genome products are equal: quality can vary
Standard draft, High-quality draft, Improved high-quality draft, Annotation-directed improvement quality draft, Noncontiguous Finished, Finished
The annotations are provisional
One-third to one-half of your "genes" may have no sequence similarity to known genes (= orphan genes)
What do you need to do to sequence your insect?
Functional gene analysis is needed to confirm gene function
Once the genome is published, many years of work may be necessary to understand the biology, behavior, and evolution of this species
TEs as Agents of Genome Evolution
Genome analyses allow better understanding of extent and function of TEs
"natural genetic engineering systems" Kidwell and Lisch 2001
No longer just "selfish" or "junk"
TEs carry costs –> deleterious mutations
Abundant and ancient components of genomes
TEs as Agents of Genome Evolution
TEs can acquire a functional role
HeT-A and TART retrotransposons form the telomeres in D. melanogaster
TEs cause inversions in Drosophila spp
Inversions �tie up� gene combinations to maintain useful combinations
TEs can be activated by environmental and population factors –> increased variability
TEs as Agents of Genome Evolution
TEs have provided novel regulatory regions to preexisting host genes
Allow new proteins to be developed
Function more like symbiosis than parasitism ?
TEs and Genome Evolution
Three outcomes possible
Coevolution of TE-derived mechanisms to minimize negative effects of TEs on their hosts (TE self-regulation, tissue specificity, targeting)
Evolution of host-defense mechanisms (suppressors)
Evolution of new and altered functions of TEs in hosts (regulatory, structural, enzymatic)
Transcriptomics
Transcriptome analyses: Analysis of transcripts of genes, including large and small RNAs
Conducted to discover genes using Sanger or Next-Gen sequencing methods
And to annotate coding and noncoding regions of a sequenced genome
Often conducted before sequencing a genome
Metagenomics
Sampling genomes of a community of microorganisms inhabiting a common environment: Useful for analysis of symbionts of arthropods
Culture-independent method of identifying microbes
Can also provide clues as to function
Proteomics
Proteomics: the genome-wide analysis of proteins
Characterization of proteins and their posttranslational modifications
Comparison of protein levels and types
Studies of protein-protein interactions
Drosophila: 2297 proteins in 10,969 interactions (Guruharsha et al. 2011)
Proteomics
Databases allow comparisons of
Sequence similarity
Protein function
Structure (secondary, quaternary)
Similarities to other proteins
Diseases associated with protein deficiencies
Posttranslational modifications
Functional Genomics
Assignment of function to genes, including understanding the organizational control of genetic pathways that make up the physiology of an organism
Can use DNA microarrays or transcriptomics
TILLING (targeting induced local lesions in genomes): mutagenesis with a chemical mutagen followed by identification of single-base mutations: a type of reverse genetics (analysis from genotype to phenotype)
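The final step of TILLING, identifying single-base mutations, can be illustrated by comparing a mutagenized individual's sequence with the reference for the same region. This is a simplified sketch with a hypothetical function name: it assumes both sequences are the same length and already aligned, and it compares sequences directly rather than modeling any particular laboratory detection method.

```python
def point_mutations(reference, sample):
    """Return (1-based position, reference base, sample base) for each single-base difference."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length (substitutions only)")
    return [
        (position, ref_base, alt_base)
        for position, (ref_base, alt_base)
        in enumerate(zip(reference.upper(), sample.upper()), start=1)
        if ref_base != alt_base
    ]

reference = "ATGGCCAAG"
mutant = "ATGGCTAAG"
print(point_mutations(reference, mutant))
```

Each reported difference is a candidate lesion whose effect on phenotype can then be tested, which is the genotype-to-phenotype direction of reverse genetics.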
Structural Genomics
Large-scale analysis of protein structures and functions based on gene sequences
Developed after genome projects began
Attempts to determine sequence, structure and function in order to predict unknown structures by homology modeling
Probably more difficult than genome analysis
Comparative Genomics
Comparing whole genomes to understand how genomes evolve
How many protein families are there?
How many gene duplications?
How similar are the protein domains and families?
About 30% of all proteins are encoded by orphan genes, with no homology to known genes: could these be the most interesting?
Interactomes or Reactomes
What are the networks of protein-protein interactions?
How do signaling cascades affect cell biology?
Functional Genomics
The assignment of function to genes, including the organization and control of genetic pathways that make up the physiology of an organism
May use gene chips to measure mRNA abundance for tens of thousands of genes simultaneously, resulting in "piles of information but only flakes of knowledge"
The Post-Genomic Era
For past 50 yrs, biology has become ever more "reductionist"
Reductionist approaches allow us to gain detailed information about gene structure, function, regulation, expression
Now biology "is in the midst of an intellectual and experimental sea change"
The Post-Genomic Era
"The future will be the study of the genes and proteins of organisms in the context of their informational pathways or networks" (Leroy Hood, 2000)
In the future it may be possible to monitor simultaneously the expression of genes at the RNA or protein level, all possible protein-protein interactions, all alleles of all genes that affect a trait and all protein-binding sites in a genome
The Post-Genomic Era
An integrative approach to biology provides new challenges to biologists
Mathematical models and computer simulations may be needed to study the integrated function of multiple genes
Bioinformatics methods and systems analyses become more important
Emergent properties: properties that arise from the whole rather than the individual parts