Nucleotide Sequence Databases
Your guide to genes & genomes
Nucleotide Sequence Databases
• First generation
  – GenBank is a representative example
  – started as a sort of museum to preserve knowledge of a sequence from its first discovery
  – great repositories, particularly for long-term study of bioinformatic data
  – flat files; not built for (and not great at) querying
Nucleotide Sequence Databases
• Second generation
  – Entrez Gene is an example
  – information is gene-centric (not just sequence-centric)
  – all sequence information for a given gene can be found in one place
Nucleotide Sequence Databases
• Third generation
  – Ensembl is a good example
  – information is organized around whole genomes; not only a specific gene's structure, but its context:
    • position of this gene relative to others
    • strand orientation
    • how the gene relates to the presence or absence of biochemical functions in the organism
Prokaryotes (& Archaea)
• microscopic organisms
• single cell
• no nucleus
• simple genome:
  – single, circular DNA molecule
  – 600,000 to 8 million base pairs
• 70% of genome codes for proteins
Prokaryotes (& Archaea)
• genes don't overlap
• no introns; mRNA is collinear with gene sequence
• protein sequences derived by translating the longest ORF (ATG to STOP) spanning the gene-transcript sequence
source: http://www.cod.edu/people/faculty/fancher/CellStructure.htm
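The "translate the longest ORF" rule can be sketched in a few lines of Python. This is a minimal illustration, not how GenBank produces its translations: it scans only the forward strand, assumes the standard genetic code, and expects plain A/C/G/T input. The toy sequence is made up.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(dna):
    """Return the longest ATG..STOP stretch on the forward strand (stop included)."""
    dna = dna.upper()
    best = ""
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop codon
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > len(best):
                    best = dna[start:i + 3]
                start = None
    return best


def translate(orf):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(orf) - 2, 3):
        amino_acid = CODON_TABLE[orf[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


toy = "CCATGAAATAGGG"  # made-up sequence, not taken from any database record
orf = longest_orf(toy)
print(orf, translate(orf))
```

A fuller version would also scan the reverse complement, since a gene can sit on either strand.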
Thought for today …
source: http://www.scicomics.com/uploads/prokaryote.jpg
Eukaryotes
• way more complicated
  – genes found in cell nucleus
  – genome size: 10 million to 670 billion base pairs
• much lower gene density than prokaryotes: in human chromosomes, about one gene for every 100,000 base pairs
source: http://www.cod.edu/people/faculty/fancher/CellStructure.htm
Eukaryotes
source: http://www.cit.gu.edu.au/~anthony/dungeon/balcony/
• much less efficient than prokaryotes; less than 5% of human genome codes for protein
• genes transcribed after a promoter region; but process may be strongly influenced by sequence elements relatively far away
Eukaryotes
• Gene sequences and mRNA/protein sequences not collinear; only exons are retained in mature mRNA that encodes protein
• A single gene may (and often does) exhibit more than one mRNA and protein form
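To make the exon/intron point concrete, here is a small sketch that joins exon intervals into a mature-mRNA-like string, and shows how two exon selections from the same gene give two different products. The toy gene and all coordinates are invented for illustration; coordinates are 1-based and inclusive, as in GenBank records.

```python
def splice(gene, exons):
    """Join exon intervals (1-based, inclusive) and drop everything in between."""
    return "".join(gene[start - 1:end] for start, end in exons)


# Made-up toy gene: uppercase = exon, lowercase = intron.
gene = "ATGGCC" + "gtaagtttcag" + "AAATAA"

# Two hypothetical exon selections from the same gene.
isoforms = {
    "long": [(1, 6), (18, 23)],
    "short": [(1, 3), (18, 23)],
}

for name, exons in isoforms.items():
    print(name, splice(gene, exons))
```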
GenBank
• First example: prokaryotic gene
  – point your browser to: http://www.ncbi.nlm.nih.gov/entrez
  – choose Nucleotide from the Search pull-down menu
  – in the For box, type X01714 and click Go
  – click the link labeled X01714
  – can "Send To Text" if you want to save the file
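The same record can also be retrieved without the browser. The sketch below is an alternative to the steps on the slide, not part of them: it uses NCBI's E-utilities `efetch` endpoint with the `nuccore` (Nucleotide) database. The helper names `efetch_url` and `fetch_record` are made up here, and `fetch_record` needs network access, so only the URL is printed.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="gb"):
    """Build an efetch URL; rettype 'gb' asks for the GenBank flat file, 'fasta' for FASTA."""
    params = {"db": "nuccore", "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def fetch_record(accession, rettype="gb"):
    """Download the record as text (requires an internet connection)."""
    with urlopen(efetch_url(accession, rettype)) as response:
        return response.read().decode("utf-8")


print(efetch_url("X01714"))
```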
GenBank fields
• LOCUS
  – size of sequence (in base pairs)
  – nature of molecule (e.g. DNA or RNA)
  – topology (linear or circular)
• DEFINITION: brief description of gene
• ACCESSION: unique identifier for this (and some other) databases
• VERSION: the accession number plus a version suffix (accession.version) that increments whenever the sequence changes; older records also list a GI number here
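As a rough illustration of how the LOCUS line carries these values, the sketch below splits one on whitespace. The sample line is invented (it is not the LOCUS line of X01714), and the code assumes the molecule type and topology directly follow the "bp" token, which may not hold for every older record.

```python
def parse_locus(line):
    """Pull name, length, molecule type and topology out of a LOCUS line."""
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    bp = tokens.index("bp")  # the length sits just before the 'bp' token
    return {
        "name": tokens[1],
        "length": int(tokens[bp - 1]),
        "molecule": tokens[bp + 1],
        "topology": tokens[bp + 2],
    }


# Made-up example line, for illustration only.
sample = "LOCUS       TOYSEQ01     1200 bp    DNA     linear   BCT 01-JAN-2000"
print(parse_locus(sample))
```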
GenBank fields
• KEYWORDS: list of terms related to entry; can be used for keyword searching for related data
• SOURCE: common name of relevant organism
• ORGANISM: complete id, with taxonomic classification
  – note that ORGANISM is indented under SOURCE; this indicates that ORGANISM is a subordinate term, or subsection, of SOURCE
GenBank fields
• REFERENCE: credits author(s) who initially determined the sequence; includes subsections:
  – AUTHORS
  – TITLE
  – JOURNAL
  – PUBMED
• COMMENT: free-formatted text that doesn’t fit in another category
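The indentation rule (ORGANISM under SOURCE, AUTHORS under REFERENCE) is what lets a program rebuild the hierarchy from a flat file. The sketch below assumes the usual flat-file layout: top-level keywords start in column 1, sub-keywords are indented a few spaces, and continuation lines are indented 12 spaces. The sample text and organism name are invented, and a repeated keyword such as a second REFERENCE would overwrite the first in this simplified version.

```python
def parse_header(text):
    """Group header lines into {KEYWORD: {"value": str, "sub": {SUBKEYWORD: str}}}."""
    record = {}
    top = sub = None
    for line in text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip(" "))
        if indent == 0:
            # Top-level keyword such as SOURCE or COMMENT.
            key, _, value = line.partition(" ")
            top, sub = key, None
            record[top] = {"value": value.strip(), "sub": {}}
        elif indent < 12 and top is not None:
            # Indented sub-keyword such as ORGANISM.
            key, _, value = line.strip().partition(" ")
            sub = key
            record[top]["sub"][sub] = value.strip()
        elif top is not None:
            # Continuation of whatever field came last.
            if sub is None:
                record[top]["value"] += " " + line.strip()
            else:
                record[top]["sub"][sub] += " " + line.strip()
    return record


# Invented header fragment, for illustration only.
sample = """\
SOURCE      Examplebacter toyus
  ORGANISM  Examplebacter toyus
            Bacteria; Examplebacteria.
COMMENT     Toy comment text.
"""
print(parse_header(sample))
```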
GenBank fields
• FEATURES: table describing gene regions and associated biological properties
  – source: origin of specific regions of sequence; useful for distinguishing cloning vectors from host sequences
  – promoter: precise coordinates of promoter element in the sequence; may be more than one of these
  – misc_feature: in this example, indicates (putative) location of transcription start (mRNA synthesis)
  – RBS (ribosome binding site): location of last upstream element
  – CDS (coding sequence): describes the ORF
GenBank fields: FEATURES: CDS
• gives coordinates from initial nucleotide (ATG) to last nucleotide of stop codon (TAA)
• several lines follow, listing protein products, reading frame to use, genetic code to apply and several IDs for the protein sequence
• /translation section gives computer translation of sequence into amino acid sequence
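A quick sanity check on a CDS location can be sketched as follows: slice the stated coordinates out of the sequence and confirm that the stretch starts with ATG, ends with a stop codon, and is a whole number of codons long. The toy sequence and the location "3..14" are invented, and the ATG test is a simplification, since some real CDS features use alternative start codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def extract_cds(sequence, location):
    """Slice a simple 'start..end' location (1-based, inclusive) out of a sequence."""
    start, end = (int(n) for n in location.split(".."))
    return sequence[start - 1:end].upper()


def looks_like_cds(cds):
    """True if the stretch starts with ATG, ends with a stop and has whole codons."""
    return len(cds) % 3 == 0 and cds.startswith("ATG") and cds[-3:] in STOP_CODONS


toy = "ggATGGCCAAATAAcc"  # made-up sequence; lowercase = flanking bases
cds = extract_cds(toy, "3..14")
print(cds, looks_like_cds(cds))
```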
Last Section: sequence itself
• This is the most important section in terms of analysis using other tools
• Can isolate just this section and save the file, as follows:
  – Choose FASTA from the Display pull-down menu (top of page)
  – Choose Text in the Send To pull-down menu
  – Use File/Save As to save the file
    • use "Text" as file type
    • give the file a name that you'll know to associate with this particular sequence
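Once the FASTA text is saved, other tools only need the header line and the sequence. A minimal sketch of reading and writing that format is shown below; the function names and the toy record are made up, and the reader handles a single record only.

```python
def parse_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must start with a '>' header line")
    return lines[0][1:], "".join(lines[1:]).upper()


def to_fasta(header, sequence, width=70):
    """Format a record as FASTA, wrapping the sequence at `width` characters."""
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(rows) + "\n"


header, sequence = parse_fasta(">toy example\nATGC\natgc\n")  # made-up record
print(header, sequence)
print(to_fasta(header, sequence, width=4))
```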
Example 2: eukaryotic mRNA
• Can obtain this example by searching Nucleotide database for U90223
• Similar to prokaryote example, because we're looking at a direct coding sequence for a protein (an mRNA, not genomic DNA)
• Notes on example:
  – KEYWORDS field is empty: this is an example of an incomplete annotation
  – remember, you're looking at a primary database!
  – FEATURES field contains some new terms:
    • sig_peptide: location of mitochondrial targeting sequence
    • mat_peptide: exact boundaries of mature peptide
Example 3: Eukaryotic gene
• Can obtain this record by searching Nucleotide for AF018430
• General information:
  – LOCUS: same info as previous examples; note the locus name is different from the accession number this time
  – DEFINITION: specifies exon; remember, protein-coding regions in eukaryotes are not contiguous as in prokaryotes
  – SEGMENT: indicates this is the second of 4; you'd need all 4 to reconstruct the mRNA that codes for the protein
Eukaryotic gene: FEATURES section
• source subsection includes a /map section:
  – indicates chromosome (15)
  – arm (q means long arm)
  – cytogenetic band (q21.1)
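The three parts of a /map value can be pulled apart with a small regular expression. This is a sketch that handles only simple single-band values such as 15q21.1; ranges of bands and other notations are not covered, and the function name is made up.

```python
import re

# chromosome (number, X or Y), arm (p = short, q = long), band (e.g. 21.1)
MAP_PATTERN = re.compile(r"^(\d{1,2}|X|Y)([pq])(\d+(?:\.\d+)?)$")


def parse_map(location):
    """Split a simple cytogenetic location into chromosome, arm and band."""
    match = MAP_PATTERN.match(location)
    if match is None:
        raise ValueError("unsupported map location: " + location)
    chromosome, arm, band = match.groups()
    return {
        "chromosome": chromosome,
        "arm": "short" if arm == "p" else "long",
        "band": arm + band,
    }


print(parse_map("15q21.1"))
```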
Eukaryotic gene: FEATURES section
• gene subsection: describes how to reconstruct the mRNAs found in this and separate entries:
  – the strings that begin "AF" refer to the GenBank entries (remember, this one was AF018430), and the numbers represent the nucleotide positions from the entries
  – if a set of numbers (example: 1..1177) is NOT preceded by an entry indicator, it's from the current entry
  – the < and > signs mark a partial feature: the true start lies before (<) or the true end lies beyond (>) the position given
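Reading a join(...) location by program follows the same rules as reading it by eye. The sketch below covers only the cases described on this slide: plain start..end ranges, an optional entry prefix, and the < and > markers. The location string is invented, and the "AFXXXXX" accessions in it are placeholders rather than the real neighbouring entries.

```python
import re

PART = re.compile(
    r"^(?:(?P<entry>[A-Za-z_0-9.]+):)?"  # optional 'ACCESSION.version:' prefix
    r"(?P<lt><)?(?P<start>\d+)\.\.(?P<gt>>)?(?P<end>\d+)$"
)


def parse_join(location):
    """Split a join(...) location into a list of spans."""
    inner = location.strip()
    if inner.startswith("join(") and inner.endswith(")"):
        inner = inner[5:-1]
    spans = []
    for piece in inner.split(","):
        match = PART.match(piece.strip())
        if match is None:
            raise ValueError("unsupported location piece: " + piece)
        spans.append({
            "entry": match.group("entry"),  # None means the current entry
            "start": int(match.group("start")),
            "end": int(match.group("end")),
            "partial_start": match.group("lt") is not None,
            "partial_end": match.group("gt") is not None,
        })
    return spans


# Invented location string with placeholder accessions.
example = "join(AFXXXXXA.1:<1..50,1..1177,AFXXXXXB.1:1..>300)"
for span in parse_join(example):
    print(span)
```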
Eukaryotic gene: FEATURES section
• mRNA section: can be read in a similar manner to the gene section
• note that there are two mRNA sections (each followed by a CDS section)
  – first section describes mitochondrial RNA
  – second section describes nuclear RNA
• exon section: indicates position of exon(s) in sequence
Retrieving GenBank entries without accession numbers
• Search Nucleotide for specific product you're interested in; for example:
  human[organism] AND dUTPase[Protein name]
  – this search yields several entries; can click the Links link to the right of one of these (AF018432) and choose Related Sequences from the pull-down that appears
  – retrieves several more entries, some DNA and some mRNA
  – terms used in the titles of these entries can give us additional search criteria:
    human[organism] AND "dUTP pyrophosphatase"[Title]
  – yields somewhat different set of entries
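The two searches above can also be assembled by program. The sketch below builds the query strings and turns them into URLs for NCBI's E-utilities `esearch` endpoint; this is an alternative to typing them into the search box, and the helper names `build_term` and `esearch_url` are made up. Nothing is sent over the network here.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(clauses):
    """Combine (value, field) pairs with AND, quoting multi-word values."""
    parts = []
    for value, field in clauses:
        if " " in value:
            value = '"' + value + '"'
        parts.append(f"{value}[{field}]")
    return " AND ".join(parts)


def esearch_url(term, retmax=20):
    """Build an esearch URL against the Nucleotide (nuccore) database."""
    return ESEARCH + "?" + urlencode({"db": "nuccore", "term": term, "retmax": retmax})


first = build_term([("human", "organism"), ("dUTPase", "Protein name")])
second = build_term([("human", "organism"), ("dUTP pyrophosphatase", "Title")])
print(first)
print(second)
print(esearch_url(first))
```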