Nucleotide Sequence Databases


Your guide to genes & genomes


• First generation:
  – GenBank is a representative example
  – started as sort of a museum to preserve knowledge of a sequence from first discovery
  – great repositories, particularly for long-term study of bioinformatic data
  – flat files; not built for (and not great at) querying

• Second generation:
  – Entrez Gene is an example
  – information is gene-centric (not just sequence-centric)
  – all sequence information for a given gene can be found in one place

• Third generation:
  – Ensembl is a good example
  – information is organized around whole genomes; not only a specific gene’s structure, but its context:
    • position of this gene relative to others
    • strand orientation
    • how the gene relates to the presence or absence of biochemical functions in the organism

Prokaryotes (& Archaea)
• microscopic organisms
• single cell
• no nucleus
• simple genome:
  – single, circular DNA molecule
  – 600,000 to 8 million base pairs
• 70% of genome codes for proteins

Prokaryotes (& Archaea)

• genes don’t overlap
• no introns; mRNA is collinear with gene sequence
• protein sequences derived by translating the longest ORF (ATG to STOP) spanning the gene-transcript sequence

source: http://www.cod.edu/people/faculty/fancher/CellStructure.htm
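The longest-ORF rule above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure GenBank annotators use: it scans only the three forward reading frames, assumes the standard genetic code, and the function names `longest_orf` and `translate` and the toy sequence are made up for the example.

```python
# Standard genetic code, built compactly with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    Returns None when no complete ORF exists.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None
    return best


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3].upper(), "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


toy = "CCATGAAATTTGGGTAACC"  # made-up sequence, not from any database record
start, end = longest_orf(toy)
print(start, end, translate(toy[start:end]))  # 2 17 MKFG
```

A fuller version would also scan the reverse complement, since a gene can lie on either strand.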


Eukaryotes
• way more complicated
  – genes found in cell nucleus
  – genome size: roughly 10 million to 670 billion base pairs (the human genome is about 3 billion)
• much lower gene density than prokaryotes: in human chromosomes, about one gene for every 100,000 base pairs

source: http://www.cod.edu/people/faculty/fancher/CellStructure.htm

Eukaryotes

source: http://www.cit.gu.edu.au/~anthony/dungeon/balcony/

• much less efficient use of genome space than prokaryotes; less than 5% of the human genome codes for protein

• genes transcribed after a promoter region; but process may be strongly influenced by sequence elements relatively far away

Eukaryotes

• Gene sequences and mRNA/protein sequences not collinear; only exons are retained in mature mRNA that encodes protein

• A single gene may (and often does) exhibit more than one mRNA and protein form
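To make the exon/intron point concrete, here is a small Python sketch that joins exon intervals into a mature transcript and then builds a second form by skipping an exon. The gene, its coordinates, and the function name `splice_exons` are invented for illustration; real splicing is decided by the cell, not by a coordinate list.

```python
def splice_exons(gene_seq, exons):
    """Join exon intervals (1-based, inclusive, GenBank-style) into one transcript."""
    return "".join(gene_seq[start - 1:end] for start, end in sorted(exons))


# Hypothetical toy gene: uppercase = exon, lowercase = intron.
gene = "ATGGCC" + "gtaagtttag" + "AAATTT" + "gtcccctcag" + "GGGTAA"

# All three exons retained.
mature = splice_exons(gene, [(1, 6), (17, 22), (33, 38)])

# A second mRNA form from the same gene: the middle exon is skipped.
skipped = splice_exons(gene, [(1, 6), (33, 38)])

print(mature)   # ATGGCCAAATTTGGGTAA
print(skipped)  # ATGGCCGGGTAA
```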

GenBank

• First example: prokaryotic gene
  – point your browser to: http://www.ncbi.nlm.nih.gov/entrez
  – choose Nucleotide from the Search pull-down menu
  – in the For box, type X01714 and click Go
  – click the link labeled X01714
  – can “Send To Text” if you want to save the file
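The same record can also be retrieved without the browser. The sketch below uses NCBI's E-utilities `efetch` endpoint, which is not part of these slides; the helper names `efetch_url` and `fetch_record` are made up, and the network call is left inside a function so nothing is downloaded until you call it.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="gb"):
    """Build an efetch URL for a nucleotide record ("gb" = GenBank flat file)."""
    params = {"db": "nuccore", "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def fetch_record(accession, rettype="gb"):
    """Download the record as text (requires network access)."""
    with urlopen(efetch_url(accession, rettype)) as response:
        return response.read().decode("utf-8")


print(efetch_url("X01714"))
# To save the flat file locally:
#     with open("X01714.gb", "w") as handle:
#         handle.write(fetch_record("X01714"))
```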

GenBank fields

• LOCUS
  – size of sequence (in base pairs)
  – nature of molecule (e.g. DNA or RNA)
  – topology (linear or circular)
• DEFINITION: brief description of gene
• ACCESSION: unique identifier for this (and some other) databases
• VERSION: accession number with a version suffix that is incremented when the sequence is revised; may also list other synonymous ID numbers (such as the GI number)

GenBank fields

• KEYWORDS: list of terms related to entry; can be used for keyword searching for related data
• SOURCE: common name of relevant organism
• ORGANISM: complete id, with taxonomic classification
  – note that ORGANISM is indented under SOURCE; this indicates that ORGANISM is a subordinate term, or subsection, of SOURCE
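The indentation rule is easy to see in code. The sketch below assumes the usual flat-file layout (keyword in the first 12 columns, data from column 13, continuation lines with a blank keyword area) and uses a made-up record; the organism name, accession, and the function name `parse_header` are placeholders, not real data.

```python
def parse_header(text):
    """Parse flat-file header lines into (keyword, parent_keyword, value) tuples."""
    fields = []          # each item: [keyword, parent, value]
    current_top = None   # most recent keyword that started in column 1
    for line in text.splitlines():
        if not line.strip():
            continue
        label, value = line[:12], line[12:].strip()
        if not label.strip():
            # Blank keyword area: continuation of the previous field.
            if fields:
                fields[-1][2] = (fields[-1][2] + " " + value).strip()
        elif label.startswith(" "):
            # Indented keyword: subordinate to the last top-level keyword.
            fields.append([label.strip(), current_top, value])
        else:
            current_top = label.strip()
            fields.append([current_top, None, value])
    return [tuple(field) for field in fields]


# Hypothetical record, built with ljust so the column layout is explicit.
SAMPLE = "\n".join([
    "DEFINITION".ljust(12) + "Hypothetical example gene, complete cds.",
    "ACCESSION".ljust(12) + "ZZ000000",
    "SOURCE".ljust(12) + "Examplebacter fictus",
    "  ORGANISM".ljust(12) + "Examplebacter fictus",
    " " * 12 + "Bacteria; Examplephyla; Exampleales.",
])

for keyword, parent, value in parse_header(SAMPLE):
    print(keyword, "(under " + parent + ")" if parent else "", "->", value)
```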

GenBank fields

• REFERENCE: credits author(s) who initially determined the sequence; includes subsections:
  – AUTHORS
  – TITLE
  – JOURNAL
  – PUBMED
• COMMENT: free-formatted text that doesn’t fit in another category

GenBank fields

• FEATURES: table describing gene regions and associated biological properties
  – source: origin of specific regions of sequence; useful for distinguishing cloning vectors from host sequences
  – promoter: precise coordinates of promoter element in the sequence; may be more than one of these
  – misc_feature: in this example, indicates (putative) location of transcription start (mRNA synthesis)
  – RBS (ribosome binding site): location of last upstream element
  – CDS (coding sequence): describes the ORF

GenBank fields: FEATURES: CDS

• gives coordinates from initial nucleotide (ATG) to last nucleotide of stop codon (TAA)

• several lines follow, listing protein products, reading frame to use, genetic code to apply and several IDs for the protein sequence

• /translation section gives computer translation of sequence into amino acid sequence

Last Section: sequence itself
• This is the most important section in terms of analysis using other tools
• Can isolate just this section and save the file, as follows:
  – Choose FASTA from the Display pull-down menu (top of page)
  – Choose Text in the Send To pull-down menu
  – Use File/Save As to save the file
    • use “Text” as file type
    • give the file a name that you’ll know to associate with this particular sequence
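If you already have the full flat file saved, the sequence section can also be pulled out programmatically. This sketch assumes the sequence sits between an `ORIGIN` line and the closing `//` line, with position numbers and spaces to strip; the function name `origin_to_fasta` and the toy record are made up for the example.

```python
def origin_to_fasta(record_text, header, width=70):
    """Convert the ORIGIN section of a GenBank flat file into FASTA text."""
    in_origin = False
    blocks = []
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Drop the leading position number and the spaces between blocks.
            blocks.append("".join(ch for ch in line if ch.isalpha()))
    sequence = "".join(blocks).upper()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# Hypothetical miniature record (only the part this function reads).
toy_record = "ORIGIN\n        1 atggccaaat ttgggtaa\n//\n"

fasta_text = origin_to_fasta(toy_record, "toy_sequence")
print(fasta_text)
# To save it under a name you will recognize later:
#     with open("toy_sequence.fasta", "w") as handle:
#         handle.write(fasta_text)
```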

Example 2: eukaryotic mRNA
• Can obtain this example by searching Nucleotide database for U90223
• Similar to prokaryote example, because we’re looking at a direct coding sequence for a protein: mRNA, in other words, not genomic DNA
• Notes on example:
  – KEYWORDS field is empty: this is an example of an incomplete annotation
  – remember, you’re looking at a primary database!
  – FEATURES field contains some new terms:
    • sig_peptide: location of mitochondrial targeting sequence
    • mat_peptide: exact boundaries of mature peptide

Example 3: Eukaryotic gene
• Can obtain this record by searching Nucleotide for AF018430
• General information:
  – LOCUS: same info as previous examples; note the locus name is different from the accession number this time
  – DEFINITION: specifies exon; remember, protein-coding regions in eukaryotes are not contiguous as in prokaryotes
  – SEGMENT: indicates this is the second of 4; you’d need all 4 to reconstruct the mRNA that codes for the protein

Eukaryotic gene: FEATURES section

• source subsection includes a /map section:
  – indicates chromosome (15)
  – arm (q means long arm)
  – cytogenetic band (q21.1)


• gene subsection: describes how to reconstruct the mRNAs found in this and separate entries:
  – the strings that begin “AF” refer to the GenBank entries (remember, this one was AF018430), and the numbers represent the nucleotide positions from the entries
  – if a set of numbers (example: 1..1177) is NOT preceded by an entry indicator, it’s from the current entry
  – the < and > signs indicate that the feature is partial: it extends beyond the listed start (<) or end (>) position
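Reading these location strings by eye gets tedious, so here is a rough Python sketch that splits one into its segments. The location string is invented for illustration (it is not copied from the AF018430 record), the function name `parse_location` is hypothetical, and the pattern handles only simple `start..end` ranges: no `complement(...)`, single-base positions, or other operators.

```python
import re

SEGMENT = re.compile(
    r"(?:(?P<entry>[A-Za-z_]+\d+(?:\.\d+)?):)?"  # optional "ACCESSION.version:" prefix
    r"(?P<lt><)?(?P<start>\d+)"                  # "<" marks a partial start
    r"\.\."
    r"(?P<gt>>)?(?P<end>\d+)"                    # ">" marks a partial end
)


def parse_location(location, current_entry):
    """Split a join(...) location into per-segment dictionaries."""
    segments = []
    for match in SEGMENT.finditer(location):
        segments.append({
            # No prefix means the positions refer to the current entry.
            "entry": match.group("entry") or current_entry,
            "start": int(match.group("start")),
            "end": int(match.group("end")),
            "partial_start": match.group("lt") is not None,
            "partial_end": match.group("gt") is not None,
        })
    return segments


# Made-up example in the style of a multi-entry gene feature.
example = "join(AF018429.1:<1..200,1..1177,AF018431.1:1..>300)"
for segment in parse_location(example, current_entry="AF018430"):
    print(segment)
```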


• mRNA section: can be read in a similar manner to the gene section

• note that there are two mRNA sections (each followed by a CDS section)
  – first section describes the mRNA for the mitochondrial form
  – second section describes the mRNA for the nuclear form

• exon section: indicates position of exon(s) in sequence

Retrieving GenBank entries without accession numbers

• Search Nucleotide for specific product you’re interested in; for example:
  human[organism] AND dUTPase[Protein name]
  – this search yields several entries; can click the Links link to the right of one of these (AF018432) and choose Related Sequences from the pull-down that appears
  – retrieves several more entries, some DNA and some mRNA
  – terms used in the titles of these entries can give us additional search criteria:
    human[organism] AND “dUTP pyrophosphatase”[Title]
  – yields somewhat different set of entries
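The same fielded queries can be sent to NCBI's E-utilities `esearch` endpoint instead of the web form. This is a sketch under that assumption: the endpoint is not mentioned in the slides, the helper name `esearch_url` is made up, and the code only builds the URL (the response is XML listing matching record IDs, which you would fetch and parse separately).

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="nuccore", retmax=20):
    """Build an esearch URL; urlencode escapes the brackets, quotes, and spaces."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


first_query = "human[organism] AND dUTPase[Protein name]"
second_query = 'human[organism] AND "dUTP pyrophosphatase"[Title]'

print(esearch_url(first_query))
print(esearch_url(second_query))
```

Comparing the ID lists returned for the two queries is one way to see how the result sets differ.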
