NCBI FieldGuide A Minimal Guide to NCBI Nucleotide Resources

Types of Databases

• Primary Databases

– Original submissions by experimentalists

– Content controlled by the submitter

• Examples: GenBank, SNP, GEO

• Derivative Databases

– Built from primary data

– Content controlled by third party (NCBI)

• Examples: RefSeq, TPA, RefSNP, UniGene, GEO Datasets, NCBI Protein, Structure, Conserved Domain

Accessing the Data: Entrez

all[filter]

International Sequence Database Collaboration

• GenBank (NCBI, NIH): accessed through Entrez

• EMBL (EBI): accessed through SRS

• DDBJ (NIG, CIB): accessed through getentry

• Each database accepts submissions and updates

GenBank: NCBI’s Primary Sequence Database

ftp://ftp.ncbi.nih.gov/genbank/

ftp://genbank.sdsc.edu/pub

ftp://bio-mirror.net/biomirror/genbank

Release 142, June 2004

• 35,532,003 records

• 40,325,321,348 nucleotides

• >140,000 species

• 153 gigabytes

• 634 files

• full release every two months

• incremental and cumulative updates daily

• available only through the internet

• release notes: gbrel.txt

A GenBank Record

LOCUS       NM_000588   924 bp   mRNA   linear   PRI   07-APR-2003
DEFINITION  Homo sapiens interleukin 3 (colony-stimulating factor, multiple) (IL3), mRNA.
ACCESSION   NM_000588
VERSION     NM_000588.3  GI:28416914
KEYWORDS    .
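The header fields above follow a keyword-then-value layout, so they can be separated with plain string handling. The following is a minimal sketch, assuming the flat-file text is already available as a string. `parse_header` is a hypothetical helper written for this illustration, not an NCBI tool, and it handles only the header keywords shown on the slide.

```python
# Header lines from the slide, laid out one keyword per line.
RECORD = """\
LOCUS       NM_000588   924 bp   mRNA   linear   PRI   07-APR-2003
DEFINITION  Homo sapiens interleukin 3 (colony-stimulating factor,
            multiple) (IL3), mRNA.
ACCESSION   NM_000588
VERSION     NM_000588.3  GI:28416914
KEYWORDS    .
"""


def parse_header(text):
    """Map each header keyword to its value, joining continuation lines."""
    fields = {}
    key = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] != " ":
            # A keyword starts in column 1; the rest of the line is its value.
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
        elif key is not None:
            # Indented lines continue the previous keyword's value.
            fields[key] += " " + line.strip()
    return fields


header = parse_header(RECORD)

# VERSION carries both the accession.version string and the GI number.
accession, version_number = header["VERSION"].split()[0].split(".")
gi = header["VERSION"].split("GI:")[1]

# LOCUS packs several facts into one line; pick out length and division.
locus_parts = header["LOCUS"].split()
length_bp = int(locus_parts[1])
division = locus_parts[-2]
```

Here `accession` and `version_number` separate the stable identifier from its revision counter, which is the distinction the sequence revision history relies on.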

GenBank Record: Feature Table

GenPept identifiers:

/protein_id="NP_000579.2"
/db_xref="GI:28416915"

GenBank Record, Cont’d

Sequence Revision History

NM_000588

Sequence Revision History: choose records

Display and Save Options

FASTA format (NCBI)
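A FASTA record is a definition line starting with ">" followed by one or more lines of sequence. The sketch below reads such text and splits an NCBI-style definition line into its identifier fields. It assumes the pipe-separated layout `gi|number|ref|accession.version|` that NCBI used in this period. The identifiers come from the IL3 record shown earlier, but the sequence is a short made-up placeholder, and `read_fasta` and `split_defline` are hypothetical helper names.

```python
# Placeholder sequence: NOT the real NM_000588 sequence.
EXAMPLE = """\
>gi|28416914|ref|NM_000588.3| Homo sapiens interleukin 3 (colony-stimulating factor, multiple) (IL3), mRNA
ACGTACGTAC
GTACGT
"""


def read_fasta(text):
    """Yield (defline, sequence) pairs from FASTA-formatted text."""
    defline, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if defline is not None:
                yield defline, "".join(chunks)
            defline, chunks = line[1:], []
        else:
            chunks.append(line)
    if defline is not None:
        yield defline, "".join(chunks)


def split_defline(defline):
    """Split an NCBI-style defline into identifier fields and a title."""
    ids, _, title = defline.partition(" ")
    parts = ids.strip("|").split("|")
    # Pair tags with values: gi -> 28416914, ref -> NM_000588.3
    return dict(zip(parts[0::2], parts[1::2])), title


records = list(read_fasta(EXAMPLE))
ids, title = split_defline(records[0][0])
```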

Abstract Syntax Notation: ASN.1

FASTA Nucleotide

FASTA Protein

GenPept GenBank

ASN.1

Bulk Divisions

• Expressed Sequence Tag (EST)

– 1st pass single read cDNA

• Genome Survey Sequence (GSS)

– 1st pass single read gDNA

• High Throughput Genomic (HTG)

– incomplete sequences of genomic clones

• Sequence Tagged Site (STS)

– PCR-based mapping reagents

• Batch submissions (email and ftp)

• Inaccurate

• Poorly characterized

NCBI’s Derivative Sequence Databases

Primary vs. Derivative Databases

GenBank (primary): updated ONLY by submitters

• Sources: sequencing centers, labs

• Divisions: EST, STS, GSS, HTG, PRI, ROD, PLN, MAM, BCT, INV, VRT, PHG, VRL

Derivative databases: updated continually by NCBI

• UniGene, UniSTS, RefSeq

• RefSeq: LocusLink and Genomes pipelines; annotation pipeline

• Built by curators and algorithms

Why Make Reference Sequences?

Entrez Protein query:

topoisomerase II alpha[title] AND human[organism]

[Figure: results of this query, with records labeled as splice variants, as identical to AAC77388 or P11388, as differing by 5 aa (Δ = 5 aa), and as the RefSeq protein]

RefSeq Benefits

• non-redundant, best representative

• updates to reflect current sequence data and biology

• distinct, stable accession series

– genomes

– transcripts

– proteins

Reference Sequence: RefSeq

Accession          Sequence Type
NM_123456789       mRNA
NP_123456789       protein, from NM_
NR_123456          non-coding RNA
XM_123456          predicted mRNA
XP_123456          predicted protein
XR_123456          predicted non-coding RNA
ZP_12345678        predicted from NZ_

NC_123456          genomic, e.g., chromosomes
NG_123455          genomic, incomplete region
NT_123456          genomic, BAC assembly
NW_123456          genomic, WGS assembly
NZ_ABCD12345678    genomic, WGS collection

blue = curated

RefSeq Key: http://www.ncbi.nlm.nih.gov/RefSeq/key.html
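Because the two letters before the underscore determine the sequence type, an accession can be classified with a simple lookup. The sketch below encodes only the prefixes listed on this slide. `refseq_type` is a hypothetical helper, and the table is a snapshot of the slide, not a complete or current list of RefSeq prefixes.

```python
# Prefix table copied from the slide above.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein, from NM_",
    "NR": "non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "ZP": "predicted from NZ_",
    "NC": "genomic, e.g., chromosomes",
    "NG": "genomic, incomplete region",
    "NT": "genomic, BAC assembly",
    "NW": "genomic, WGS assembly",
    "NZ": "genomic, WGS collection",
}


def refseq_type(accession):
    """Return the slide's sequence type for a RefSeq accession, else None."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        # No underscore: not in the RefSeq accession series.
        return None
    return REFSEQ_PREFIXES.get(prefix)


il3_mrna = refseq_type("NM_000588.3")
il3_protein = refseq_type("NP_000579.2")
not_refseq = refseq_type("AAC77388")
```

The underscore test is what separates RefSeq accessions from GenBank-style accessions such as AAC77388, which is why the series is described as distinct.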

RefSeq Status Codes

REVIEWED: by NCBI staff or by a collaborator. Some RefSeq records may incorporate expanded sequence and annotation information including additional publications and features.

VALIDATED: in an initial review to provide the preferred sequence standard; not yet subjected to final review at which time additional functional information may be provided.

PROVISIONAL: the record has not yet been subject to individual review and is thought to be well supported and to represent a valid transcript and protein.

PREDICTED: may represent an ab initio prediction or may be partially supported by other transcript data; the protein is predicted.

INFERRED: by genome sequence analysis.

MODEL: provided via automated processing and not subjected to individual review or revision between builds.

Third Party Annotation (TPA) Database

• Annotations of existing GenBank sequences

• Allows for community annotation of genomes

• Direct submissions

– BankIt

– Sequin

Other Databases at the NCBI

• dbSNP: nucleotide polymorphisms

• GEO (Gene Expression Omnibus): microarray and other expression data

• GEO DataSets: curated reports of GEO data; collections of biologically and mathematically comparable GEO Samples

• Structure: imported structures (PDB), Cn3D viewer, NCBI curation

• CDD (Conserved Domain Database): protein families (COGs and KOGs), single domains (PFAM, SMART, CD)

NCBI’s SNP Database

• Primary and derivative (RefSNP)

• Single nucleotide polymorphisms

• Repeat polymorphisms

• Insertion-deletion polymorphisms

• 24 Species

• Over 11 million refSNPs (rsXXXXXXX)

RefSNP

• Non-redundant

• Computational analysis

– BLAST hits to genome, mRNA, protein

Using Entrez

An integrated database search and retrieval system

Entrez: Database Integration

• Linked databases: PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure, Genomes, Taxonomy

• Link types: Word weight, VAST, BLAST, Phylogeny

Home Page: Global Entrez Portal

hfe

Global Entrez Search: HFE

Entrez Nucleotide: HFE

218 records

Not HFE [Title]

Smarter Query

hfe[title] AND human[orgn]

39 records

Curated HFE splice variants (11 total)

hfe[title] AND human[orgn] (cont’d)

Primary data

Finding Primary Sequences

• Entrez Nucleotide

99+% GenBank (primary data)

– srcdb ddbj/embl/genbank[properties] = 39,849,856 records

<1% RefSeq (curated data)

– srcdb refseq[properties] = 304,945 records

• Useful search terms in [Properties]:

– srcdb : source database (e.g., srcdb genbank[prop])

– gbdiv : GenBank division (e.g., gbdiv est[prop])

– biomol : biomolecule type (e.g., biomol mrna[prop])
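The property terms above are plain strings joined with Boolean operators, so a query can be assembled programmatically. The sketch below builds one such query and wraps it in a URL for NCBI's E-utilities ESearch service. Using E-utilities is an assumption of this example, since the slides only show the Entrez web interface. `prop`, `all_of`, and `esearch_url` are hypothetical helpers, and the property terms date from 2004, so current Entrez may not recognize them. No network request is made.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch endpoint (not from the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def prop(term):
    """Tag a term with the [prop] field, e.g. 'gbdiv est' -> 'gbdiv est[prop]'."""
    return f"{term}[prop]"


def all_of(*terms):
    """Join query terms with AND."""
    return " AND ".join(terms)


def esearch_url(db, query):
    """Build an ESearch URL; urlencode escapes spaces and square brackets."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query})


# Human HFE records restricted to the curated RefSeq subset.
query = all_of("hfe[title]", "human[orgn]", prop("srcdb refseq"))
url = esearch_url("nucleotide", query)
```

Swapping `prop("srcdb refseq")` for `prop("srcdb ddbj/embl/genbank")` would target the primary data instead.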

Database Queries

#1  hfe                                     116
#2  hfe[title] AND human[orgn]               42
#3  #2 AND srcdb refseq[prop]                11
#4  #2 AND srcdb ddbj/embl/genbank[prop]     31
#5  #2 AND gbdiv pri[prop]                   29
#6  #2 AND gbdiv est[prop]                    2

Primate division: gbdiv pri[prop]

EST division: gbdiv est[prop]
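The `#2` notation refers back to an earlier query in the search history, so each later line is shorthand for a longer query. The sketch below shows how such references expand into full query strings. It is an illustration only: `expand` is a hypothetical helper, the queries are numbered in the order listed on the slide, and this is not how Entrez resolves history internally.

```python
import re

# Search history from the slide, numbered in the order listed.
HISTORY = {
    1: "hfe",
    2: "hfe[title] AND human[orgn]",
    3: "#2 AND srcdb refseq[prop]",
    4: "#2 AND srcdb ddbj/embl/genbank[prop]",
    5: "#2 AND gbdiv pri[prop]",
    6: "#2 AND gbdiv est[prop]",
}


def expand(number, history):
    """Replace each #N reference with the parenthesized query it names."""
    def substitute(match):
        # Recurse so that references to references are also resolved.
        return "(" + expand(int(match.group(1)), history) + ")"

    return re.sub(r"#(\d+)", substitute, history[number])


refseq_query = expand(3, HISTORY)
est_query = expand(6, HISTORY)
```

Parenthesizing the substituted text keeps the earlier query grouped as a unit when it is combined with further terms.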

Molecule Queries

#1  hfe                              116
#2  hfe[title] AND human[orgn]        42
#3  #2 AND biomol mrna[prop]          29
#4  #2 AND biomol genomic[prop]       13

Genomic DNA: biomol genomic[prop]

cDNA: biomol mrna[prop]

More Queries…

RefSeq status, variants: reviewed RefSeqs with transcript variants

srcdb refseq reviewed[prop] AND has transcript variants[prop]

Gene symbol: human hemochromatosis (HFE)

hfe[sym] AND human[organism]

Disease and Gene Ontology: membrane proteins linked to cancer

integral to plasma membrane[gene ontology] AND cancer[dis]

Chromosome, Links: genes on human chromosome 2 with OMIM links

2[chromosome] AND gene omim[filter] AND human[organism]

Protein name: topoisomerase genes from Archaea

topoisomerase[gene/protein name] AND archaea[organism]
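Each query above is a series of `term[field]` clauses joined by AND. The sketch below splits such a query into (term, field) pairs, which makes the structure explicit. `split_query` is a hypothetical helper that handles only flat AND-joined queries like those on this slide; it does not handle parentheses, OR, or NOT.

```python
import re


def split_query(query):
    """Return (term, field) pairs from a flat AND-joined Entrez query."""
    pairs = []
    for clause in query.split(" AND "):
        clause = clause.strip()
        # A clause is 'term[field]'; untagged terms get a field of None.
        match = re.fullmatch(r"(.+?)\[(.+)\]", clause)
        if match:
            pairs.append((match.group(1), match.group(2)))
        else:
            pairs.append((clause, None))
    return pairs


chromosome_query = split_query(
    "2[chromosome] AND gene omim[filter] AND human[organism]"
)
symbol_query = split_query("hfe[sym] AND human[organism]")
```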

Other Entrez Databases

UniSTS: markers on the Genethon map of human chromosome 12

Genethon[Map Name] AND human[organism] AND 12[chromosome]

UniGene: rat clusters that have at least one mRNA

rat[organism] NOT 0[mrna count]

Structure: structures of bacterial kinases with resolutions below 2 Å

bacteria[organism] AND kinase AND 000.00:002.00[resolution]

SNP: uniquely mapped microsatellites on human chr2

(microsat[SNP Class] AND 1[Map Weight] AND 2[Chromosome]) AND human[orgn]
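The Structure query above writes its resolution limits as a zero-padded range, `000.00:002.00`, with a colon between the lower and upper bounds. The sketch below formats such a range from two numbers. The padding width (three digits, two decimals) is inferred from that single example on the slide, and `resolution_range` is a hypothetical helper.

```python
def resolution_range(low, high):
    """Format a resolution range in angstroms as a zero-padded Entrez term."""
    if low > high:
        raise ValueError("low must not exceed high")
    # Width 6 with 2 decimals gives the 000.00 layout used on the slide.
    return f"{low:06.2f}:{high:06.2f}[resolution]"


term = resolution_range(0.0, 2.0)
query = "bacteria[organism] AND kinase AND " + term
```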

Search by Sequence

Related Sequences

Most similar

Least similar

Search by Sequence: protein

BLink (BLAST Link)

BLink Output

BLink → Multiple sequence alignment
