joseph-horton
NCBI FieldGuide
A Minimal Guide to NCBI Nucleotide Resources
Types of Databases
• Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
• Examples: GenBank, SNP, GEO
• Derivative Databases
– Built from primary data
– Content controlled by third party (NCBI)
• Examples: RefSeq, TPA, RefSNP, UniGene, GEO Datasets, NCBI Protein, Structure, Conserved Domain
Accessing the Data: Entrez
all[filter]
International Sequence Database Collaboration
[Figure: the three partner databases exchange data with one another: GenBank (NCBI, NIH; searched through Entrez), EMBL (EBI; searched through SRS), and DDBJ (CIB, NIG; searched through getentry). Each accepts submissions and updates.]
GenBank: NCBI’s Primary Sequence Database
ftp://ftp.ncbi.nih.gov/genbank/
ftp://genbank.sdsc.edu/pub
ftp://bio-mirror.net/biomirror/genbank
Release 142, June 2004
35,532,003 Records
40,325,321,348 Nucleotides
>140,000 Species
153 Gigabytes
634 files
• full release every two months
• incremental and cumulative updates daily
• available only through internet
• release notes: gbrel.txt
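The release figures above can be turned into a couple of rough per-record averages. This is only a back-of-the-envelope sketch; the variable names are illustrative and the numbers are the ones quoted for Release 142.

```python
# Figures quoted above for GenBank Release 142 (June 2004).
records = 35_532_003
nucleotides = 40_325_321_348
files = 634

# Mean sequence length per record, in base pairs.
mean_length_bp = nucleotides / records

# Mean number of records per distributed file.
records_per_file = records / files

print(f"mean record length: {mean_length_bp:.0f} bp")
print(f"records per file:   {records_per_file:,.0f}")
```

The short mean length is consistent with a database dominated by brief single-read sequences such as ESTs.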
A GenBank Record
LOCUS       NM_000588   924 bp   mRNA   linear   PRI 07-APR-2003
DEFINITION  Homo sapiens interleukin 3 (colony-stimulating factor, multiple) (IL3), mRNA.
ACCESSION   NM_000588
VERSION     NM_000588.3  GI:28416914
KEYWORDS    .
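As a minimal sketch of how the header fields of such a record could be read programmatically, the function below collects keyword and value pairs. It assumes each field starts with an uppercase keyword in the first column and that continuation lines are indented; `parse_genbank_header` is a hypothetical helper, not an NCBI tool.

```python
def parse_genbank_header(text):
    """Collect KEYWORD -> value pairs from the header of a GenBank flat file.

    Assumes keywords start in column 0 and continuation lines are indented.
    """
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():
            # Indented line: continuation of the previous field.
            if current is not None:
                fields[current] += " " + line.strip()
        else:
            keyword, _, value = line.partition(" ")
            current = keyword
            fields[current] = value.strip()
    return fields


record = """\
LOCUS       NM_000588                924 bp    mRNA    linear   PRI 07-APR-2003
DEFINITION  Homo sapiens interleukin 3 (colony-stimulating factor,
            multiple) (IL3), mRNA.
ACCESSION   NM_000588
VERSION     NM_000588.3  GI:28416914
KEYWORDS    .
"""

header = parse_genbank_header(record)
# VERSION holds "accession.version" followed by the GI number.
accession, version_number = header["VERSION"].split()[0].split(".")
print(accession, version_number)
print(header["DEFINITION"])
```

The VERSION line is the one to rely on when a specific sequence revision matters, since the bare accession stays the same across revisions.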
GenBank Record: Feature Table
/protein_id="NP_000579.2"
/db_xref="GI:28416915"
GenPept identifiers
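Feature-table qualifiers follow a `/name="value"` pattern, so the GenPept identifiers can be pulled out with a regular expression. A minimal sketch, assuming straight double quotes and values without embedded quotes; the pattern name is illustrative.

```python
import re

# Matches /name="value" qualifiers (assumes no quotes inside the value).
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')

feature_text = '/protein_id="NP_000579.2" /db_xref="GI:28416915"'
qualifiers = dict(QUALIFIER.findall(feature_text))

print(qualifiers["protein_id"])
print(qualifiers["db_xref"])
```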
GenBank Record, Cont’d
Sequence Revision History
NM_000588
Sequence Revision History: choose records
Display and Save Options
FASTA format (NCBI)
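For orientation, an NCBI-style FASTA record of this period has a definition line of the form `>gi|number|db|accession.version| title`, followed by sequence lines. The sketch below reads such text and splits the definition line. The sequence in the sample is a made-up placeholder, not the real NM_000588 sequence, and both function names are hypothetical.

```python
def read_fasta(text):
    """Yield (defline, sequence) pairs from FASTA-formatted text."""
    defline, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if defline is not None:
                yield defline, "".join(chunks)
            defline, chunks = line[1:], []
        else:
            chunks.append(line)
    if defline is not None:
        yield defline, "".join(chunks)


def split_ncbi_defline(defline):
    """Split a 'gi|number|db|accession.version| title' defline into parts."""
    ids, _, title = defline.partition(" ")
    parts = ids.split("|")
    return {"gi": parts[1], "db": parts[2], "accession": parts[3], "title": title}


# Placeholder sequence for illustration only.
sample = """\
>gi|28416914|ref|NM_000588.3| Homo sapiens interleukin 3 (IL3), mRNA
ACGTACGTAC
GGTTAACC
"""

defline, sequence = next(read_fasta(sample))
parts = split_ncbi_defline(defline)
print(parts["accession"], len(sequence))
```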
Abstract Syntax Notation: ASN.1
FASTA Nucleotide
FASTA Protein
GenPept
GenBank
ASN.1
Bulk Divisions
• Expressed Sequence Tag
– 1st pass single read cDNA
• Genome Survey Sequence
– 1st pass single read gDNA
• High Throughput Genomic
– incomplete sequences of genomic clones
• Sequence Tagged Site
– PCR-based mapping reagents
• Batch submissions (email and ftp)
• Inaccurate
• Poorly characterized
NCBI’s Derivative Sequence Databases
Primary vs. Derivative Databases
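[Figure: GenBank (primary) receives submissions from sequencing centers and labs, is organized into divisions (EST, STS, GSS, HTG, PRI, ROD, PLN, MAM, BCT, INV, VRT, PHG, VRL), and is updated ONLY by submitters. The derivative databases built from it (UniGene, UniSTS, and RefSeq via the LocusLink and Genomes pipelines and the Annotation pipeline) are produced by algorithms and curators and are updated continually by NCBI.]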
Why Make Reference Sequences?
Entrez Protein query:
topoisomerase II alpha[title] AND human[organism]
[Figure: the records returned by this query; labels: AAC77388, P11388, splice variant (three records), Δ = 5 aa, RefSeq protein]
RefSeq Benefits
• non-redundant, best representative
• updates to reflect current sequence data and biology
• distinct, stable accession series
genomes, transcripts, proteins
Reference Sequence: RefSeq
Accession         Sequence Type
NM_123456789      mRNA
NP_123456789      protein, from NM_
NR_123456         non-coding RNA
XM_123456         predicted mRNA
XP_123456         predicted protein
XR_123456         predicted non-coding RNA
ZP_12345678       predicted from NZ_
NC_123456         genomic, e.g., chromosomes
NG_123455         genomic, incomplete region
NT_123456         genomic, BAC assembly
NW_123456         genomic, WGS assembly
NZ_ABCD12345678   genomic, WGS collection
Key: blue = curated RefSeq
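Because the two-letter prefix carries the sequence type, an accession can be classified with a simple lookup. The mapping below copies the table on this slide; `refseq_type` is a hypothetical helper name.

```python
# Prefix -> sequence type, as listed on the slide.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein, from NM_",
    "NR": "non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "ZP": "predicted from NZ_",
    "NC": "genomic, e.g., chromosomes",
    "NG": "genomic, incomplete region",
    "NT": "genomic, BAC assembly",
    "NW": "genomic, WGS assembly",
    "NZ": "genomic, WGS collection",
}


def refseq_type(accession):
    """Return the sequence type for a RefSeq accession such as NM_000588.3."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore or prefix not in REFSEQ_PREFIXES:
        raise ValueError(f"not a recognised RefSeq accession: {accession!r}")
    return REFSEQ_PREFIXES[prefix]


print(refseq_type("NM_000588.3"))
print(refseq_type("NP_000579.2"))
```

An accession without the underscore, such as a GenBank accession like AAC77388, is rejected rather than guessed at.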
RefSeq Status Codes
REVIEWED: by NCBI staff or by a collaborator. Some RefSeq records may incorporate expanded sequence and annotation information including additional publications and features.
VALIDATED: in an initial review to provide the preferred sequence standard; not yet subjected to final review at which time additional functional information may be provided.
PROVISIONAL: the record has not yet been subject to individual review and is thought to be well supported and to represent a valid transcript and protein.
PREDICTED: may represent an ab initio prediction or may be partially supported by other transcript data; the protein is predicted.
INFERRED: by genome sequence analysis.
MODEL: provided via automated processing and not subjected to individual review or revision between builds.
Third Party Annotation (TPA) Database
• Annotations of existing GenBank sequences
• Allows for community annotation of genomes
• Direct submissions
– BankIt
– Sequin
Other Databases at the NCBI
• dbSNP: nucleotide polymorphisms
• GEO (Gene Expression Omnibus): microarray and other expression data
• GEO DataSets: curated reports of GEO data; collections of biologically and mathematically comparable GEO Samples
• Structure: imported structures (PDB), Cn3D viewer, NCBI curation
• CDD (conserved domain database): protein families (COGs and KOGs), single domains (PFAM, SMART, CD)
NCBI’s SNP Database
• Primary and derivative (RefSNP)
• Single nucleotide polymorphisms
• Repeat polymorphisms
• Insertion-deletion polymorphisms
• 24 Species
• Over 11 million refSNPs (rsXXXXXXX)
RefSNP
• Non-redundant
• Computational Analysis
BLAST hits to genome, mRNA, protein
Using Entrez
An integrated database search and retrieval system
Entrez: Database Integration
[Figure: linked Entrez databases: PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure, Genomes, Taxonomy. Links between them are labeled Word weight, VAST, BLAST, and Phylogeny.]
Home Page: Global Entrez Portal
hfe
Global Entrez Search: HFE
Entrez Nucleotide: HFE
218 records
Not HFE [Title]
Smarter Query
hfe[title] AND human[orgn]
39 records
Curated HFE splice variants (11 total)
hfe[title] AND human[orgn] (cont’d)
Primary data
Finding Primary Sequences
• Entrez Nucleotide
99+% GenBank (primary data)
– srcdb ddbj/embl/genbank[properties] = 39,849,856 records
<1% RefSeq (curated data)
– srcdb refseq[properties] = 304,945 records
• Useful search terms in [Properties]:
– srcdb : source database (e.g., srcdb genbank[prop])
– gbdiv : GenBank division (e.g., gbdiv est[prop])
– biomol : biomolecule type (e.g., biomol mrna[prop])
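The "99+%" and "<1%" figures can be checked directly from the two record counts. This sketch assumes the two `srcdb` sets are disjoint and together make up all of Entrez Nucleotide, which the slide implies but does not state.

```python
genbank_records = 39_849_856  # srcdb ddbj/embl/genbank[properties]
refseq_records = 304_945      # srcdb refseq[properties]

# Assumption: the two sets do not overlap and cover the whole database.
total = genbank_records + refseq_records
genbank_share = 100 * genbank_records / total
refseq_share = 100 * refseq_records / total

print(f"GenBank: {genbank_share:.2f}%  RefSeq: {refseq_share:.2f}%")
```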
Database Queries
#1  hfe                                     116
#2  hfe[title] AND human[orgn]               42
#3  #2 AND srcdb refseq[prop]                11
#4  #2 AND srcdb ddbj/embl/genbank[prop]     31
#5  #2 AND gbdiv pri[prop]                   29
#6  #2 AND gbdiv est[prop]                    2
Primate division: gbdiv pri[prop]
EST division: gbdiv est[prop]
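History references such as `#2` stand for an earlier query, so each row can be expanded into a standalone query string. The sketch below does that expansion and checks that the RefSeq and primary subsets add up to their parent query. The rows are numbered 1 to 6 here in order of appearance, and `expand` and `count` are hypothetical helpers, not Entrez functions.

```python
import re

# Search history from the slide: number -> (query, record count).
HISTORY = {
    1: ("hfe", 116),
    2: ("hfe[title] AND human[orgn]", 42),
    3: ("#2 AND srcdb refseq[prop]", 11),
    4: ("#2 AND srcdb ddbj/embl/genbank[prop]", 31),
    5: ("#2 AND gbdiv pri[prop]", 29),
    6: ("#2 AND gbdiv est[prop]", 2),
}


def expand(number):
    """Replace every #N reference with the parenthesised query it names."""
    query, _count = HISTORY[number]
    return re.sub(r"#(\d+)", lambda m: "(" + expand(int(m.group(1))) + ")", query)


def count(number):
    """Return the record count reported for a history entry."""
    return HISTORY[number][1]


print(expand(3))
# Curated (RefSeq) plus primary (DDBJ/EMBL/GenBank) should equal query #2.
print(count(3) + count(4) == count(2))
```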
Molecule Queries
#1  hfe                            116
#2  hfe[title] AND human[orgn]      42
#3  #2 AND biomol mrna[prop]        29
#4  #2 AND biomol genomic[prop]     13
Genomic DNA: biomol genomic[prop]
cDNA: biomol mrna[prop]
More Queries…
RefSeq status, variants: reviewed RefSeqs with transcript variants
srcdb refseq reviewed[prop] AND has transcript variants[prop]
Gene symbol: human hemochromatosis (HFE)
hfe[sym] AND human[organism]
Disease and Gene Ontology: membrane proteins linked to cancer
integral to plasma membrane[gene ontology] AND cancer[dis]
Chromosome, Links: genes on human chromosome 2 with OMIM links
2[chromosome] AND gene omim[filter] AND human[organism]
Protein name: topoisomerase genes from Archaea
topoisomerase[gene/protein name] AND archaea[organism]
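Queries like these contain brackets and spaces, so they need URL encoding if they are ever sent outside the web form. The sketch below only builds a URL and sends nothing. The slides do not mention programmatic access: the E-utilities ESearch endpoint and the `gene` database name are assumptions added here, and `esearch_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: the public E-utilities ESearch endpoint (not part of the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch URL for an Entrez query string."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


term = "hfe[sym] AND human[organism]"
url = esearch_url("gene", term)

# Decoding the query string recovers the original term unchanged.
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["term"][0])
```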
Other Entrez Databases
UniSTS: markers on the Genethon map of human chromosome 12
Genethon[Map Name] AND human[organism] AND 12[chromosome]
UniGene: rat clusters that have at least one mRNA
rat[organism] NOT 0[mrna count]
Structure: structures of bacterial kinases with resolutions below 2 Å
bacteria[organism] AND kinase AND 000.00:002.00[resolution]
SNP: uniquely mapped microsatellites on human chr2
(microsat[SNP Class] AND 1[Map Weight] AND 2[Chromosome]) AND human[orgn]
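The Structure example writes its resolution range with zero-padded bounds (`000.00:002.00`). A small formatter for that pattern is sketched below. The fixed width of three integer digits and two decimals is inferred from the single example on the slide, and `resolution_range` is a hypothetical helper.

```python
def resolution_range(low, high):
    """Format a resolution range in angstroms as a zero-padded range term.

    Assumes bounds are written as three integer digits and two decimals,
    as in the slide's example.
    """
    if not 0 <= low <= high < 1000:
        raise ValueError("expected 0 <= low <= high < 1000")
    return f"{low:06.2f}:{high:06.2f}[resolution]"


query = " AND ".join(["bacteria[organism]", "kinase", resolution_range(0, 2)])
print(query)
```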
Search by Sequence
Related Sequences
Most similar
Least similar
Search by Sequence: protein
BLink (BLAST Link)
BLink Output
BLink → Multiple sequence alignment