A Field Guide to GenBank and NCBI Molecular Biology Resources
slightly modified from
Peter Cooper ftp://ftp.ncbi.nih.gov/pub/cooper/FieldGuide/
Eric Sayers ftp://ftp.ncbi.nih.gov/pub/sayers/Field_Guide/U_Penn/
• About NCBI
• NCBI Sequence Databases
  – Primary Database – GenBank
  – Derivative Databases - RefSeq
• Entrez Databases and Text Searching
• BLAST Services
• Genomic Resources
NCBI Resources
The National Center for Biotechnology Information (NCBI)
• Created as a part of NLM in 1988
  – Establish public databases
  – Perform research in computational biology
  – Develop software tools for sequence analysis
  – Disseminate biomedical information
• Tools: BLAST (1990), Entrez (1992)
• GenBank (1992)
• Free MEDLINE (PubMed, 1997)
• Human genome (2001)
NCBI Home Page
http://www.ncbi.nlm.nih.gov
To learn more, visit the “Site Map” and “About NCBI” web pages
About NCBI
Some NCBI Statistics
[Figure: Growth of GenBank, 1982 to 2002. Two series plotted by year: Base Pairs of DNA (millions, axis 0 to 30,000) and Sequences (millions, axis 0 to 23).]
[Figure: Users per day, 1997 to 2001 (axis 0 to 250,000). An annotation marks Christmas Day.]
Molecular Databases
• Primary Databases
  – Original submissions by experimentalists
  – Database staff organize but don’t add additional information
    • Example: GenBank
• Derivative Databases
  – Human curated
    • compilation and correction of data
    • Example: SWISS-PROT, NCBI RefSeq mRNA
  – Computationally Derived
    • Example: UniGene
  – Combinations
    • Example: NCBI Genome Assembly
What is GenBank? NCBI’s Primary Sequence Database
• Nucleotide-only sequence database
• GenBank Data
  – Direct submissions of individual records (BankIt, Sequin)
  – Batch submissions via email (EST, GSS, STS)
  – ftp accounts established for sequencing centers
• Data shared amongst three collaborating databases:
  – GenBank
  – DNA Database of Japan (DDBJ)
  – European Molecular Biology Laboratory Database (EMBL)
The International Nucleotide Sequence Database Collaboration
[Diagram: GenBank, EMBL, and DDBJ exchange data with one another; each receives submissions and updates. Other labels in the diagram: NCBI, NIH, EBI, CIB, NIG, Entrez, SRS, getentry, Sequin, BankIt, ftp.]
GenBank: NCBI’s Primary Sequence Database
• full release every two months
• incremental and cumulative updates daily
• available only through internet
ftp://ftp.ncbi.nih.gov/genbank/
Release 133, December 2002
22,318,883 Records
28,507,990,166 Nucleotides
110,000+ Species
>90 Gigabytes of data
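As a small worked example of what the Release 133 figures imply, the sketch below divides the quoted nucleotide count by the quoted record count to get a mean record length. The function name is made up for illustration; only the two totals come from the slide.

```python
# Totals as quoted on the slide for GenBank Release 133 (December 2002).
RECORDS = 22_318_883
NUCLEOTIDES = 28_507_990_166


def mean_record_length(total_bases: int, total_records: int) -> float:
    """Return the average number of bases per record."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return total_bases / total_records


average = mean_record_length(NUCLEOTIDES, RECORDS)
print(f"mean record length: about {average:.0f} bases")
```

A mean this short is consistent with the later slide showing that most records are ESTs and other bulk-division sequences.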
Entrez Nucleotide
GenBank 71%
DDBJ 19%
EMBL 9%
RefSeq 1%
23,464,770 records
Primary vs. Derivative Databases
[Diagram: sequence fragments from Labs and Sequencing Centers flow into GenBank; Curators and Algorithms derive RefSeq, UniGene, and the Genome Assembly from those records.]
Traditional GenBank Divisions
BCT  Bacterial and Archaeal
INV  Invertebrate
MAM  Mammalian (ex. ROD and PRI)
PHG  Phage
PLN  Plant and Fungal
PRI  Primate
ROD  Rodent
SYN  Synthetic (cloning vectors)
VRL  Viral
VRT  Other Vertebrate
• Direct Submissions (Sequin and BankIt)
• Accurate
• Well characterized
A Traditional GenBank Record
[Annotated record. Labeled fields: Locus Field, Molecule Type, GenBank Division, Modification Date, Definition Line, Accession Number, Version, GI (GenInfo), Keywords, Taxonomy.]
A Traditional GenBank Record
Bulk Sequence Divisions of GenBank
EST  Expressed Sequence Tag
STS  Sequence Tagged Site
GSS  Genome Survey Sequence
HTG  High Throughput Genomic
HTC  High Throughput cDNA
• Batch Submissions (email and ftp)
• Inaccurate
• Poorly Characterized
Organization of GenBank
23,087,196 records
EST 67%
GSS 19%
Traditional 8%
PAT 4%
STS, HTG, HTC 2%
11 Traditional Divisions
5 Bulk Divisions
1 Patent Division
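The three groups of divisions can be captured in a small lookup. This is only a sketch: the function name and the returned labels are invented here, and the traditional set holds just the ten codes spelled out on the earlier slide, even though this slide counts eleven.

```python
# Division codes as listed on the slides; the grouping labels are assumptions.
TRADITIONAL = {"BCT", "INV", "MAM", "PHG", "PLN", "PRI", "ROD", "SYN", "VRL", "VRT"}
BULK = {"EST", "STS", "GSS", "HTG", "HTC"}
PATENT = {"PAT"}


def division_group(code: str) -> str:
    """Classify a GenBank division code as traditional, bulk, or patent."""
    code = code.strip().upper()
    if code in TRADITIONAL:
        return "traditional"
    if code in BULK:
        return "bulk"
    if code in PATENT:
        return "patent"
    return "unknown"


for example in ("PLN", "est", "PAT"):
    print(example, "->", division_group(example))
```

Following the slides, a "bulk" answer is a hint that the record came from a batch submission and may be less accurate and less well characterized than a traditional record.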
What is UniGene?
A gene-oriented view of sequence entries
• MegaBlast-based automated sequence clustering
• Nonredundant set of gene-oriented clusters
• Each cluster represents a unique gene
• Provides information on tissue-specific expression and map locations
• Includes well-characterized genes and novel ESTs
• Useful for gene discovery and selection of mapping reagents
Organisms Represented in UniGene
Genome Sequencing
[Diagram of the sequencing workflow. Labels: Whole BAC insert (or genome); shredding; cloning, isolating; sequencing (GSS division or trace archive); assembly; Draft Sequence (HTG division).]
Working Draft Sequence
[Diagram: one record shown at three phases, with gaps marked in the earlier phases.]
phase 1  HTG  Acc = AC109609.1
phase 2  HTG  Acc = AC109609.6
phase 3  ROD  Acc = AC109609.10
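The accessions above keep the same base (AC109609) while the version suffix changes. A minimal sketch of splitting the two parts follows; the helper names are hypothetical. It also shows why the version should be compared as a number: as plain text, ".6" sorts after ".10".

```python
def split_accession(accession_version: str) -> tuple[str, int]:
    """Split 'AC109609.10' into ('AC109609', 10)."""
    accession, dot, version = accession_version.partition(".")
    if not dot or not version.isdigit():
        raise ValueError(f"expected ACCESSION.VERSION, got {accession_version!r}")
    return accession, int(version)


def latest_version(records: list[str]) -> str:
    """Return the record with the highest numeric version."""
    return max(records, key=lambda record: split_accession(record)[1])


history = ["AC109609.1", "AC109609.6", "AC109609.10"]
print(latest_version(history))  # numeric comparison picks AC109609.10
print(max(history))             # plain string comparison picks AC109609.6
```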
HTG Division: High Throughput Genome
NCBI’s Third Party Annotation (TPA) Database
• NCBI now accepts the submission of new annotations of existing GenBank sequences;
• Facilitates the annotation of genomes by experts;
NEW
A Sample TPA record
RefSeq: NCBI’s Derivative Sequence Database
• Curated transcripts and proteins
  – reviewed
  – human, mouse, rat, fruit fly, zebrafish, arabidopsis
• Human model transcripts and proteins
• Assembled Genomic Regions (contigs)
  – draft human genome
  – mouse genome
• Chromosome records
  – Microbial
  – viral
  – organelle
The RefSeq Accession Numbers
mRNAs and Proteins
NM_123456  Curated mRNA
NP_123456  Curated Protein
NR_123456  Curated non-coding RNA
XM_123456  Predicted Transcript (human, mouse)
XP_123456  Predicted Protein (human, mouse)
XR_123456  Predicted non-coding RNA
Gene Records
NG_123456  Reference Genomic Sequence (human)
Assemblies
NT_123456  Contig (Mouse and Human)
NW_123456  Supercontig (Mouse)
NC_123456  Chromosome (Microbial, Viral, Arabidopsis)
NR_123456  Interim Identifier for Microbial Chromosomes
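Because the two-letter prefix encodes the record type, a lookup table is enough to label an accession. The sketch below is illustrative only: the function name is made up, and since the slide lists NR_ twice, the table keeps just the first meaning (curated non-coding RNA).

```python
# Prefix meanings copied from the slide; NR_ appears twice there, so only
# its first listed meaning is kept in this sketch.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated Protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted Transcript",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference Genomic Sequence",
    "NT": "Contig",
    "NW": "Supercontig",
    "NC": "Chromosome",
}


def describe_refseq(accession: str):
    """Return the record type for a RefSeq-style accession, or None."""
    prefix, underscore, _ = accession.strip().partition("_")
    if not underscore:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())


print(describe_refseq("NM_123456"))
print(describe_refseq("U18238.1"))
```

The second call returns None because U18238.1 is a GenBank accession without a RefSeq prefix.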
human, mouse, rat, fruit fly, zebrafish, Arabidopsis
Curated RefSeq Records: NM_, NP_
Entrez: Linking and Neighboring
The Entrez Databases
The (ever) Expanding Entrez System
Nucleotide
Protein
Structure
PubMed
PopSet
Genome
OMIM
Taxonomy
Books
ProbeSet
3D Domains
UniSTS
SNP
CDD
Entrez
UniGene
Journals
PubMedCentral
Entrez Nucleotides
Search term: glucose 6 phosphate dehydrogenase
Document Summaries: glucose 6 phosphate dehydrogenase[All Fields] = 748 hits
Entrez Nucleotides: Limits
Search fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume
Entrez Nucleotides: Preview/Index
Adding Terms: Preview/Index
Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, . . .
Plant G6PD mRNAs
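The search slides show a term qualified by a field name in square brackets, as in glucose 6 phosphate dehydrogenase[All Fields], and Preview/Index is used to add further terms. The sketch below only builds such a query string; it does not contact Entrez. The helper names and the organism value are assumptions for illustration, not the query used in the slides.

```python
def field_term(text: str, field: str) -> str:
    """Qualify a search term with an Entrez field name, e.g. 'x[Organism]'."""
    return f"{text}[{field}]"


def combine(terms: list[str], operator: str = "AND") -> str:
    """Join already-qualified terms with a Boolean operator."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(terms)


query = combine([
    field_term("glucose 6 phosphate dehydrogenase", "All Fields"),
    field_term("Viridiplantae", "Organism"),  # hypothetical organism restriction
])
print(query)
# glucose 6 phosphate dehydrogenase[All Fields] AND Viridiplantae[Organism]
```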
Display: Formats, Links, and Neighbors
Summary, Brief, ASN.1, FASTA, XML, GenBank, GI list, LinkOut, Nucleotide Neighbors, Genome Links, ProbeSet Links, OMIM Links, PopSet Links, Protein Links, PubMed Links, SNP Links, Structure Links, Taxonomy Links, UniSTS Links
>gi|603218|gb|U18238.1|MSU18238 Medicago sativa glucose-6-phosphate dehyd
CCACCAGATATAATTAAGTAGATCAGAGTAGAAGAAGATGGGAACAAATGAATGGCATGTAGAAAGAAGAGATAGCATAGGTACTGAATCTCCTGTAGCAAGAGAGGTACTTGAAACTGGCACACTCTCTATTGTTGTGCTTGGTGCTTCTGGTGATCTTGCCAAGAAGAAGACTTTTCCTGCACTTTTTCACTTATATAAACAGGAATTGTTGCCACCTGATGAAGTTCACATTTTTGGCTATGCAAGGTCAAAGATCTCCGATGATGAATTGAGAAACAAATTGCGTAGCTATCTTGTTCCAGAGAAAGGTGCTTCTCCTAAACAGTTAGATGATGTATCAAAGTTTTTACAATTGGTTAAATATGTAAGTGGCCCTTATGATTCTGAAGATGGATTTCGCTTGTTGGATAAAGAGATTTCAGAGCATGAATATTTGAAAAATAGTAAAGAGGGTTCATCTCGGAGGCTTTTCTATCTTGCACTTCCTCCTTCAGTGTATCCATCCGTTTGCAAGATGATCAAAACTTGTTGCATGAATAAATCTGATCTTGGTGGATGGACACGCGTTGTTGTTGAGAAACCCTTTGGTAGGGATCTAGAATCTGCAGAAGAACTCAGTACTCAGATTGGAGAGTTATTTGAAGAACCACAGATTTATCGTATTGATCACTATTTAGGAAAGGAACTAGTGCAAAACATGTTAGTACTTCGTTTTGCAAATCGGTTCTTCTTGCCTCTGTGGAACCACAACCACATTGACAATGTGCAGATAGTATTTAGAGAGGATTTTGGAACTGATGGTCGTGGTGGATATTTTGACCAATATGGAATTATCCGAGATATCATTCCAAACCATCTGTTGCAGGTTCTTTGCTTGATTGCTATGGAAAAACCCGTTTCTCTCAAGCCTGAGCACATTCGAGATGAGAAAGTGAAGGTTCTTGAATCAGTACTCCCTATTAGAGATGATGAAGTTGTTCTTGGACAATATGAAGGCTATACAGATGACCCAACTGTACCGGACGATTCAAACACCCCGACTTTTGCAACTACTATTCTGCGGATACACAATGAAAGATGGGAAGGTGTTCCTTTCATTGTGAAAGCAGGGAAGGCCCTAAATTCTAGGAAGGCAGAGATTCGGGTTCAATTCAAGGATGTTCCTGGTGACATTTTCAGGAGTAAAAAGCAAGGGAGAAACGAGTTTGTTATCCGCCTACAACCTTCAGAAGCTATTTACATGAAGCTTACGGTCAAGCAACCTGGACTGGAAATGTCTGCAGTTCAAAGTGAACTAGACTTGTCATATGGGCAACGATATCAAGGGATAACCATTCCAGAGGCTTATGAGCGTCTAATTCTCGACACAATTAGAGGTGATCAACAACATTTTGTTCGCAGAGACGAATTAAAGGCATCATGGCAAATATTCACACCACTTTTACACAAAATTGATAGAGGGGAGTTGAAGCCGGTTCCTTACAACCCGGGAAGTAGAGGTCCTGCAGAAGCAGATGAGTTATTAGAAAAAGCTGGATATGTTCAAACACCCGGTTATATATGGATTCCTCCTACCTTATAGAGTGACCAAATTTCATAATAAAACAAGGATTAGGATTATCAGGAGCTTATAAATAAGTCTTCAATAAGCTTGTGAAATTTTCGTTATAATCTCTCTCATTTTGGGGTGTATATCAAGCATTTAAGCGCGTGTTTGACACAGTTTGTGTAATAGATTTGGCTCTGAATGAAAATAAACGGGAATTGTTTCTTTTTGTTTTA
FASTA definition line
>gi|603218|gb|U18238.1|MSU18238
Fields, left to right: gi number (603218), database identifier (gb), accession number (U18238.1), locus name (MSU18238)
Database identifiers
gb   GenBank
emb  EMBL
dbj  DDBJ
sp   SWISS-PROT
pdb  Protein Databank
pir  PIR
prf  PRF
ref  RefSeq
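The definition-line layout described above can be taken apart with a few string splits. The sketch below assumes exactly the five-part gi|number|database|accession|locus layout shown on the slide and rejects anything else; the class and function names are invented for this example.

```python
from typing import NamedTuple

# Database identifiers as listed on the slide.
DATABASE_TAGS = {
    "gb": "GenBank", "emb": "EMBL", "dbj": "DDBJ", "sp": "SWISS-PROT",
    "pdb": "Protein Databank", "pir": "PIR", "prf": "PRF", "ref": "RefSeq",
}


class Defline(NamedTuple):
    gi: int
    database: str
    accession: str
    locus: str
    description: str


def parse_defline(line: str) -> Defline:
    """Parse a '>gi|number|db|accession|locus description' line."""
    if not line.startswith(">"):
        raise ValueError("a FASTA definition line starts with '>'")
    identifier, _, description = line[1:].strip().partition(" ")
    parts = identifier.split("|")
    if len(parts) != 5 or parts[0] != "gi":
        raise ValueError(f"unexpected identifier layout: {identifier!r}")
    _, gi, tag, accession, locus = parts
    # Unknown tags are passed through unchanged rather than guessed at.
    database = DATABASE_TAGS.get(tag, tag)
    return Defline(int(gi), database, accession, locus, description.strip())


record = parse_defline(
    ">gi|603218|gb|U18238.1|MSU18238 Medicago sativa glucose-6-phosphate dehyd"
)
print(record.gi, record.database, record.accession, record.locus)
```

The description is kept exactly as it appears after the first space, including the truncated final word from the slide.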
Entrez Genome
Organism Pages
The Map Viewer: a common platform for integrated display
The Map Viewer
Entrez PubMed
Online Books
Entrez Specialized Databases
Taxonomy: Searchable taxonomic tree having nodes for all species with records in an Entrez database
OMIM: Online Mendelian Inheritance in Man, a database of genetically linked human diseases
ProbeSet: Expression data (GEO) and microarray datasets
Entrez Taxonomy
Entrez OMIM
Entrez ProbeSet
Trace Archive
Entrez Structure
Structure Summary
Cn3D viewer
Related Structures
Conserved Domains
Cn3D: Displaying Structures
Structural Alignment