Cédric Notredame (19/04/23)
Our Scope
Give you means to answer simple questions
Databases are UNFRIENDLY INFORMATION DESKS
Give you an idea of what is possible
WHAT can you ask ?
HOW can you ask it ?
Outline
- An Overall view
- Asking a biological question to a database
- Turning a question into a query
- Bibliographic Databases: Medline, OMIM
- Gene Databases: GenBank, LocusLink, ENSEMBL
- Protein Databases: SwissProt, InterPro, Prodom
- SRS
DataBase Entries
1 entry = 1 Sequence: AGCTGTCGAGGGATAGGACATATACATAAATTAATATAAT
1 entry = 1 File = Sequence (SEQ) + Doc (DOC) = Flat File
Database = Collection of Flat Files
[Figure: a stack of flat files, each made of a SEQ part and a DOC part]
DataBase Entries: Flat Files
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: DEA=oct-nov-dec 2002
http://www.expasy.org/people/amos.html
//
Accession number: 2
First Name: Laurent
Last Name: Falquet
Course: EMBnet=sept 2000, sept 2001; DEA=oct-nov-dec 2000;
//
Accession number: 3
First Name: Marie-Claude
Last Name: Blatter Garin
Course: EMBnet=sept 2000, sept 2001; DEA=oct-nov-dec 2000;
http://www.expasy.org/people/Marie-Claude.Blatter-Garin.html
//
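The flat-file layout above (one `Key: value` pair per line, each entry closed by `//`) can be read with a few lines of code. This is only a minimal sketch: the function name `parse_flat_file` and the choice to store a bare link under a `URL` key are assumptions made for the illustration, not part of any real database format.

```python
def parse_flat_file(text):
    """Split a flat file into entries; each entry becomes a dict of fields."""
    entries = []
    current = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "//":                      # end-of-entry marker
            if current:
                entries.append(current)
            current = {}
        elif line.startswith(("http://", "https://")):
            current["URL"] = line             # assumption: bare links go into a URL field
        else:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    if current:                               # tolerate a missing final //
        entries.append(current)
    return entries


FLAT_FILE = """\
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: DEA=oct-nov-dec 2002
http://www.expasy.org/people/amos.html
//
Accession number: 2
First Name: Laurent
Last Name: Falquet
Course: EMBnet=sept 2000, sept 2001; DEA=oct-nov-dec 2000;
//
"""

entries = parse_flat_file(FLAT_FILE)
for entry in entries:
    print(entry["Accession number"], entry["First Name"], entry["Last Name"])
```

Every field stays plain text, which is why a flat file is easy to read by eye but needs an index before it can be searched quickly.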
DataBase: Relational Databases
Relational database (« table file »):
Teacher     Accession number   Education
Amos        1                  Biochemistry
Laurent     2                  Biochemistry
M-Claude    3                  Biochemistry

Course      Date                    Involved teachers
DEA         Oct-nov-dec 2000        1,3
EMBnet      Sept 2000, Sept 2001    2,3
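The two tables of this slide can be tried out with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the table and column names are invented for the example, and the "Involved teachers" cell (1,3) is split into a separate link table `course_teacher`, which is one common way to store such a list but not something the slide specifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")            # throw-away in-memory database
conn.executescript("""
CREATE TABLE teacher (accession INTEGER PRIMARY KEY, name TEXT, education TEXT);
CREATE TABLE course (name TEXT PRIMARY KEY, date TEXT);
CREATE TABLE course_teacher (course TEXT, accession INTEGER);
""")
conn.executemany("INSERT INTO teacher VALUES (?, ?, ?)", [
    (1, "Amos", "Biochemistry"),
    (2, "Laurent", "Biochemistry"),
    (3, "M-Claude", "Biochemistry"),
])
conn.executemany("INSERT INTO course VALUES (?, ?)", [
    ("DEA", "Oct-nov-dec 2000"),
    ("EMBnet", "Sept 2000, Sept 2001"),
])
# Assumption: one row per (course, teacher) pair instead of a "1,3" text cell.
conn.executemany("INSERT INTO course_teacher VALUES (?, ?)", [
    ("DEA", 1), ("DEA", 3), ("EMBnet", 2), ("EMBnet", 3),
])


def teachers_of(course_name):
    """Follow the accession numbers from a course to the teacher table."""
    rows = conn.execute(
        "SELECT t.name FROM teacher t "
        "JOIN course_teacher ct ON ct.accession = t.accession "
        "WHERE ct.course = ? ORDER BY t.accession",
        (course_name,),
    ).fetchall()
    return [name for (name,) in rows]


print(teachers_of("DEA"))
```

The accession number plays the role of the cross-reference: the course table never repeats a teacher's details, it only points to them.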
To Summarize: What’s a database ?
Collection of Data that is:
•Structured
•Searchable (index) -> table of contents
•Updated periodically (release) -> new edition
•Cross-referenced (hyperlinks) -> links with other db
Collection of tools (software) necessary for:
•Searching
•Updating
•Releasing
Data storage management: flat files, relational databases…
A large amount of information
More than 1000 different databases
Generally accessible through the web
EBI: http://www.ebi.ac.uk/
NCBI: http://www.ncbi.nlm.nih.gov
Google: http://www.google.com
Variable size: <100 Kb to >10 Gb
DNA: >10 Gb
Protein: 1 Gb
3D structure: 5 Gb
Other: smaller
Update frequency: daily to annually
A Non Exhaustive List
AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc. !!!!
There exists a specialized database on almost anything you can think of
What’s on the Menu: The Art of Eating Well
Always Use Fresh Data: The Latest Update of your DataBase
Make Sure The DataBase is Maintained: Many Databases are poorly maintained
Treat DataBases like Publications: Some Journals are Better than Others
Searching Databases
There are 2 ways to search databases:
Text based queries (on the DOC part): Medline, Entrez
Search for « Smith AND dUTPase »
Similarity searches (on the SEQ part): BLAST
AGCTGTCGAGGGATAGGACATATACATAAATTAATATAAT
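The difference between the two search styles can be sketched on a toy collection. Everything here is an assumption made for the illustration: the entries are made up, and the similarity score (fraction of shared 4-letter words) is only a crude stand-in, since BLAST relies on a far more elaborate alignment procedure.

```python
def text_search(entries, *words):
    """Return entries whose DOC part contains every query word (AND query)."""
    wanted = [w.lower() for w in words]
    return [e for e in entries if all(w in e["doc"].lower() for w in wanted)]


def kmers(seq, k=4):
    """Set of all overlapping words of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity_search(entries, query, k=4):
    """Rank entries by the fraction of query k-mers found in their SEQ part."""
    query_words = kmers(query, k)
    scored = []
    for e in entries:
        shared = len(query_words & kmers(e["seq"], k))
        scored.append((shared / len(query_words), e["id"]))
    return sorted(scored, reverse=True)       # best score first


# Made-up entries, for illustration only.
ENTRIES = [
    {"id": "X1", "doc": "Smith J. dUTPase from a bacterium",
     "seq": "AGCTGTCGAGGGATAGGACATATACAT"},
    {"id": "X2", "doc": "Smith J. unrelated kinase",
     "seq": "TTTTTTTTCCCCCCCCGGGGGGGG"},
    {"id": "X3", "doc": "Jones A. dUTPase homologue",
     "seq": "AGCTGTCGAGGGATAGGTCATATACAT"},
]

print([e["id"] for e in text_search(ENTRIES, "Smith", "dUTPase")])
print(similarity_search(ENTRIES, "AGCTGTCGAGGGATAGGACATATACAT"))
```

The text query only finds what the annotator wrote down; the similarity query also ranks X3 highly although its DOC part never mentions Smith.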
Searching Databases
Each database is a little kingdom…
Has its own query system
Has its own information structure
The main databases are well documented and this documentation is available online
Most databases can be searched using SRS or Entrez
Databases: Asking the right Question
Databases ARE NOT meant for browsing
When you search a Database you must have an idea of what your Needle-in-a-hay-stack looks like
Databases: Asking the right Question
Browsing a database is like using your phone book in place of a dating agency…
Databases: Asking the right Question
Finding Data: Database Search
Finding Questions: Data Mining
The Kind Of Questions We Can Ask:
SEQUENCE Based
InterPro: Any Known Domain in my Protein ???
SwissProt: Any Protein like mine ???
These ARE Predictions
The Kind Of Questions We Can Ask:
TEXT Based
Medline: Who Worked on my Protein ???
SwissProt: Function of My Protein ???
PDB: Structure of My Protein ???
These are NOT Predictions
What is in Medline ?
MEDLINE covers the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and the preclinical sciences
More than 4,000 biomedical journals and more than 10 million citations, from 1966 to the present
Contains links to biological db and to some journals
•Many papers not dealing with humans are not in Medline
•Before 1970, only the first 10 authors are kept!
Using Medline: Asking a question
During the last Lab Meeting, I heard the word dUTPase.
What can it be ? What has been published on this ?
Using Medline: Asking a question
By Default, Medline Assumes you mean:
Abergel AND dUTPase
Using Medline: Asking a question
I have found the reference I wanted.
Now I want to save it so that I can use it later, for instance to import it into EndNote, my reference manager
Save Your Data in the Proper DataBase format
Retrieving EXACTLY the Information that you need
[AB] [AD]
Restricted fields
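A field restriction is simply a tag in square brackets appended to a search term. The helper names below (`field_term`, `and_query`) are hypothetical, and the word "Marseille" is an invented example value; the tags [AB] and [AD] are the ones shown on the slide, and their exact meaning should be checked against the PubMed help pages.

```python
def field_term(word, tag=None):
    """Quote multi-word phrases and append an optional field tag such as AD."""
    text = f'"{word}"' if " " in word else word
    return f"{text}[{tag}]" if tag else text


def and_query(*terms):
    """Join terms with an explicit AND, which is what a plain space means by default."""
    return " AND ".join(terms)


# Hypothetical query: dUTPase in one restricted field, a place name in another.
query = and_query(field_term("dUTPase", "AB"), field_term("Marseille", "AD"))
print(query)
```

Leaving the tag out gives back the default behaviour: `and_query("Abergel", "dUTPase")` produces the same query Medline assumes when the two words are typed side by side.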
Using Medline: Looking for a Review
I Want to Find the LATEST REVIEW on the dUTPase.
Use The Limit Option of Medline
Using Medline: Looking For a Review
[Screenshot: the Limits panel (step 1-Limits), with its Language, Title OR Abstract and Article type options]
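The same limits can also be written directly into the query, which is handy when searching from a script through the NCBI E-utilities. The sketch below only builds the request URL and does not send it. The function name `review_search_url` is hypothetical, and the mapping of the three limits to the tags `[tiab]`, `[pt]` and `[la]` is my reading of the PubMed field tags, to be checked against the current documentation.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def review_search_url(topic, language="english", retmax=20):
    """Build (but do not send) an ESearch request for review articles on a topic."""
    # Title OR Abstract -> [tiab], Article type -> [pt], Language -> [la]
    term = f"{topic}[tiab] AND review[pt] AND {language}[la]"
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": retmax,
        "sort": "pub_date",   # assumption: newest first, to surface the latest review
    }
    return f"{ESEARCH}?{urlencode(params)}"


url = review_search_url("dUTPase")
print(url)
```

Pasting the printed URL into a browser returns an XML list of PubMed identifiers rather than a formatted page.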
Using Medline: A Few Tips
•Quoted queries (e.g. “down syndrome”) behave as a single word, and are great to improve the relevance of your search
•Adding initials to names (e.g. “Abergel C”), if you can, also reduces your output
•Write down the PubMed Identifier (the number in the PMID field) of that interesting paper you just found. It could be very useful in your subsequent search for related items such as associated gene and protein sequences
Using Medline: A Few Tips
•If a search returns nothing, check for spelling mistakes, wrong field restrictions or leftover Limits settings: these may be the problem
•Use abstracts to enlarge your vocabulary and look for synonyms: some papers on dUTPase might use dUTP pyrophosphatase instead!
•Try the “related papers” button (on the extreme right of the PubMed output) from time to time, to enlarge a search that is not giving you enough references
Using Medline: A Few Tips
•Storing your PDFs:
•Memory is cheap, access is sometimes strange…
•Storing your favourite PDFs is a good idea
•Which name on your disk?
•THE MEDLINE ID NUMBER !!!
•With a reference manager like EndNote
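Naming each PDF after its identifier takes one small helper. This is a sketch only: the function name `pdf_path_for`, the default folder `papers` and the example number are all hypothetical.

```python
from pathlib import Path


def pdf_path_for(pmid, library_dir="papers"):
    """Return the path under which the PDF for a given PubMed ID would be stored."""
    pmid = str(pmid).strip()
    if not pmid.isdigit():                    # PMIDs are plain integers
        raise ValueError(f"not a PubMed ID: {pmid!r}")
    return Path(library_dir) / f"{pmid}.pdf"


print(pdf_path_for(12345678))                 # hypothetical PMID
```

Because the identifier is unique, two papers can never collide on disk, and the file can be traced back to its Medline record without opening it.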
GenBank: an Overview
EMBL
DDBJ
GenBank
EMBL, GenBank and DDBJ hold the same data: the three databases exchange their entries every day.
GenBank: an Overview
GenBank contains EVERY piece of DNA that has been sequenced and made publicly available.
It contains GOOD and BAD data
There is a Historical Aspect in the GenBank data:
-Complex Genes are spread over many entries
GenBank Entries Are Complex because Genes are complex
Prokaryotic Example
[Figure: a prokaryotic gene with its promoter, RBS, mRNA, ORF (from ATG to STOP) and protein]
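The ORF of the prokaryotic example, the stretch running from ATG to a STOP codon, can be located with a short scan. This is a deliberately naive sketch: it reads one strand only, accepts ATG as the only start codon and ignores the promoter and RBS, all of which real gene finders take into account. The sequence is made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_orf(dna):
    """Return the first ATG...STOP stretch (stop codon included), or None."""
    dna = dna.upper()
    start = dna.find("ATG")
    while start != -1:
        for pos in range(start + 3, len(dna) - 2, 3):   # read codon by codon
            if dna[pos:pos + 3] in STOP_CODONS:
                return dna[start:pos + 3]
        start = dna.find("ATG", start + 1)              # no in-frame stop: try next ATG
    return None


# Made-up sequence: upstream region, then ATG ... TAA, then trailing bases.
print(first_orf("ccgaggaggtATGAAACGCATTAGCTAAggc"))
```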
GenBank Entries Are Complex because Genes are complex
[Figure: a gene with its promoter and six exons, giving two mRNAs (form1, form2) and two proteins (form1, form2)]
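How one gene yields two mRNA forms can be mimicked by joining different subsets of exons. The gene, the exon coordinates and the function name `splice` are all made up for this sketch; exons are written in upper case and introns in lower case only to make the result easy to check by eye.

```python
def splice(gene, exons):
    """Concatenate the exon intervals (0-based, end-exclusive) of a gene sequence."""
    return "".join(gene[start:end] for start, end in exons)


# Made-up gene: exon, intron, exon, intron, exon.
GENE = "ATGGCC" "gtaagt" "AAACCC" "gtcagt" "GGGTAA"
EXONS = [(0, 6), (12, 18), (24, 30)]

form1 = splice(GENE, EXONS)                   # all three exons
form2 = splice(GENE, [EXONS[0], EXONS[2]])    # middle exon skipped
print(form1)
print(form2)
```

A database entry that stores only the genomic sequence therefore needs a list of exon coordinates per form, which is one reason such entries become complex.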
Using GenBank: Asking a question
What is the Sequence of the E. coli dUTPase ?
Using GenBank: Asking a question
The Naive Way
Escherichia coli dUTPase
This search reports EVERY GenBank entry that contains both terms.
Most bacterial genome entries (annotated by similarity) contain both terms
Using GenBank: Asking a question
The Right Way
Escherichia coli[organism] dUTPase[definition]
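Why the field-restricted query is so much more selective can be seen on two toy records. The records, their field names and the two helper functions are assumptions made for this sketch and do not reproduce how Entrez works internally.

```python
# Made-up records standing in for GenBank entries.
RECORDS = [
    {"id": "A1", "organism": "Escherichia coli",
     "definition": "E. coli dut gene for dUTPase",
     "features": "dut; dUTPase"},
    {"id": "B2", "organism": "Bacillus subtilis",
     "definition": "B. subtilis complete genome",
     "features": "hypothetical protein, similar to dUTPase of Escherichia coli"},
]


def naive_search(records, *words):
    """Match if every word occurs anywhere in the entry, whatever the field."""
    def full_text(record):
        return " ".join(str(v) for v in record.values()).lower()
    return [r["id"] for r in records
            if all(w.lower() in full_text(r) for w in words)]


def field_search(records, **restrictions):
    """Match only if each word occurs in the field it is restricted to."""
    return [r["id"] for r in records
            if all(word.lower() in r[field].lower()
                   for field, word in restrictions.items())]


print(naive_search(RECORDS, "Escherichia coli", "dUTPase"))
print(field_search(RECORDS, organism="Escherichia coli", definition="dUTPase"))
```

The second record is a genome annotated by similarity: it mentions both terms in passing, so only the restricted query leaves it out.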
Using GenBank: And There Is Plenty More where It comes from…
GenBank Is Redundant:
If a Gene is published more than once, each publication gets its own entry
This can mean MANY ENTRIES if you have SNPs or ESTs
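A first idea of how redundant a set of entries is comes from grouping those that carry exactly the same sequence. The accession numbers and sequences below are made up, and exact matching is the simplest possible criterion: it does not catch SNP variants or overlapping fragments such as ESTs.

```python
from collections import defaultdict


def group_identical(entries):
    """Map each distinct sequence to the list of accession numbers that carry it."""
    groups = defaultdict(list)
    for accession, seq in entries:
        groups[seq.upper()].append(accession)   # ignore upper/lower case differences
    return dict(groups)


# Made-up accessions and sequences, for illustration only.
ENTRIES = [("ACC001", "ATGGCCAAA"), ("ACC002", "atggccaaa"), ("ACC003", "ATGGCCAAG")]

for seq, accessions in group_identical(ENTRIES).items():
    print(seq, accessions)
```

Here ACC003 differs by a single base and stays in its own group, which is exactly the situation the slide warns about for SNPs.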
Using GenBank: Asking a question
What is the Sequence of the E. coli dUTPase ?
What is the Sequence of the Human dUTPase ?
Using GenBank: Finding the Human dUTPase
1-Request Limits
2-Check the box here to exclude ESTs
Using GenBank: Finding the Human dUTPase
The Gene does NOT appear in a single entry
Some Good News…
-This Information is complicated because it is RAW Information
-It is necessary to keep UNINTERPRETED Experimental Information available
-There are SIMPLER alternatives to using this RAW Information:
-Gene Centric Databases
-Protein Databases
OMIM: Finding Out About The Phenotype of a Gene
OMIM™: Online Mendelian Inheritance in Man
A catalog of human genes and genetic disorders
Contains a summary of literature, pictures, and reference information. It also contains numerous links to articles and sequence information.
Gathering Everything you need on a gene
GenBank: What is the Sequence ?
LocusLink: What about this Gene?
ENSEMBL: What is the Context?
MEDLINE: Are There Papers?
OMIM: Are There Illnesses?
The Protein Databases
GenBank: A Big Bag of DNA
PREDICTION + EXPERIMENT
Generic Non Redundant Protein Databases: NR, trEMBL
Specialized Protein Databases: SwissProt, PIR
What Is SwissProt ?
Fully-annotated (manually), non-redundant, cross-referenced, documented protein sequence database.
~100’000 sequences from more than 6’800 different species; 70’000 references (publications); 550’000 cross-references (databases); ~200 Mb of annotations.
Collaboration between the SIB (CH) and EMBL/EBI (UK)
Using SwissProt: Asking a question
We hear the word EPO quite often these days, but what exactly is known about it ?
Using SwissProt: Asking a question
A Simple SwissProt Text Query
EPO HUMAN
The Protein Databases
GenBank: A Big Bag of DNA
PREDICTION + EXPERIMENT
Specialized Protein Databases: SwissProt, PIR, UniProt
Generic Non Redundant Protein Databases: NR, trEMBL
PDB: The Protein Data Bank
Managed by Research Collaboratory for Structural Bioinformatics (RCSB) (USA).
Contains macromolecular structure data on proteins, nucleic acids, protein-nucleic acid complexes, and viruses.
Currently holds ~16’000 structures for about 4’000 different molecules, but far fewer protein families (highly redundant)!
Using PDB: Asking a question
Does tolB have a known Structure? And If the answer is Yes, How can I look at it ?
Using InterPro: Asking a question
Which Domains does the oncogene FosB contain?
Using Domains: Some Statistics
• 10 most common protein domains for H. sapiens
Immunoglobulin and major histocompatibility complex domain
Zinc finger, C2H2 type
Eukaryotic protein kinase
Rhodopsin-like GPCR superfamily
Pleckstrin homology (PH) domain
RING finger
Src homology 3 (SH3) domain
RNA-binding region RNP-1 (RNA recognition motif)
EF-hand family
Homeobox domain
Gathering Everything you need on a Protein
trEMBL: What is the Sequence ?
MEDLINE: Are There Papers?
PDB: Which Structure?
INTERPRO: Which Domains?
SwissProt: What about the Function?