Protein Sequence Databases

Marie-Claude.Blatter@isb-sib.ch
Swiss-Prot group, Geneva
SIB Swiss Institute of Bioinformatics

Protein sequence databases

http://education.expasy.org/cours/Murcia2011/

Murcia, February, 2011


Menu

Introduction

Nucleic acid sequence databases: ENA, GenBank, DDBJ

Protein sequence databases
– UniProt databases (UniProtKB)
– NCBI protein databases

Other databases (Ensembl, IPI, CCDS, …)


Indispensable for bioinformatics studies

1. Databases (free access on the web)

2. Software tools

3. Servers


What is a database?

• A collection of related data, which are
– structured
– searchable
– updated periodically
– cross-referenced

• Also includes the associated tools needed for access/query, download, etc.


Why biological databases ?

• Exponential growth in biological data.

• Data (genomic sequences, protein sequences, 3D structures, 2D gel electrophoresis, MS analysis, microarrays, publications….) are no longer published in a conventional manner, but directly submitted to databases.

• Essential tools for biological research.


The NAR Online Molecular Biology Database Collection in 2011

A total of 1’330 databases

http://nar.oxfordjournals.org/content/38/suppl_1


Categories of databases for Life Sciences

• Sequences (DNA, protein)
• Genomics
• 3D structure
• Mutation/polymorphism
• Protein domain/family
• Metabolism/Pathways
• Bibliography
• ‘Others’ (protein-protein interaction, microarrays…)


Categories of databases for Life Sciences

• Sequences (DNA, protein)
– DNA/RNA: EMBL/GenBank/DDBJ
– Protein: UniProtKB, NCBInr

• Genomics
– OMIM, FlyBase

• 3D structure
– PDB

• Mutation/polymorphism
– dbSNP

• Protein domain/family
– InterPro

• Metabolism/Pathways
– KEGG

• Bibliography
– PubMed

• ‘Others’ (protein-protein interaction, microarrays…)


• DNA sequences
• Human Genome Gene Annotation
• Protein Sequences
• Macromolecular Structure Data
• Microarray Expression Data


Proliferation of databases

• Which one contains the highest-quality data?
• Which one is comprehensive?
• Which one is up to date?
• Which one is redundant?
• Which one is indexed (allows complex queries)?
• Which Web server responds most quickly?
• …?


Awareness of the content and usage of knowledge resources is a pre-requisite to do any type of « serious » research in the field of molecular life sciences (AMB, 2007)


Where can we find…

• A video -> YouTube
• Info on S. Hawking -> Wikipedia
• A book -> Amazon
• A friend -> Facebook
– Usually only one server

• DNA sequence -> EMBL
• Protein sequence -> UniProtKB, RefSeq…
– Several different servers give access to the ‘same’ database


Servers

• ‘Any computer (…) serving out applications or services can technically be called a server.’ (Wikipedia)


EBI: http://www.ebi.ac.uk/


NCBI: http://www.ncbi.nlm.nih.gov/


ExPASy: http://expasy.org


www.uniprot.org


How to find a database ?

Beware: not all servers give access to the latest version of a database. It is important to know the ‘home server’ of a given database.

– ExPASy life sciences directory: -> ‘home’ server links (www.expasy.org/alinks.html)

– Google (http://www.google.com) (not always linked to the ‘home’ server)


http://www.expasy.org/


http://www.expasy.org/links.html


The same data on different servers….

UniProt NCBI


http://srs.dna.affrc.go.jp/srs8/srs?-id+1QexuT1Yn4Di0xF+[uniprot_swissprot-AccNumber:P16855]+-e


Proteins…proteins


Protein sequences are the fundamental determinants of biological structure and function.

http://www.ncbi.nlm.nih.gov/protein


Protein sequence databases are essential for…

- Identification of proteins by proteomics -> completeness, sequence quality

‘producing large protein lists is not the end point in Proteomics’ -> extract knowledge

- Similarity searches, BLAST (functional prediction) -> sequence quality (no redundancy)

- Training datasets (prediction tools, PTM etc.) -> sequence and annotation quality

- Creation of DNA chips for mRNA expression studies -> completeness (complete proteome), sequence quality


TrEMBL, GenPept, Swiss-Prot, RefSeq, PRF, Ensembl, CCDS, UniParc, UniProtKB, PDB, (PIR), (IPI), UniMES, TPA, NCBInr … ?


These identifiers all point to the same TP53 (p53) sequence!

P04637, NP_000537, ENSG00000141510, CCDS11118, UPI000002ED67, IPI00025087, HIT000320921, XP_001172091, DD954676 , JT0436 , etc.
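To make the identifier problem concrete, here is a minimal Python sketch that guesses the source database of an identifier from its shape. The patterns and the `guess_source` helper are illustrative assumptions, not an official specification; real identifier formats have more variants, and several identifiers in the list above are deliberately left as "unknown".

```python
import re

# Illustrative patterns only (assumptions): each database documents its own
# identifier format, and those formats have more variants than covered here.
ID_PATTERNS = [
    ("UniProtKB accession", re.compile(
        r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
    ("RefSeq protein", re.compile(r"^[NXY]P_\d+(?:\.\d+)?$")),
    ("Ensembl gene", re.compile(r"^ENSG\d{11}$")),
    ("CCDS", re.compile(r"^CCDS\d+(?:\.\d+)?$")),
    ("UniParc", re.compile(r"^UPI[0-9A-F]{10}$")),
    ("IPI", re.compile(r"^IPI\d{8}(?:\.\d+)?$")),
]


def guess_source(identifier):
    """Return the name of the first pattern matching the identifier, else 'unknown'."""
    ident = identifier.strip()
    for name, pattern in ID_PATTERNS:
        if pattern.match(ident):
            return name
    return "unknown"


if __name__ == "__main__":
    tp53_ids = ["P04637", "NP_000537", "ENSG00000141510", "CCDS11118",
                "UPI000002ED67", "IPI00025087", "HIT000320921",
                "XP_001172091", "DD954676", "JT0436"]
    for ident in tp53_ids:
        print(f"{ident:18s} {guess_source(ident)}")
```

Recognising the shape of an identifier is only a first step: deciding that two identifiers refer to the same protein still requires the cross-references maintained by the databases themselves.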


A HUPO test sample study reveals common problems in mass spectrometry–based proteomics

PubMed 19448641 (2009)

• A single mass spectrometry experiment can identify up to about 4000 proteins (15’000 peptides)

• Protein databases vary greatly in terms of their curation, completeness and comprehensiveness (searching different protein databases can give different results).

• Only 7 of 27 labs were able to identify the 20 human proteins present in a sample, mainly because the search engines used cannot distinguish among different identifiers for the same protein…


Protein sequence origin…


Protein sequence origin

More than 99 % of the protein sequences are derived from the translation of nucleotide sequences (genomes and/or cDNAs)

-> Important to know where the protein sequence comes from (sequencing & gene prediction quality)!


Flood of data

example with the genome sequences…


New challenge

Flood of data -> needs to be stored, curated and made available for analysis and knowledge discovery


… ~ 2500 genomes sequenced (single organism, varying sizes, including virus)

… ~ 5’000 ongoing genome sequencing projects


http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html

http://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi?taxid=10239&hopt=stat

~ 50-100 genomes/month

+ ~2’500 viral genomes

=> Total ~ 5’000 genomes


… cDNA sequencing projects (ESTs or cDNAs)

… metagenome sequencing projects = environmental samples: multiple ‘unknown’ organisms


Metagenomics: study of genetic material recovered directly from environmental samples

• Global Ocean Sampling (C. Venter): 1 ml of sea water contains 1 million bacteria and 10 million viruses

• Whale fall (AAFZ00000000.1)

• Soil, sand beach, New-York air, …

• Human fluids, mouse gut (millions of bacteria within human body)

• Water treatment industry…

• Lists of projects: http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi

Venter’s Sorcerer II


… personal human genomes

New-generation sequencers: Illumina: 25 billion bp/day


http://www.youtube.com/watch?v=mVZI7NBgcWM

…2700 genomes in 2010, 30’000 genomes in 2011 ?

2’000’000 $ (2007)
70’000’000 $ (diploid, 2007)
3’000’000’000 $ (public consortium, 2000)
300’000’000 $ (Celera, 2000)
2010


But… we now know that his apoE allele is the one associated with an increased risk of Alzheimer's disease and that he has the ‘blue eye’ allele…


apoE gene (Ensembl genome browser)


New projects

• 1000 genomes (first publication, October 2010)

• Multiple personal genomes (sexual cells, lymphoid cells, cancer cells…)

• International cancer genome consortium (www.icgc.org).

They look at the most common cancers and for each they sequence the genome of 500 patients with cancer and 500 healthy individuals….


How many protein-coding genes in the end?


Peabody museum exhibition on the Tree of Life http://www.peabody.yale.edu/exhibits/treeoflife/


190'500'025'042

1st estimate: ~30 million species (1.8 million named)

2nd estimate:

20 million bacteria/archaea x 4'000 genes
1 million protists x 6'000 genes
5 million insects x 14'000 genes
2 million fungi x 6'000 genes
0.5 million plants x 20'000 genes
0.5 million molluscs, worms, arachnids, etc. x 20'000 genes
0.1 million vertebrates x 25'000 genes

The calculation: 2x10^7x4000 + 1x10^6x6000 + 5x10^6x14000 + 2x10^6x6000 + 5x10^5x20000 + 5x10^5x20000 + 1x10^5x25000 + 20000 (Craig Venter) + 42 (Douglas Adams) + …
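As a quick check of the arithmetic, the seven group terms of the second estimate can be summed with a few lines of Python. This is only a sketch of the calculation listed above; the names `ESTIMATES` and `total_genes` are illustrative. The seven group terms alone add up to 190'500'000'000; the headline figure also includes the tongue-in-cheek extras (Craig Venter, Douglas Adams, …).

```python
# (group, estimated number of species, assumed genes per species), as listed above
ESTIMATES = [
    ("bacteria/archaea", 20_000_000, 4_000),
    ("protists", 1_000_000, 6_000),
    ("insects", 5_000_000, 14_000),
    ("fungi", 2_000_000, 6_000),
    ("plants", 500_000, 20_000),
    ("molluscs, worms, arachnids, etc.", 500_000, 20_000),
    ("vertebrates", 100_000, 25_000),
]


def total_genes(estimates):
    """Sum species x genes-per-species over all groups."""
    return sum(species * genes for _, species, genes in estimates)


TOTAL = total_genes(ESTIMATES)

if __name__ == "__main__":
    for name, species, genes in ESTIMATES:
        print(f"{name:35s} {species * genes:>18,d}")
    print(f"{'total':35s} {TOTAL:>18,d}")
```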


About 190 billion proteins (?)

About 13.0 million ‘known’ protein sequences in 2011 (from ~300’000 species)

More than 99 % of the protein sequences are derived from the translation of nucleotide sequences

Less than 1 % come from direct protein sequencing (Edman, MS/MS…)

-> It is important that users know where the protein sequence comes from (sequencing & gene prediction quality)!


The ideal life of a sequence …

cDNAs, ESTs, genes, genomes, … -> Nucleic acid sequence databases -> Protein sequence databases


ENA (European Nucleotide Archive, EMBL-Bank), GenBank, DDBJ (DNA Data Bank of Japan)

Archive of primary sequence data and corresponding annotation submitted by the laboratories that did the sequencing.


http://www.insdc.org/

ENA/GenBank/DDBJ


The hectic life of a sequence …

cDNAs, ESTs, genes, genomes, … -> ENA, GenBank, DDBJ

Data not submitted to public databases, delayed or cancelled…


Journals do not (SHOULD NOT) accept a paper dealing with a nucleic acid sequence if the ENA/GenBank/DDBJ accession (AC) number is not available…

‘journal publishers generally require deposition prior to publication so that an accession number can be included in the paper.’ http://www.ncbi.nlm.nih.gov/books/bv.fcgi?highlight=refseq&rid=handbook.section.GenBank_ASM#GenBank_ASM.RefSeq

…not the case for protein sequences

!!! no longer the case for many genomes !!!


ENA/GenBank/DDBJ

• Serve as archives: ‘nothing goes out’
• Contain all public sequences derived from:
– Genome projects (> 80 % of entries)
– Sequencing centers (cDNAs, ESTs…)
– Individual scientists (15 % of entries)
– Patent offices (e.g. European Patent Office, EPO)

• Currently: ~200x10^6 sequences, ~300x10^9 bp
• Sequences from > 300’000 different species


Archival databases:

- Can be very redundant for some loci

- Sequence records are owned by the original submitter and cannot be altered by a third party (except TPA)


Organisms with the highest redundancy…


taxonomy

Cross-references

references

accession number


CDS annotation (prediction or experimentally determined)

sequence

CDS: CoDing Sequence (proposed by submitters)


The hectic life of a sequence …

cDNAs, ESTs, genes, genomes, … -> ENA, GenBank, DDBJ (with or without annotated CDS provided by authors)

Data not submitted to public databases, delayed or cancelled…

CDS (CoDing Sequence): portion of DNA/RNA translated into protein (from Met to STOP)

Experimentally proved or derived from gene prediction

!!! not so well documented !!!
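The "from Met to STOP" definition can be illustrated with a minimal Python sketch that translates a nucleotide string from its first ATG to the first in-frame stop codon. It assumes the standard genetic code and a forward-strand, intron-free sequence; the function name `translate_cds` is illustrative, and real CDS features may span several exons, lie on the reverse strand, or use another genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# the organism uses the standard code).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(nucleotides):
    """Translate from the first ATG (Met) up to the first in-frame stop codon.

    Returns an empty string when no ATG is found. Codons containing
    ambiguous bases (e.g. N) are translated as 'X'.
    """
    seq = nucleotides.upper().replace("U", "T")
    start = seq.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*":  # stop codon reached
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate_cds("ccATGGCCAAATAAgg"))
```

Choosing the first ATG is itself a prediction: when the true start codon is unknown, a different choice yields a different protein N-terminus, which is one reason the origin of a CDS annotation matters.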


CONTIG --------------------------------------------------------------------------------------CGANGGCCTATCAACAATGAAAGGTCGAAACCTG

Genomic AGCTACAAACAGATCCTTGATAATTGTCGTTGATTTTACTTTATCCTAAATTTATCTCAAAAATGTTGAAATTCAGATTCGTCAAGCGAGGGCCTATCAACAATG-AAGGTCGAAACCTG *** ************ ** * **************

 CONTIG CGTTTACTCCGGATACAAGATCCACCCAGGACACGGNAAAGAGACTTGTCCGTACTGACGGAAAG-------------------------------------------------------Genomic CGTTTACTCCGGATACAAGATCCACCCAGGACACGG-AAAGAGACTTGTCCGTACTGACGGAAAGGTGAGTTCAGTTTCTCTTTGAAAGGCGTTAGCATGCTGTTAGAGCTCGTAAGGTA ************************************ **************************** CONTIG ------------------------------------------------------------------------------------------------------------------------Genomic TATTGTAATTTTACGAGTGTTGAAGTATTGCAAAAGTAAAGCATAATCACCTTATGTATGTGTTGGTGCTATATCTTCTAGTTTTTAGAAGTTATACCATCGTTAAGCATGCCACGTGTT

CONTIG ----------------------------------------------GTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGACGenomic GAGTGCGACAAACTACCGTTTCATGATTTATTTATTCAAATTTCAGGTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGAC ************************************************************************** CONTIG TGTCCTCTACAGAATCAAGAACAAGAAG---------------------------------------------GGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTCGenomic TGTCCTCTACAGAATCAAGAACAAGAAGGTACTTGAGATCCTTAAACGCAGTTGAAAATTGGTAATTTTACAGGGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTC

**************************** ***********************************************

CONTIG CGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAAGGAGenomic CGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAAGGA ************************************************************************************************************************

CONTIG TGCCAACAAGGCTGTCCGTGCCGCCAAGGCTGCTNCCAACAAG-----------------------------------------------------------------------------Genomic TGCCAACAAGGCTGTCCGTGCCGCCAAGGCTGCTGCCAACAAGGTAAACTTTCTACAATATTTATTATAAACTTTAGCATGCTGTTAGAGCTTGTAAGGTATATGTGATTTTACGAGTGT ********************************** ******** CONTIG -------------------------------------------------------------------------------------------------------------------GNAAAGenomic GTTATTTGAAGCTGTAATATCAATAAGCATGTCTCGTGTGAAGTCCGACAATTTACCATATGCATGAAATTTAAAAACAAGTTAATTTTGTCAATTCTTTATCATTGGTTTTCAGGAAAA * ***

CONTIG GAAGGCCTCTCAGCCAAAGACCCAGCAAAAGACCGCCAAGAATNTNAAGACTGCTGCTCCNCGTGTCGGNGGAAANCGATAAACGTTCTCGGNCCCGTTATTGTAATAAATTTTGTTGAC

Genomic GAAGGCCTCTCAGCCAAAGACCCAGCAAAAGACCGCCAAGAATGTGAAGACTGCTGCTCCACGTGTCGGAGGAAAGCGATAAACGTTCTCGGTCCCGTTATTGTAATAAATTTTGTTGAC******************************************* * ************** ******** ***** **** * *********** ***************************

 CONTIG C-----------------------------------------------------------------------------------------------------------------------Genomic CGTTAAAGTTTTAATGCAAGACATCCAACAAGAAAAGTATTCTCAAATTATTATTTTAACAGAACTATCCGAATCTGTTCATTTGAGTTTGTTTAGAATGAGGACTCTTCGAATAGCCCA *  

Alignment between an mRNA (CONTIG) and a genomic sequence (Genomic), with the CoDing Sequence marked: the aligned blocks are the exons; the genomic stretches with no counterpart in the mRNA are the introns.
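The way exons show up in such an alignment can be sketched in a few lines of Python: in the mRNA row, runs of aligned characters correspond to exons, and runs of gap characters correspond to introns. The toy alignment row and the `aligned_blocks` helper below are hypothetical, chosen only to illustrate the idea.

```python
def aligned_blocks(mrna_row, gap="-"):
    """Return (start, end) column ranges where the mRNA row is not a gap.

    Ranges are 0-based and end-exclusive. Each range is a candidate exon;
    the gap runs between them are candidate introns.
    """
    blocks = []
    start = None
    for pos, char in enumerate(mrna_row):
        if char != gap and start is None:
            start = pos                      # a new aligned block begins
        elif char == gap and start is not None:
            blocks.append((start, pos))      # the block ended just before this gap
            start = None
    if start is not None:
        blocks.append((start, len(mrna_row)))
    return blocks


if __name__ == "__main__":
    toy_mrna_row = "ATGGC-----TTAAG---CC"   # hypothetical alignment row
    print(aligned_blocks(toy_mrna_row))
```

A real alignment also contains single-column gaps caused by sequencing errors, so a practical version would merge blocks separated by very short gaps rather than treating every gap as an intron.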


CDS translation provided by ENA

CDS provided by the submitters

The first Met !


A eukaryotic gene (UCSC)

Figure labels: 5’ and 3’ ends, Met, STOP, initial exon, internal exons, final exon, introns, 3’ untranslated region

This particular gene lies on the reverse strand !


UCSC: human EPO

5’ 3’

mRNAs and their corresponding CDS annotation (from EMBL/GenBank/DDBJ)

contig


Complete genome (submitted)

but only ~ 2,000 CDS/proteins available !


http://www.ebi.ac.uk/swissprot/sptr_stats/index.html

…annotated CDS in UniProtKB


Variable level of sequence quality (ENA/GenBank/DDBJ)

- Sequencing quality
- Gene prediction quality

Authors can specify the nature of the CDS by using the qualifier: "/evidence=experimental" or "/evidence=not_experimental".

Very rarely done…
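How rarely the qualifier is used could be measured with a small script along these lines. This is a sketch under stated assumptions: it expects EMBL-style feature-table lines ("FT" prefix, feature key in columns 6 to 20, location and qualifiers from column 22), and the helper `ft_line` and the demo entry are hypothetical, built only for illustration.

```python
from collections import Counter


def ft_line(key, content):
    """Build one feature-table line (hypothetical helper for the demo)."""
    return f"FT   {key:<15} {content}"


def cds_evidence_counts(lines):
    """Count CDS features by the value of their /evidence qualifier.

    Assumes EMBL-style columns: key in columns 6-20, content from column 22.
    CDS features without the qualifier are counted as 'unspecified'.
    """
    features = []  # one [key, evidence] pair per feature
    for line in lines:
        if not line.startswith("FT"):
            continue
        key, content = line[5:20].strip(), line[21:].strip()
        if key:
            features.append([key, "unspecified"])   # a new feature starts
        elif features and content.startswith("/evidence="):
            features[-1][1] = content.split("=", 1)[1].strip('"').lower()
    return Counter(ev for key, ev in features if key == "CDS")


if __name__ == "__main__":
    demo_entry = [  # hypothetical feature table, not a real database entry
        ft_line("source", "1..300"),
        ft_line("CDS", "10..99"),
        ft_line("", "/evidence=experimental"),
        ft_line("CDS", "120..290"),
        ft_line("", '/product="hypothetical protein"'),
    ]
    print(cds_evidence_counts(demo_entry))
```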


Variable level of sequence quality

DNA vs RNA


RNA

EST: Expressed Sequence Tags, produced by one-shot sequencing of a cloned cDNA (no CDS, but proteomic tools give access to ‘translated ESTs’)

HTC: High-Throughput cDNAs (CDS annotation)

DNA

GSS: Genome Survey Sequences: similar to the EST division, except that most of the sequences are genomic in origin (no annotation, no CDS, with some exceptions (Drosophila))

HTG: High-Throughput Genomic Sequences: single-pass, unfinished genomic sequences (no annotation, no CDS, with some exceptions (Leishmania))

WGS: Whole Genome Shotgun: contigs of a sequencing project. WGS data can contain annotation and should be updated as sequencing progresses (CDS annotation)
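The practical consequence of these divisions is that only some of them can feed protein sequence databases directly. A minimal Python lookup table makes this explicit; the structure and names (`Division`, `DIVISIONS`, `divisions_with_cds`) are illustrative, and the "usually" flags ignore the exceptions noted above.

```python
from typing import NamedTuple, Optional


class Division(NamedTuple):
    molecule: str        # "RNA" or "DNA"
    description: str
    usually_has_cds: bool  # ignores the exceptions (e.g. Drosophila, Leishmania)


DIVISIONS = {
    "EST": Division("RNA", "Expressed Sequence Tags", False),
    "HTC": Division("RNA", "High-Throughput cDNAs", True),
    "GSS": Division("DNA", "Genome Survey Sequences", False),
    "HTG": Division("DNA", "High-Throughput Genomic Sequences", False),
    "WGS": Division("DNA", "Whole Genome Shotgun", True),
}


def divisions_with_cds(molecule: Optional[str] = None):
    """Return division codes that usually carry CDS annotation, optionally by molecule type."""
    return sorted(
        code for code, div in DIVISIONS.items()
        if div.usually_has_cds and (molecule is None or div.molecule == molecule)
    )


if __name__ == "__main__":
    print(divisions_with_cds())        # divisions usable as a protein source
    print(divisions_with_cds("DNA"))
```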


Complete proteomes

Complete genomes

?


Complete genomes ?? UCSC


27478 contigs

Genome Reference Consortium: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/data.shtml

N50 is a weighted median statistic such that 50% of the entire assembly is contained in contigs equal to or larger than this value
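The N50 definition can be turned into a short Python function: sort the contigs from longest to shortest and walk down the list until half of the total assembly length is covered. This is a minimal sketch of the definition quoted above; the function name `n50` and the example lengths are illustrative.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together contain
    at least 50% of the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


if __name__ == "__main__":
    # Total length 20; the two longest contigs (6 + 5 = 11) cover half of it.
    print(n50([2, 3, 4, 5, 6]))
```

Because it is weighted by length, N50 is dominated by the largest contigs: an assembly with thousands of tiny contigs can still have a high N50 if a few long contigs hold most of the sequence.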


Genome sequencing and assembly

Some caveats to deal with…

• ~ 350 gaps in 2010 (human genome)

• In the near future, we will have to deal with ‘incomplete genome’ sequences (never finished, metagenome…)… Prediction of ‘partial’ genes/exons is complex !

• Updates of genome sequences: not always ‘stable’ data…

• We are all different: -> ‘pan genome’ ?


From nucleic acid to amino acid sequence databases…


The hectic life of a protein sequence …

[Diagram: cDNAs, ESTs, genomes, … -> nucleic acid databases (ENA, GenBank, DDBJ) -> protein sequence databases, …if the submitters provide an annotated Coding Sequence (CDS) (1/10 ENA entries). No CDS: gene prediction (RefSeq, Ensembl). Data not submitted to public databases, delayed or cancelled… Additional label: RefSeq, Ensembl and other*]

* 1000 genomes: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/2010_11/


Why do things the simple way when you can do them in a very complex one?


The hectic life of a sequence …

[Diagram of sequence data flow. Sources: cDNAs, ESTs, genomes, … (ENA, GenBank, DDBJ); data not submitted to public databases, delayed or cancelled…; scientific publications derived sequences. Coding sequences provided by submitters: TrEMBL, GenPept. Coding sequences provided by submitters and gene prediction: RefSeq, Ensembl. Other resources shown: Swiss-Prot, PRF, CCDS, UniParc, UniProtKB, PDB, (PIR), (IPI), UniMES, TPA, + all ‘species’ specific databases (EcoGene, TAIR, …)]


Major ‘general’ protein sequence database ‘sources’

UniProtKB: Swiss-Prot + TrEMBL

NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq + TPA

PIR PDB PRF

UniProtKB/Swiss-Prot: manually annotated protein sequences (12’000 species)

UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non redundant with Swiss-Prot (300’000 species)

GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (300’000 species ?)

PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB

PDB: Protein Data Bank: 3D data and associated sequences

PRF: Protein Research Foundation: journal scan of ‘published’ peptide sequences

RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (11’000 species)

TPA: Third Party Annotation

Integrated resources

‘cross-references’

Resources kept separated

TPA


Swiss-Prot

TrEMBL

Look for toll-like receptor 4 (Homo sapiens)

www.uniprot.org


[Screenshot labels: search results coming from GenPept, Swiss-Prot and RefSeq, followed by several further GenPept entries]

Look for toll-like receptor 4 (Homo sapiens)

http://www.ncbi.nlm.nih.gov/


Menu

Introduction

Nucleic acid sequence databases: ENA, GenBank, DDBJ

Protein sequence databases

UniProt databases (UniProtKB)

NCBI protein databases


UniProt

. What is UniProt ?

. UniProtKB sequence curation

. UniProtKB biological data curation

. Statistics

. Access to UniProtKB


UniProt consortium: EBI + SIB + PIR


www.uniprot.org


UniProt databases


UniProtKB: protein sequence knowledgebase with 2 sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (~13 million entries)

UniParc: protein sequence archive (the ENA equivalent at the protein level). Each entry contains a protein sequence with cross-links to the other databases where the sequence is found (active or not). Not annotated (query, Blast, download) (~25 million entries)

UniRef: 3 sets of protein sequence clusters at 100, 90 and 50% sequence identity; useful to speed up sequence similarity searches (BLAST) (query, Blast, download) (UniRef100: 10 million entries; UniRef90: 7 million entries; UniRef50: 3.3 million entries)

UniMES: protein sequences derived from metagenomic projects (mostly Global Ocean Sampling (GOS)) (download) (8 million entries, included in UniParc)
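The idea behind a sequence archive such as UniParc (one record per distinct sequence, with cross-links to every database that carries it) can be sketched in a few lines. This is an illustration only, not how UniParc is implemented; the function `build_archive`, the accessions and the sequences are all hypothetical.

```python
from collections import defaultdict


def build_archive(records):
    """Group identical sequences into one archive record.

    records: iterable of (database, accession, sequence) tuples.
    Returns a dict mapping each distinct sequence to the set of
    (database, accession) cross-links where it was found.
    """
    archive = defaultdict(set)
    for database, accession, sequence in records:
        archive[sequence.upper()].add((database, accession))
    return dict(archive)


# Hypothetical records: two databases hold the same sequence.
records = [
    ("TrEMBL", "X00001", "MKTAYIAKQR"),
    ("GenPept", "Y00002", "MKTAYIAKQR"),
    ("RefSeq", "Z00003", "MKTAYIAKQL"),
]
archive = build_archive(records)
for sequence, links in archive.items():
    print(sequence, sorted(links))
```

Keying on the sequence itself is the design choice that makes the archive non-redundant at the sequence level while keeping track of every source.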


UniProtKB/Swiss-Prot and UniProtKB/TrEMBL give access to all the protein sequences which are available to the public.

However, UniProtKB excludes the following protein sequences:
- Most non-germline immunoglobulins and T-cell receptors
- Synthetic sequences
- Most patent application sequences
- Small fragments encoded from nucleotide sequence (<8 amino acids)
- Pseudogenes*
- Fusion/truncated proteins
- Not real proteins

* many putative pseudogene sequences may be expected to remain in UniProtKB for some time as it can be difficult to prove the non-existence of a protein


UniProtKB: an encyclopedia on proteins

composed of 2 sections: UniProtKB/TrEMBL and UniProtKB/Swiss-Prot

released every 4 weeks


UniProtKB

from ENA to TrEMBL

UniProtKB protein sequence data are mainly derived from ENA (CDS) but also from Ensembl and other sequence resources such as RefSeq or model organism databases (MODs).

Data from the PIR database have been integrated in UniProt since 2003.


ENA -> TrEMBL: automated extraction of protein sequence (translated CDS), gene name and references, + automated annotation


The quality of UniProtKB/TrEMBL data, including the protein sequence, is directly dependent on the information provided by the submitter of the original nucleotide entry.

Automated annotation
• Redundancy check (100% merge (same length, not fragment))
• Family attribution (InterPro)
• Many other cross-references
• Rule-based automated annotation (~38% of TrEMBL entries)

Automated annotation systems:
- UniRule (RuleBase, HAMAP; manually reviewed)
- SAAS (automatically generated rules, i.e. via InterPro)
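Rule-based automated annotation can be pictured as "if an entry matches these conditions, add this annotation". The sketch below is a deliberately simplified illustration and does not reproduce UniRule or SAAS; the classes `Entry` and `Rule`, the signature identifier `SIG_DEMO_1` and the keyword are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    accession: str
    taxonomy: str          # lineage string, e.g. "Bacteria; Proteobacteria"
    signatures: set        # family/domain signatures matched by the sequence
    keywords: set = field(default_factory=set)


@dataclass(frozen=True)
class Rule:
    required_signature: str
    required_taxon: str
    keyword: str


def apply_rules(entry, rules):
    """Add a rule's keyword when both its signature and taxon conditions hold."""
    for rule in rules:
        if (rule.required_signature in entry.signatures
                and entry.taxonomy.startswith(rule.required_taxon)):
            entry.keywords.add(rule.keyword)
    return entry


# Hypothetical rule and entries.
rules = [Rule("SIG_DEMO_1", "Bacteria", "Demo keyword")]
bacterial = apply_rules(Entry("DEMO01", "Bacteria; Proteobacteria", {"SIG_DEMO_1"}), rules)
eukaryotic = apply_rules(Entry("DEMO02", "Eukaryota; Metazoa", {"SIG_DEMO_1"}), rules)
print(bacterial.keywords, eukaryotic.keywords)
```

Restricting a rule by taxonomy as well as by signature is what keeps an annotation from being propagated to organisms where it may not apply.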


UniProtKB/TrEMBL entry (www.uniprot.org):

• One protein sequence, one species
• Automated annotation: Keywords and Gene Ontology
• Automated annotation: Function, Subcellular location, Catalytic activity, Sequence similarities…
• Automated annotation: transmembrane domains, signal peptide…
• Cross-references to over 125 databases
• References
• Protein and gene names, taxonomic information


UniProtKB

from TrEMBL to Swiss-Prot

Once manually annotated and integrated into Swiss-Prot, the entry is deleted from TrEMBL

-> minimal redundancy


ENA -> TrEMBL: automated extraction of protein sequence (translated CDS), gene name and references, + automated annotation

TrEMBL -> Swiss-Prot: manual annotation of the sequence and associated biological information


UniProtKB: from TrEMBL to Swiss-Prot

Manual annotation

1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)

2. Biological information (sequence analysis, extract literature information, ortholog data propagation, …)


MSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAARAFDQIDNAPEEKARGITINTSHVEYDTPTRHYAHVDCPGHADYVKNMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFLNKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALE GDAEWEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRGTVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGRAGENVGVLLRGIKREEIERGQVLAKPGTIKPHTKFESEVYILSKDEGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEMVMPGDNIKMV VTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG

UniProtKB/Swiss-Prot entry (www.uniprot.org):

• One protein sequence, one gene, one species
• Manual annotation: Keywords and Gene Ontology
• Manual annotation: Function, Subcellular location, Catalytic activity, Disease, Tissue specificity, Pathway…
• Manual annotation: Post-translational modifications, variants, transmembrane domains, signal peptide…
• Cross-references to over 125 databases
• References
• Protein and gene names, taxonomic information
• Alternative products: protein sequences produced by alternative splicing, alternative promoter usage, alternative initiation…


In a UniProtKB/Swiss-Prot entry, you can expect to find:

• A (often corrected) protein sequence and the description of various isoforms/variants;

• All the names of a given protein (and of its gene);

• A summary of what is known about the protein: function, PTM, tissue expression, disease, 3D data, etc.;

• A description of important sequence features: domains, PTMs, variations, etc.;

• A selection of references;

• Selected keywords and ontologies;

• Numerous cross-references (central hub).


UniProtKB

1- Sequence curation


UniProtKB: from TrEMBL to Swiss-Prot

Manual annotation

1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)

2. Biological information (extract literature information, ortholog data propagation, protein sequence analysis…)


The displayed protein sequence

…canonical, representative, consensus…


The displayed sequence is the most prevalent protein sequence and/or the protein sequence which is also found in orthologous species.

The displayed sequence is generally derived from the translation of the genomic sequence (when available).

Sequence differences are documented.

1 entry <-> 1 gene (1 species) 1 displayed sequence

(annotation of alternative sequences, when available)

UniProtKB/Swiss-Prot protein sequence annotation

‘Merging policy’: a gene-centric view of protein space


What is the current status?

• At least 20% of Swiss-Prot entries required a minimal amount of curation effort so as to obtain the “correct” sequence.

• Typical problems
– unsolved conflicts;
– uncorrected initiation sites;
– frameshifts;
– other ‘problems’


… once a gene on chromosome 11…


Quality of protein information from genome projects

• Let’s look at proteins originating from genome projects:

– Drosophila: the paradigm of what a curated genome should look like (thanks to FlyBase): only 1.8% of the gene models conflict with Swiss-Prot sequences;

– Arabidopsis: a typical example of a genome where a lot of annotation was done when it was sequenced, but with no update since then (at least in the public view): 20% of the gene models are erroneous;

– Tetraodon nigroviridis: the typical example of a quick and dirty automatic run through a genome with no manual intervention: >90% of the gene models produce incorrect proteins;

– Bacteria and Archaea have almost no splicing, so predictions are “easier”; however, errors are still made: start codons, missed small proteins (<100 aa)…


UniProtKB/Swiss-Prot: protein sequence annotation


Example of problem (derived from gene prediction pipeline)

Ensembl completes the human ‘proteome’ by predicting/annotating missing genes according to orthologous sequences..

ID   URAD_HUMAN            Unreviewed;       171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;

In primates the genes coding for the enzymes for the degradation of uric acid were inactivated and converted to pseudogenes.
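Entries in this flat-file layout are easy to inspect programmatically because each line starts with a two-letter line code. The following minimal sketch collects the values per line code from an excerpt of the entry shown above; the helper `parse_flat_lines` is invented for this illustration and is not an official parser.

```python
def parse_flat_lines(text):
    """Collect the values found under each two-letter line code."""
    fields = {}
    for line in text.splitlines():
        code, value = line[:2], line[2:].strip()
        # Keep only lines that start with an upper-case two-letter code.
        if len(code) == 2 and code.isalpha() and code.isupper() and value:
            fields.setdefault(code, []).append(value)
    return fields


# Excerpt of the URAD_HUMAN entry quoted in the slide.
EXCERPT = """\
ID   URAD_HUMAN            Unreviewed;       171 AA.
AC   A6NGE7;
GN   Name=PRHOXNB;
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;
"""

fields = parse_flat_lines(EXCERPT)
entry_name = fields["ID"][0].split()[0]
accession = fields["AC"][0].rstrip(";")
pe_level = int(fields["PE"][0].split(":")[0])
print(entry_name, accession, pe_level)
```

Reading the `PE` line this way gives direct access to the protein existence level, which for this entry flags a merely predicted protein.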


• Producing a clean set of sequences is not a trivial task;

• It is not getting easier as more and more types of sequence data are submitted;

• It is important to pursue our efforts to make sure we provide our users with the most correct set of sequences for a given organism.


‘Protein existence’ tag

• The ‘Protein existence’ tag indicates the type of evidence that supports the existence of a given protein;

• Different qualifiers:
1. Evidence at protein level (~18%) (MS, western blot (tissue specificity), immuno (subcellular location),…)
2. Evidence at transcript level (~19%)
3. Inferred from homology (~58%)
4. Predicted (~5%)
5. Uncertain (mainly in TrEMBL)


In order to avoid ‘pseudogenes’ and most of the improbable protein sequences, you can filter your query to exclude sequences whose ‘protein existence’ tag is ‘Uncertain’.
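The same filter can be applied to a local list of entries once their protein existence level is known. This is a minimal sketch: the five levels come from the slides, while the function `drop_uncertain` and the accessions `DEMO01` and `DEMO02` are hypothetical.

```python
# The five 'Protein existence' levels listed in the slides.
PROTEIN_EXISTENCE = {
    1: "Evidence at protein level",
    2: "Evidence at transcript level",
    3: "Inferred from homology",
    4: "Predicted",
    5: "Uncertain",
}


def drop_uncertain(entries):
    """Keep (accession, level) pairs whose existence level is not 'Uncertain'."""
    return [
        (accession, level)
        for accession, level in entries
        if PROTEIN_EXISTENCE[level] != "Uncertain"
    ]


# A6NGE7 is the 'Predicted' entry shown earlier; the other two are made up.
entries = [("A6NGE7", 4), ("DEMO01", 1), ("DEMO02", 5)]
kept = drop_uncertain(entries)
print(kept)
```

Note that a ‘Predicted’ entry such as A6NGE7 passes this filter; a stricter analysis could keep only levels 1 and 2.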


The ‘alternative’ sequence(s)


How many proteins at the end?

Example with human


(Jensen O.N., Curr. Opin. Chem. Biol., 2004, 8, 33-41, PMID: 15036154).

Proteome complexity: example with human

Not predictable at the genome level! -> important post-genomic data!

~20’000


UniProtKB/Swiss-Prot

1 entry <-> 1 gene (1 species)

Annotation of the sequence differences

(including conflicts, polymorphisms, splice variants, etc.)

-> annotation of protein diversity


Multiple alignment of the end of the available GCR sequences

Annotation of the sequence differences (protein diversity)

1 entry <-> 1 gene (1 species)

…and natural variant


P04150

www.uniprot.org


UniProtKB (and RefSeq) under-represent alternatively spliced products

Transcript variants are only made when there is information available on the full-length nature of the product; if multiple alternate exons are found through the length of the gene, no assumption is made about the combination of the alternate exons that exists in vivo.

http://www.ncbi.nlm.nih.gov/books/NBK50679/#RefSeqFAQ.what_does_a_reviewed_status_me


Important remark

Available in separate files!

> 30’000 additional sequences (total)


The ‘alternative’ sequence(s)

not ‘directly available’ for a lot of tools, including protein identification tools and Blast, depending on the server!…


Blast P04150 against Swiss-Prot / Homo sapiens @ UniProt

Isoform sequences


Blast P04150 against Swiss-Prot / Homo sapiens @ NCBI

The isoform sequences are not present in the NCBI protein database! The .x number (P06401.4) corresponds to the version number of the sequence, not to an alternatively spliced sequence!
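The difference between a sequence version suffix and an isoform identifier can be checked mechanically. The sketch below assumes two notations: the `.x` version suffix described above, and a hyphenated isoform suffix such as `P04150-2`, which is how UniProtKB names isoform sequences but is not shown on this slide. The function `describe` is invented for the example.

```python
import re

# accession, optional "-N" isoform suffix, optional ".N" sequence version
ID_PATTERN = re.compile(
    r"^(?P<accession>[A-Z0-9]+)(?:-(?P<isoform>\d+))?(?:\.(?P<version>\d+))?$"
)


def describe(identifier):
    """Split an identifier into accession, isoform number and sequence version."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unrecognised identifier: {identifier!r}")
    isoform = match.group("isoform")
    version = match.group("version")
    return {
        "accession": match.group("accession"),
        "isoform": int(isoform) if isoform else None,
        "version": int(version) if version else None,
    }


print(describe("P06401.4"))   # sequence version 4, no isoform
print(describe("P04150-2"))   # isoform 2 (assumed notation), no version
```

Keeping the two suffixes apart avoids mistaking an updated sequence version for an alternatively spliced product.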


UniProtKB

2- Biological data curation


UniProtKB: from TrEMBL to Swiss-Prot

Manual annotation

1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)

2. Biological information (extract literature information, ortholog data propagation, protein sequence analysis…)


UniProtKB/Swiss-Prot: general annotation

• Summary of the current knowledge on a given protein.

• Maximum usage of controlled vocabulary: Keywords, Tissues, Post-translational modifications, Strains, Species, Subcellular location, Extracellular domains, Journals…

• Provides a reliable set of annotated protein entries for:
  • Reference data for systems designed to automatically transfer annotation to similar, not yet (or never) characterized sequences
  • Training of data mining tools, prediction programs


UniProtKB/Swiss-Prot gathers data from multiple sources:

- publications (literature/PubMed)
- prediction programs (Prosite, Anabelle)
- contacts with experts
- other databases
- nomenclature committees

An evidence attribution system makes it easy to trace the source of each annotation.

Extract literature information and protein sequence analysis

maximum usage of controlled vocabulary


Protein nomenclature


…enable researchers to obtain a summary of what is known about a protein…

General annotation

(Comments)

www.uniprot.org


Human protein manual annotation: some statistics (Aug 2010)


Sequence annotation

(Features)

…enable researchers to obtain a summary of what is known about a protein…

www.uniprot.org


Human protein manual annotation: some statistics

(PTM)


Non-experimental qualifiers

UniProtKB/Swiss-Prot considers both experimental and predicted data and makes a clear distinction between both.

Level. Type of evidence -> Qualifier

1st. Strong experimental evidence -> Ref.X

2nd. Light experimental evidence -> Probable

3rd. Inferred by similarity with homologous protein (data of 1st or 2nd level) -> By similarity

4th. Inferred by sequence prediction -> Potential
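The qualifier levels above can be turned into a small lookup that grades annotation text. This sketch makes two assumptions: that the qualifier appears in parentheses at the end of the annotation, as in "(By similarity)", and that an annotation without any of the three qualifiers counts as level 1. The function `evidence_level` and the sample annotations are invented for illustration.

```python
# Non-experimental qualifiers and their level, as listed in the slide.
QUALIFIER_LEVEL = {"Probable": 2, "By similarity": 3, "Potential": 4}


def evidence_level(annotation):
    """Return the evidence level (1 to 4) implied by a trailing qualifier.

    Assumption: no trailing qualifier means strong experimental evidence.
    """
    text = annotation.rstrip(". ")
    for qualifier, level in QUALIFIER_LEVEL.items():
        if text.endswith(f"({qualifier})"):
            return level
    return 1


# Made-up annotation lines.
samples = ["Nucleus.", "Cytoplasm (By similarity).", "Phosphoserine (Potential)"]
levels = [evidence_level(sample) for sample in samples]
print(levels)
```

A search for "experimentally proven" annotations then amounts to keeping only the lines graded as level 1.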


Find all the proteins localized in the cytoplasm (experimentally proven) which are phosphorylated on a serine (experimentally proven)


UniProtKB

Additional information can be found in the cross-references (to more than 140 databases)


Protein centric view of database network: DNA sequences, gene annotation, gene expression data, protein sequences, macromolecular structure data


UniProtKB/Swiss-Prot: 129 explicit links and 14 implicit links!

2D gel: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, UCD-2DPAGE, World-2DPAGE

Family and domain: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPFAM, TIGRFAMs

Organism-specific: AGD, ArachnoServer, CGD, ConoServer, CTD, CYGD, dictyBase, EchoBASE, EcoGene, euHCVdb, EuPathDB, FlyBase, GeneCards, GeneDB_Spombe, GeneFarm, GenoList, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, MaizeGDB, MGI, MIM, neXtProt, Orphanet, PharmGKB, PseudoCAP, RGD, SGD, TAIR, TubercuList, WormBase, Xenbase, ZFIN

Protein family/group: Allergome, CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB

Genome annotation: Ensembl, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants, EnsemblProtists, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase

Enzyme and pathway: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome

Other: BindingDB, DrugBank, NextBio, PMAP-CutDB

Sequence: EMBL, IPI, PIR, RefSeq, UniGene

3D structure: DisProt, HSSP, PDB, PDBsum, ProteinModelPortal, SMR

PTM: GlycoSuiteDB, PhosphoSite, PhosSite

Proteomic: PeptideAtlas, PRIDE, ProMEX

PPI: DIP, IntAct, MINT, STRING

Phylogenomic dbs: eggNOG, GeneTree, HOGENOM, HOVERGEN, InParanoid, OMA, OrthoDB, PhylomeDB, ProtClustDB

Polymorphism: dbSNP

Gene expression: ArrayExpress, Bgee, CleanEx, Genevestigator, GermOnline

Ontologies: GO


UniProtKB

Access to UniProtKB


The UniProt web site:

www.uniprot.org


The UniProt web site - www.uniprot.org

• Powerful search engine, Google-like and easy to use, that also supports very directed field searches (similar to SRS)

• Scoring mechanism presenting relevant matches first

• Entry views, search result views and downloads are customizable

• The URL of a result page reflects the query; all pages and queries are bookmarkable, supporting programmatic access

• Tools: Blast, Alignment, IDmapping, Batch retrieval (Retrieve)
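
Because the URL of a result page reflects the query, a script can assemble such URLs itself. The sketch below only illustrates the idea: the base URL and the parameter names (`query`, `format`, `columns`) are assumptions made for this example and should be checked against the UniProt help pages before use.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical base URL and parameter names; verify them against the
# UniProt documentation, since the site's URL scheme may differ.
BASE_URL = "http://www.uniprot.org/uniprot/"


def build_query_url(query, fmt="tab", columns=None):
    """Return a bookmarkable URL that encodes a text query."""
    params = {"query": query, "format": fmt}
    if columns:
        params["columns"] = ",".join(columns)
    return BASE_URL + "?" + urlencode(params)


url = build_query_url("albumin", columns=["id", "genes"])
print(url)

# The query can be read back from the URL, which is what makes it bookmarkable.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```

Keeping the query in the URL means the same string can be bookmarked, shared, or edited by hand.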

Search

A very powerful text search tool, with autocompletion and refinement options, for finding UniProt entries and documentation by biological information

UniProt query tool (www.uniprot.org): a mixture of Google and SRS

Find all human proteins with experimental evidence for their location in the nucleus

The search interface guides users with helpful suggestions and hints

Result pages: Highly customizable

Custom downloads…

Accession | Genes | Domains | Protein Existence
P02768 | ALB (GIG20) (GIG42) (PRO0903) (PRO1708) (PRO2044) (PRO2619) (PRO2675) (UNQ696/PRO1341) | Albumin domains (3) | Evidence at protein level
P02769 | ALB | Albumin domains (3) | Evidence at protein level
P02770 | Alb | Albumin domains (3) | Evidence at protein level
P07724 | Alb (Alb-1) (Alb1) | Albumin domains (3) | Evidence at protein level
P08759 | alb-A | Albumin domains (3) | Evidence at transcript level
P14872 | alb-B | Albumin domains (3) | Evidence at transcript level
P43652 | AFM (ALB2) (ALBA) | Albumin domains (3) | Evidence at protein level
P08835 | ALB | Albumin domains (3) | Evidence at protein level
P49822 | ALB | Albumin domains (3) | Evidence at protein level
P19121 | ALB | Albumin domains (3) | Evidence at protein level

Open with Excel etc.
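
A downloaded table like this can also be processed by a script instead of a spreadsheet. The sketch below assumes the download was saved as tab-separated text with a header row (an assumption for this example); the three data rows are copied from the table above.

```python
import csv
import io

# Assumption: the custom download is tab-separated text with a header row.
SAMPLE = (
    "Accession\tGenes\tDomains\tProtein Existence\n"
    "P02769\tALB\tAlbumin domains (3)\tEvidence at protein level\n"
    "P08759\talb-A\tAlbumin domains (3)\tEvidence at transcript level\n"
    "P43652\tAFM (ALB2) (ALBA)\tAlbumin domains (3)\tEvidence at protein level\n"
)


def read_rows(text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


rows = read_rows(SAMPLE)

# Keep only the entries whose existence is supported at transcript level.
transcript_only = [
    row["Accession"]
    for row in rows
    if row["Protein Existence"] == "Evidence at transcript level"
]
print(transcript_only)
```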

The URL (results) can be bookmarked and manually modified.

Blast

A tool, with the standard options, for searching sequences in the UniProt databases

Blast results: customize display

Blast: use of UniProt annotation

Amino-acid highlighting options and a feature annotation highlighting option in the local alignment

Align

A ClustalW multiple alignment tool with amino-acid highlighting options and a feature annotation highlighting option

ClustalW multiple alignment of insulin sequences

Amino-acid highlighting options and a feature annotation highlighting option in the alignment

Retrieve

A UniProt-specific tool for retrieving a list of entries in several standard formats.

You can then query your ‘personal database’ with the UniProt search tool.

Your dataset: results of a Scan Prosite

ID Mapping

Maps the identifiers of a given protein between different databases

These identifiers all point to TP53 (p53)!

P04637, NP_000537, ENSG00000141510, CCDS11118, UPI000002ED67, IPI00025087, etc.
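
Before mapping a mixed list of identifiers, it can help to guess which database each one comes from. The sketch below does this with regular expressions. The patterns are simplified shapes inferred from the examples above, not the official format specifications of each database, and the database labels are chosen for this example.

```python
import re

# Illustrative patterns only: simplified shapes inferred from the example
# identifiers, not the official format definitions.
PATTERNS = [
    ("UniProtKB AC", re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")),
    ("RefSeq protein", re.compile(r"NP_[0-9]+")),
    ("Ensembl gene", re.compile(r"ENSG[0-9]{11}")),
    ("CCDS", re.compile(r"CCDS[0-9]+")),
    ("UniParc", re.compile(r"UPI[0-9A-F]{10}")),
    ("IPI", re.compile(r"IPI[0-9]{8}")),
]


def classify(identifier):
    """Return the label of the first pattern matching the whole identifier."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    return "unknown"


tp53_ids = ["P04637", "NP_000537", "ENSG00000141510",
            "CCDS11118", "UPI000002ED67", "IPI00025087"]
labels = [classify(identifier) for identifier in tp53_ids]
for identifier, label in zip(tp53_ids, labels):
    print(identifier, "->", label)
```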

Download

Downloading UniProt http://www.uniprot.org/downloads

Complete proteome: ‘gene’-centred or all known proteins?

http://www.uniprot.org/faq/38

Remark: some peptides are not associated with the keyword ‘Complete proteome’ because they do not match the human genome

UniProt proteome sets, if downloaded in UniProt flat file or XML format, contain one sequence per UniProt record!

‘Gene’-centred

All protein sequences in UniProtKB/Swiss-Prot… Missing: other alternatively spliced protein sequences in UniProtKB/TrEMBL

Human protein manual annotation: some statistics (Aug 2010)

UniProtKB

Statistics

Swiss-Prot: 520’000 (12’000 species)

TrEMBL: 13’000’000 (130’000 species)

520’000 + 13’000’000 ≈ 13’000’000

Swiss-Prot & TrEMBL introduce a new arithmetical concept!

Redundancy in TrEMBL & redundancy between TrEMBL and Swiss-Prot

12’000 species, mainly model organisms

Not yet available

~200 new entries / day; new release every 4 weeks

- Annotation is useful, good annotation is better, update is essential !

- Some entries have gone through more than 120 versions since their integration in UniProtKB/Swiss-Prot

UniProtKB entry history

Always cite the primary accession number (AC) !

UniParc

- Non-redundant protein sequence archive, containing both active and inactive sequences (including sequences which are not in UniProtKB, e.g. immunoglobulins)

- The equivalent of ENA/GenBank/DDBJ at the protein level

- Species-merged: sequences from different species are merged when they are 100% identical over their whole length

- No annotation (only taxonomy)

- Can be searched only with database names, taxonomy, checksum (CRC64), accession numbers (ACs), or UniProtKB, UniRef and UniParc IDs

- Beware: contains wrong predictions, pseudogenes, etc.
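
The idea of a species-merged, non-redundant archive can be sketched in a few lines: records from any source or species that carry exactly the same sequence collapse into one archive entry, keyed by a checksum of the sequence. This is only a toy model. UniParc uses CRC64 checksums, whereas the sketch substitutes SHA-256 from the standard library, and all records and database names shown are made up.

```python
import hashlib
from collections import defaultdict


def checksum(sequence):
    """Stand-in checksum (SHA-256 hex digest); UniParc itself uses CRC64."""
    return hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()


def build_archive(records):
    """Group (source_db, accession, species, sequence) records by sequence."""
    archive = defaultdict(list)
    for source_db, accession, species, sequence in records:
        archive[checksum(sequence)].append((source_db, accession, species))
    return archive


# Made-up records: the first two share a sequence across sources and species.
records = [
    ("DB_A", "X0001", "species 1", "MKTAYIAKQR"),
    ("DB_B", "Y0002", "species 2", "MKTAYIAKQR"),
    ("DB_A", "X0003", "species 1", "MKTAYIAKQL"),
]
archive = build_archive(records)
for key, sources in archive.items():
    print(key[:8], sources)
```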

Query UniParc

UniRef

‘UniRef is useful for comprehensive BLAST similarity searches by providing sets of representative sequences’

«Collapsing BLAST results»

Three collections of sequence clusters from UniProtKB and selected UniParc entries:

One UniRef100 entry -> all identical sequences (identical sequences and sub-fragments are grouped in a single record) -> reduction of 12 %

One UniRef90 entry -> sequences that are at least 90 % identical -> reduction of 40 %

One UniRef50 entry -> sequences that are at least 50 % identical -> reduction of 65 %

Based on sequence identity -> Independent of the species !
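
The 100 % level can be illustrated with a toy grouping in which a sequence joins a cluster when it is identical to, or a sub-fragment of, that cluster's representative. This is a minimal sketch under that assumption, not the actual UniRef procedure; the 90 % and 50 % levels would need a proper sequence-identity measure, which is not attempted here. The sequences are made up.

```python
def cluster_identical(sequences):
    """Greedy grouping: a sequence joins the first representative containing it."""
    clusters = []  # list of (representative, members)
    # Longest sequences first, so that fragments meet their parent sequence.
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if seq in representative:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


# Made-up sequences: one exact duplicate and two sub-fragments.
sequences = [
    "MKTAYIAKQRQISFVK",
    "MKTAYIAKQRQISFVK",
    "AYIAKQRQ",
    "MSLLTEVETPIRNEW",
    "TEVETP",
]
clusters = cluster_identical(sequences)
reduction = 1 - len(clusters) / len(sequences)
print(len(clusters), "clusters, reduction of", round(100 * reduction), "%")
```

The reduction figure is simply one minus the ratio of clusters to input sequences, which is how the percentages quoted above can be read.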

UniRef90: independent of species and sequence length

UniMES

The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental protein data (only GOS data for the moment).

Download only (but included in UniParc -> Blast).

- UniMES FASTA sequences

- UniMES matches to InterPro methods

ftp.uniprot.org/pub/databases/uniprot

UniMES: sequences in FASTA format
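
Since UniMES is distributed as a download, the FASTA file has to be read locally. The following is a minimal FASTA reader, given as a sketch. The two records are invented for the example and do not reproduce real UniMES headers.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical records; real headers will look different.
SAMPLE = """>example_1 hypothetical record
MKTAYIAK
QRQISFVK
>example_2 hypothetical record
MSLLTEVETP
"""
records = list(parse_fasta(SAMPLE))
for header, sequence in records:
    print(header, len(sequence))
```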

Menu

Introduction

Nucleic acid sequence databases: ENA, GenBank, DDBJ

Protein sequence databases

UniProt databases (UniProtKB)

NCBI protein databases

NCBI protein databases

(Entrez protein, NCBI nr)

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

Major ‘general’ protein sequence database ‘sources’

UniProtKB: Swiss-Prot + TrEMBL

NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq + TPA

PIR PDB PRF

UniProtKB/Swiss-Prot: manually annotated protein sequences (12’000 species)

UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non-redundant with Swiss-Prot (300’000 species)

GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (300’000 species?)

PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB

PDB: Protein Data Bank; 3D data and associated sequences

PRF: Protein Research Foundation; journal scan of ‘published’ peptide sequences

RefSeq: Reference Sequence for DNA, RNA and protein + gene prediction + some manual annotation (11’000 species)

TPA: Third Party Annotation

Integrated resources

‘cross-references’

Resources kept separate

TPA

Query at Entrez protein

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

Typical result of a query at « Entrez protein »: entries from RefSeq, Swiss-Prot and GenPept

A Swiss-Prot entry with the NCBI look

GI number: ‘GenInfo identifier’ number

- In addition to the AC number assigned by the original database, each protein sequence in the NCBI nr database (including Swiss-Prot entries) has a GI number.

AC

GI number: ‘GenInfo identifier’ number

- If the sequence changes in any way, a new GI number will be assigned:

GI identifiers provide a mechanism for identifying the exact sequence that was used or retrieved in a given search.

- A separate GI number is assigned to each protein translation (alternative products)

- A Sequence Revision History tool is available to track the various GI numbers, version numbers, and update dates for sequences that appeared in a specific GenBank record:

http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi
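
The behaviour described above (a new GI for every change in sequence, and a revision history per record) can be mimicked with a small toy registry. This is purely illustrative and is not how NCBI implements GI assignment; the class name, the accession and the integer values are all invented for the example.

```python
import itertools


class ToyGiRegistry:
    """Toy model: each new sequence version of a record gets a fresh integer id."""

    def __init__(self, start=1):
        self._next_gi = itertools.count(start)
        self._versions = {}  # accession -> list of (gi, sequence)

    def submit(self, accession, sequence):
        """Return the GI for this sequence, assigning a new one if it changed."""
        versions = self._versions.setdefault(accession, [])
        if versions and versions[-1][1] == sequence:
            return versions[-1][0]  # unchanged sequence keeps its GI
        gi = next(self._next_gi)
        versions.append((gi, sequence))
        return gi

    def revision_history(self, accession):
        """Return the successive GI numbers recorded for an accession."""
        return [gi for gi, _ in self._versions.get(accession, [])]


registry = ToyGiRegistry()
first = registry.submit("ABC123", "MKTAYIAK")
same = registry.submit("ABC123", "MKTAYIAK")      # no change, same GI
changed = registry.submit("ABC123", "MKTAYIAR")   # one residue differs, new GI
print(first, same, changed, registry.revision_history("ABC123"))
```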

ID/AC mapping

http://www.ebi.ac.uk/Tools/picr/

GenPept

Translations of the annotated CDS in GenBank

- Contains all translated CDS annotated in GenBank/ENA/DDBJ sequences

- Equivalent to UniProtKB/TrEMBL, except that it is redundant with other databases (Swiss-Prot, RefSeq, PIR, …)

GenPept: ‘translations from all annotated coding regions (CDS) in GenBank’

RefSeq

Produced by NCBI and NLM

http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/handbook/ch18.pdf

FAQ: http://www.ncbi.nlm.nih.gov/books/NBK50679/

http://www.ncbi.nlm.nih.gov/RefSeq/

The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.

Protein – mRNA – genomic sequence

Also chromosomes, organelle genomes, plasmids, intermediate assembled genomic contigs, ncRNAs.

- Tightly linked to Entrez Gene (« interdependent curated resources »)

Example: NP_000790

KW

AC

Taxonomy

References

GenBank source and status

Annotation and ontologies

Curated records

UniProtKB vs RefSeq

UniProtKB/Swiss-Prot merges all CDS available for a given gene and describes the sequence differences

UniProtKB/Swiss-Prot P04150 (GCR_HUMAN):

RefSeq chooses one or several protein reference sequences for a given gene; it does not annotate the sequence differences.

- If there is an alternative splicing event, there will be several distinct entries for a given gene

Example: GCR_HUMAN (UniProtKB/Swiss-Prot): 1 UniProtKB entry cross-linked with 7 RefSeq entries

Protein feature annotation found in RefSeq

- Conserved domains

- Signal and mature peptides

- Propagation of a subset of features from Swiss-Prot

PTM annotation: Swiss-Prot vs RefSeq (GCR_HUMAN)

RefSeq statistics

The numbers are not comparable: RefSeq entries are per sequence, whereas UniProtKB/Swiss-Prot entries are per gene

Summary: UniProtKB vs NCBI protein

ENA/GenBank/DDBJ | RefSeq (www.ncbi.nlm.nih.gov/RefSeq/) | UniProt (www.uniprot.org)
Protein and nucleotide data | Genomic, RNA and protein data | Protein data only
Biological data added by the submitters (gene name, tissue…) | Biological data annotated by curators, also found in the corresponding Entrez Gene entry | Biological data annotated by curators (Swiss-Prot), within the entry
Not curated | Partially manually curated (‘reviewed’ entries) | Manually curated in Swiss-Prot, not in TrEMBL
Author submission | NCBI creates from existing data + gene prediction | UniProt creates from existing data
Only author can revise (except TPA) | NCBI revises as new data emerge | UniProt revises as new data emerge
Multiple records for same loci common | Single records for each molecule of major organisms | Single records for each protein from one gene of major organisms (in Swiss-Prot; TrEMBL is redundant)
Records can contradict each other | | Identification and annotation of discrepancy
No limit to species included | Limited to model organisms | Priority (but not limited) to model organisms
Data exchanged among INSDC members | NCBI database; collaboration with UniProt | UniProt database; collaboration with NCBI (RefSeq, CCDS)

PIR

PIR: the Protein Information Resource

PIR-PSD is no longer updated, but exists as an archive

PDB

• PDB (Protein Data Bank), 3D structure

• Contains the spatial coordinates of the atoms of macromolecules whose 3D structure has been obtained by X-ray or NMR studies

• Also contains the corresponding protein sequences

* The PIR-NRL3D database makes the sequence information in PDB available for similarity searches and other tools

• Includes protein sequences which are mutated, chimeric, etc. (created specifically to study the effect of a mutation on the 3D structure)

PDB: Protein Data Bank (www.rcsb.org/pdb/)

• Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA).

• Associated with specialized programs that allow the visualization of the corresponding 3D structure (e.g., SwissPDB-viewer, Chime, Rasmol).

• Currently there are ~68’000 structures for about 15’000 different proteins, but far fewer protein families (highly redundant)!

PDB: example

Coordinates of each atom

Sequence

Visualisation with Jmol

PRF

Protein Research Foundation

http://www.genome.jp/dbget-bin/www_bfind?prf

Looks for peptide sequences described in publications (and which are not submitted to databases!)

Other protein databases

Ensembl http://www.ensembl.org/

Review: http://nar.oxfordjournals.org/cgi/content/full/35/suppl_1/D610

Annotation pipeline: http://www.genome.org/cgi/content/full/14/5/942

- Ensembl aligns the genomic sequences with all the sequences found in ENA, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL (-> known genes)

- It also performs gene prediction (-> novel genes)

Ensembl = UniProtKB + RefSeq + gene prediction

- DNA, RNA and protein sequences are available for several species.

- Ensembl concentrates on vertebrate genomes, but other groups have adapted the system for use with plant, fungal and metazoan genomes.

Example of problem (derived from gene prediction pipeline)

Ensembl completes the human ‘proteome’ by predicting/annotating missing genes according to orthologous sequences..

ID   URAD_HUMAN            Unreviewed;       171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;

In primates the genes coding for the enzymes for the degradation of uric acid were inactivated and converted to pseudogenes.
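An entry like the one above is line-oriented: each line starts with a two-letter line code (ID, AC, DR, PE, ...). A minimal sketch of reading such an entry, assuming this simple layout and using only a few lines of the entry shown, could look like the following. The function name parse_flat_entry is hypothetical, and this is not an official UniProt parser.

```python
from collections import defaultdict

# A few lines from the entry shown above.
ENTRY = """\
ID   URAD_HUMAN            Unreviewed;       171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
GN   Name=PRHOXNB;
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;
"""


def parse_flat_entry(text):
    """Group the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 3:
            continue  # skip empty or malformed lines
        code, value = line[:2], line[2:].strip()
        fields[code].append(value)
    return dict(fields)


entry = parse_flat_entry(ENTRY)
accession = entry["AC"][0].rstrip(";")
# The PE line starts with the protein existence level (here 4: Predicted).
existence_level = int(entry["PE"][0].split(":")[0])
# Names of the cross-referenced databases, taken from the DR lines.
cross_refs = [dr.split(";")[0] for dr in entry["DR"]]
```

Checking the PE level is a quick way to spot entries that, like this one, exist only as predictions.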

IPI

http://www.ebi.ac.uk/IPI/IPIhelp.html

IPI: closure!

An automatic approach that builds clusters by combining knowledge already present in the primary data sources (UniProtKB, RefSeq, Ensembl) with sequence similarity.

IPI = UniProtKB + RefSeq + Ensembl (+ H-InvDB, TAIR + VEGA).

!!! Complete proteome sets include all alternative splicing sequences…

Available for human, mouse, rat, zebrafish, Arabidopsis, chicken and cow
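The clustering idea can be illustrated with a deliberately simplified sketch that groups entries from several source databases whenever their sequences are identical. This is an assumption made for illustration: the real IPI procedure also used cross-references and sequence similarity, and the identifiers and sequences below are invented.

```python
from collections import defaultdict

# Hypothetical (source database, identifier, sequence) records.
records = [
    ("UniProtKB", "entry_u1", "MKTAYIAKQR"),
    ("RefSeq", "entry_r1", "MKTAYIAKQR"),
    ("Ensembl", "entry_e1", "MKTAYIAKQR"),
    # A longer isoform-like sequence that matches nothing else.
    ("Ensembl", "entry_e2", "MKTAYIAKQRQISFVK"),
]


def cluster_by_sequence(records):
    """Put all entries that share exactly the same sequence in one cluster."""
    clusters = defaultdict(list)
    for source, identifier, sequence in records:
        clusters[sequence].append((source, identifier))
    return list(clusters.values())


clusters = cluster_by_sequence(records)
```

Here the three identical entries collapse into one cluster, while the longer sequence stays on its own, in the same way that alternative splicing sequences remain separate members of a complete proteome set.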

CCDS

http://www.ncbi.nlm.nih.gov/CCDS/

CCDS (human, mouse)

Combining different approaches (ab initio, by similarity) and taking advantage of the expertise acquired by different institutes, including manual annotation…

Consensus between 4 institutions…
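The notion of a consensus can be sketched as keeping only the coding regions that every annotation set describes with identical coordinates. The institute labels, gene labels and coordinates below are invented for illustration, and the actual CCDS process involves further quality checks and manual review.

```python
# Each hypothetical annotation set maps a gene label to its CDS,
# written as a tuple of (start, end) exon coordinates.
annotations = {
    "institute_1": {"gene_A": ((100, 200), (300, 450)), "gene_B": ((10, 90),)},
    "institute_2": {"gene_A": ((100, 200), (300, 450)), "gene_B": ((10, 95),)},
    "institute_3": {"gene_A": ((100, 200), (300, 450)), "gene_B": ((10, 90),)},
    "institute_4": {"gene_A": ((100, 200), (300, 450))},
}


def consensus_cds(annotations):
    """Return the gene -> CDS pairs annotated identically by every set."""
    per_set = [set(genes.items()) for genes in annotations.values()]
    return dict(set.intersection(*per_set))


consensus = consensus_cds(annotations)
```

Only gene_A survives: gene_B is annotated with different coordinates by one set and is missing from another, so no consensus exists for it.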

Gene Ontology (GO)

Standards: why are they so important?

• ‘The ever-increasing number of sequencing projects necessitates a standardized system (…) to ensure that the flood of information produced can be effectively utilized.’ (PMID 19577473)

• Standardization of biological data/information (data sharing and computational analysis).

• Aim: extract and compare annotation between different resources or species (semantic similarity).
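Once annotations use a shared vocabulary, comparing them becomes a set operation. As one simple possibility (an assumption for illustration, not a specific published semantic similarity measure), the overlap between two sets of terms can be scored with the Jaccard index. The protein labels and their annotation sets below are invented.

```python
def jaccard(terms_a, terms_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0  # no annotation on either side: nothing to compare
    return len(a & b) / len(a | b)


# Hypothetical annotation sets for the same protein in two resources.
resource_1 = {"hormone activity", "extracellular region", "erythrocyte differentiation"}
resource_2 = {"hormone activity", "extracellular region"}

score = jaccard(resource_1, resource_2)
```

Without standard terms, 'secreted' in one resource and 'extracellular' in another would count as a mismatch even though they describe the same thing.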

Secreted or not secreted?

PMID: 19299134

Gene Ontology (GO)

• The Gene Ontology is a controlled vocabulary, a set of standard terms (words and phrases) used for indexing and retrieving information. In addition to defining terms, GO also defines the relationships between the terms, making it a structured vocabulary. It contains ~30,000 terms.
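What 'structured vocabulary' means in practice can be sketched with a tiny graph in which each term points to its parent terms. The hierarchy below is a simplified toy, not the actual set of GO relationships, and the names PARENTS and ancestors are hypothetical.

```python
# Toy hierarchy: each term maps to its direct parents.
PARENTS = {
    "cellular component": [],
    "organelle": ["cellular component"],
    "nucleus": ["organelle"],
    "ribosome": ["organelle"],
}


def ancestors(term, parents=PARENTS):
    """Collect every term reachable by following parent links upwards."""
    found = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found


nucleus_ancestors = ancestors("nucleus")
```

Because of these relationships, a protein annotated with 'nucleus' is implicitly also annotated with every ancestor term, which is what makes searches at different levels of detail possible.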

Gene Ontology (GO) terms

• biological process: broad biological phenomena, e.g. mitosis, growth, digestion

• molecular function: molecular role, e.g. catalytic activity, binding

• cellular component: subcellular location, e.g. nucleus, ribosome, origin recognition complex
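The three categories can be used to sort a protein's annotations. The sketch below groups the example terms listed above by category, using the single-letter codes P, F and C as they appear in GO annotation files; the function name and the input list are illustrative assumptions.

```python
from collections import defaultdict

ASPECT_NAMES = {
    "P": "biological process",
    "F": "molecular function",
    "C": "cellular component",
}

# (term, aspect code) pairs built from the example terms above.
annotations = [
    ("mitosis", "P"),
    ("catalytic activity", "F"),
    ("binding", "F"),
    ("nucleus", "C"),
    ("ribosome", "C"),
]


def group_by_aspect(annotations):
    """Return a dict mapping each category name to its list of terms."""
    grouped = defaultdict(list)
    for term, code in annotations:
        grouped[ASPECT_NAMES[code]].append(term)
    return dict(grouped)


by_aspect = group_by_aspect(annotations)
```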

GO terms associated with human Erythropoietin

http://www.geneontology.org

Caveats

• Annotation is the process of assigning/mapping GO terms to gene products…

• Electronic vs Manual annotation…
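The electronic versus manual caveat can be made concrete with evidence codes: in GO annotation, IEA (Inferred from Electronic Annotation) marks computer-assigned terms, while codes such as IDA (Inferred from Direct Assay) are assigned by curators. The sketch below separates the two and lists proteins that have only electronic annotation; the protein labels and annotations are invented.

```python
ELECTRONIC_CODES = {"IEA"}

# Hypothetical (protein, GO term, evidence code) annotations.
annotations = [
    ("protein_1", "hormone activity", "IDA"),
    ("protein_1", "extracellular region", "IEA"),
    ("protein_2", "binding", "IEA"),
]


def split_by_evidence(annotations):
    """Separate manually curated annotations from electronic ones."""
    manual, electronic = [], []
    for annotation in annotations:
        if annotation[2] in ELECTRONIC_CODES:
            electronic.append(annotation)
        else:
            manual.append(annotation)
    return manual, electronic


def electronic_only(annotations):
    """Proteins whose annotations are all computer-assigned."""
    all_proteins = {protein for protein, _, _ in annotations}
    curated = {protein for protein, _, code in annotations
               if code not in ELECTRONIC_CODES}
    return all_proteins - curated


manual, electronic = split_by_evidence(annotations)
unverified = electronic_only(annotations)
```

Filtering on the evidence code before an analysis makes it explicit how much of the result rests on experimental data.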

Example with EPO

Histone H4

!!! Large scale derived data (‘proteome’)

GO terms: essential link between biological knowledge and high-throughput genomic and proteomic datasets…

PMID: 15514041

‘summary of the gene ontology classifications for all mapped ESTs…’

Human proteins functional distribution

Maybe, Potentially, Putative, Expected, Probably, Hopefully

~40% of human proteins have no known function (based on experimental data), but many more are associated with computer-assigned GO terms.

All documents (including practicals) are online

http://education.expasy.org/cours/Murcia2011/

Murcia, February, 2011