First: a big welcome to the speakers

• We have a wonderful cast of speakers coming from all over the world;

• Many of them have played, directly or indirectly, an important role in the history of Swiss-Prot;

• On behalf of the organizing committee, I thank all of them for agreeing to participate in this anniversary meeting.

Then: a big welcome to the UniProters

• Since 2000, every two years, the members of the Swiss-Prot groups at SIB and EBI have attended a «retreat» to discuss various aspects of their collaborations;

• This meeting doubles up as a very special retreat. It is an opportunity for attendees and speakers to tell us what they think we should be doing!

Antibes 2004

Welcome and thank you to all attendees

• For those coming from all over the world: we know it was not easy to come to Brazil;

• We hope that this meeting will be an opportunity for you to listen to interesting talks;

• But more importantly: meetings are essential to network and to start or pursue collaborations;

• So please enjoy and make the most of these four days that we will spend together.

And last, but not least!

a big thank you to all the sponsors

A few important last-minute items of information and reminders

• In the external pocket of the conference bag you will find many important things, including:

– The pocket guide (the program at a glance);

– The instructions for accessing the wireless Internet service;

– The ballot for the best poster award;

– Information about a survey concerning UniProtKB/Swiss-Prot.

The Swiss-Prot survey

• The Swiss-Prot annotators who are carrying out the survey have a small red sticker on their name badge;

• Those who have answered the survey will receive a small yellow sticker to put on their name badge so that they are not asked to participate over and over again!

Protein Spotlight book

• In your bag you will find a copy of «Tales from a small world»;

• It is a book containing all the Protein Spotlight articles published since 2000;

• We can offer a copy of this book to all of you thanks to Current Biodata who fully sponsored the cost of its printing.

Program changes

• Due to flight problems (Varig!) we “lost” 3 speakers: Terri Attwood, Philipp Bucher and Minoru Kanehisa;

• We will use 2 of the 3 slots for different tutorials on Swiss-Prot; the third slot will be used to get to the beach an hour earlier on Tuesday!

• Nasri Nahas's talk will be given by Ron Appel, as Nasri is busy trying to get his family out of Lebanon (last minute: they are safely back in Geneva);

• Vitek Tracz's talk will be given by Matthew Cockerill, who has overall responsibility for BioMed Central;

• Really last minute: we just learned Gunnar fell sick on the way here and has turned back to Sweden.

Speakers

• Try to end your talk 5 minutes before the end of the allotted time slot so as to leave the opportunity for a few questions;

• For each day of the conference there is a Swiss-Prot team member who is responsible for making sure we are on time and for moderating the question “session”;

• You will have in front of you a digital timer that will show you how much time is left;

• Once your time is over, it will ring and the moderator will do his or her best to expel you from the podium!

Speakers - 2

• You can use the podium microphones, which will ensure that your image is captured correctly on the camera, but you can also use a wireless lapel microphone;

• Please use the mouse, instead of the laser pointer, to point at objects in your presentation.

The SwissProt Song

Genome annotators,

with your big machines

If you didn't have Swiss-Prot,

You wouldn't find a thing

- with your big machines -

If you didn't have Swiss-Prot

you would not find a thing

Ain't no good the software

the grid and the middleware

If it's not for Swiss-Prot

You wouldn't get nowhere

- with your middleware -

If it's not for Swiss-Prot

You would not get nowhere

"Plus ça change, plus c'est la même chose…” the next 20 years

This will not be a talk on the (pre)-history of Swiss-Prot

1953: 1st sequence (bovine insulin)

1986: 4’000 sequences

2006: 3.5 million sequences

Where will it stop?

The universe in which Swiss-Prot evolves

179'000'025'042 (179 billion)

1st estimate: ~30 million species (1.5 million named)

2nd estimate: 20 million bacteria/archaea x 4'000 genes

5 million protists x 6'000 genes

3 million insects x 14'000 genes

1 million fungi x 6'000 genes

0.6 million plants x 20'000 genes

0.2 million molluscs, worms, arachnids, etc. x 20'000 genes

0.2 million vertebrates x 25'000 genes

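The calculation: $2\times10^{7}\times4000 + 5\times10^{6}\times6000 + 3\times10^{6}\times14000 + 10^{6}\times6000 + 6\times10^{5}\times20000 + 2\times10^{5}\times20000 + 2\times10^{5}\times25000 + 25000\ \text{(Craig Venter)} + 42\ \text{(Douglas Adams)}$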

Caveat: this is an estimate of the number of potential sequence entries, but not that of the number of distinct protein entities in the biosphere.
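
The arithmetic behind the 179 billion figure can be checked with a few lines of Python. This is only a sketch that re-adds the species counts and genes-per-genome figures quoted above; the names `ESTIMATES`, `EXTRAS` and `potential_sequence_entries` are invented for this illustration.

```python
# Species counts and genes-per-genome figures as quoted on the slide.
# The dictionary and function names are hypothetical, chosen for this sketch.
ESTIMATES = {
    "bacteria/archaea": (20_000_000, 4_000),
    "protists": (5_000_000, 6_000),
    "insects": (3_000_000, 14_000),
    "fungi": (1_000_000, 6_000),
    "plants": (600_000, 20_000),
    "molluscs, worms, arachnids, etc.": (200_000, 20_000),
    "vertebrates": (200_000, 25_000),
}

# The two tongue-in-cheek extra terms of the calculation.
EXTRAS = {"Craig Venter": 25_000, "Douglas Adams": 42}


def potential_sequence_entries(estimates=ESTIMATES, extras=EXTRAS):
    """Return (per-group products, grand total) of species x genes."""
    per_group = {
        name: species * genes for name, (species, genes) in estimates.items()
    }
    total = sum(per_group.values()) + sum(extras.values())
    return per_group, total


per_group, total = potential_sequence_entries()
for name, count in per_group.items():
    print(f"{name:35s} {count:>18,d}")
print(f"{'total (with extras)':35s} {total:>18,d}")
```

The species counts in the second estimate add up to 30 million, which agrees with the first estimate, and the grand total reproduces 179'000'025'042.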

When will UniProtKB be complete?

• Swiss-Prot:

– In July 2009: 500’000 entries;

– In 2013: 1 million entries;

– In 2026 (40th anniversary): 10 million entries;

– In 2036 (50th anniversary): 100 million entries.

• TrEMBL:

– In May 2080 TrEMBL will have reached 10 billion entries;

– We cannot get Excel to compute when we will reach 179 billion entries;

– But we are confident these dates are worthless as new sequencing techniques will have made all of these projections a very futile exercise!
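
To see what growth rate the Swiss-Prot milestones above imply, one can compute the doubling time between each consecutive pair. This is a rough sketch assuming exponential growth between milestones; it is not the Excel model behind the slide. Treating "July 2009" as the decimal year 2009.5 is an assumption, and the names `MILESTONES` and `implied_doubling_times` are invented for this illustration.

```python
import math

# Swiss-Prot milestones quoted on the slide as (decimal year, entries).
# Treating "July 2009" as 2009.5 is an assumption made for this sketch.
MILESTONES = [
    (2009.5, 500_000),
    (2013.0, 1_000_000),
    (2026.0, 10_000_000),
    (2036.0, 100_000_000),
]


def implied_doubling_times(milestones):
    """Doubling time in years implied by each consecutive pair of milestones,
    assuming exponential growth between them."""
    times = []
    for (y0, n0), (y1, n1) in zip(milestones, milestones[1:]):
        times.append((y1 - y0) * math.log(2) / math.log(n1 / n0))
    return times


doubling_times = implied_doubling_times(MILESTONES)
for (y0, _), (y1, _), t in zip(MILESTONES, MILESTONES[1:], doubling_times):
    print(f"{y0:.1f} -> {y1:.1f}: doubling every {t:.2f} years")
```

Under these assumptions the implied doubling times are roughly 3.5, 3.9 and 3.0 years, so the projected dates do not follow a single constant growth rate.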

Sequences…

• The bread of Swiss-Prot. And yes: annotations are the butter!

• >99% of the protein sequences originate from translation of mRNA or genomic sequences;

• Do we still need manual intervention to cater for sequences or can we just build smart filters to obtain those we want from TrEMBL?

So what is the current status?

• A snapshot of the situation:

– 28’200 entries with 82’000 sequence conflicts;

– 2’600 entries with corrected frameshifts;

– 15’100 entries with corrected initiation sites;

– 4’300 entries with other sequence ‘problems’.

• At least 43’000 entries (19% of Swiss-Prot) required a minimal amount of curation effort so as to obtain the “correct” sequence.

Quality of protein information from genome projects

• Let's look at proteins originating from 3 different genome projects:

– Drosophila: the example of what a curated (thanks to FlyBase) genome effort should look like: only 1.8% of the gene models conflict with what we have in Swiss-Prot;

– Arabidopsis: a typical example of a genome where a lot of work was spent on annotation at the time it was sequenced, but where nothing has been done since (at least in public view): 19.5% of the gene models are erroneous;

– Tetraodon nigroviridis: the typical example of a quick and dirty automatic run through a genome with no manual intervention: >90% of the gene models produce incorrect proteins.

Human sequence entries as an example

• We have about 14’500 human entries in Swiss-Prot:

– 4’300 entries contain information about 8’000 splice variants;

– 4’600 entries contain information about 27’000 sequence variants;

– 7’500 entries contain information about 22’000 sequence conflicts;

– On average each human entry is produced by merging together sequence information from 6.2 different nucleotide sequence entries.

Take home message

• Producing a clean set of sequences is not a trivial task;

• It is not getting easier as more and more types of sequence data get submitted;

• It is important to pursue our efforts to make sure we provide our users with the most correct set of sequences for a given organism.

Post-translational modifications (PTMs)

• Although sequences are important, they are generally not fully representative of the final ‘biological entity’: most proteins are the target of PTMs;

• PTMs are important at various levels, including the 3D structure, interactions, subcellular location and also the function;

• The story of the integration of PTMs in Swiss-Prot consists of 3 distinct parts;

• 1st part: a long time ago in a distant proteogalaxy:

FT MOD_RES 86 86 GAMMA-CARBOXYGLUTAMIC ACID.

FT MOD_RES 110 110 HYDROXYLATION.

FT CARBOHYD 203 203 POSSIBLE.
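
The three old-style feature lines above carry a feature key, a start and end position, and a free-text description. The sketch below shows one way to read them into records. It assumes whitespace-separated fields exactly as printed here and purely numeric positions, which is a simplification of the real flat-file layout; `Feature` and `parse_ft_line` are hypothetical names used only for this illustration.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    key: str          # feature key, e.g. MOD_RES or CARBOHYD
    start: int        # first residue position
    end: int          # last residue position
    description: str  # free-text description without the trailing period


def parse_ft_line(line: str) -> Feature:
    """Parse one FT line, assuming whitespace-separated fields and
    purely numeric positions (an assumption of this sketch)."""
    parts = line.strip().split(None, 4)
    if len(parts) < 4 or parts[0] != "FT":
        raise ValueError(f"not a feature line: {line!r}")
    description = parts[4].rstrip(".") if len(parts) == 5 else ""
    return Feature(parts[1], int(parts[2]), int(parts[3]), description)


lines = [
    "FT MOD_RES 86 86 GAMMA-CARBOXYGLUTAMIC ACID.",
    "FT MOD_RES 110 110 HYDROXYLATION.",
    "FT CARBOHYD 203 203 POSSIBLE.",
]
features = [parse_ft_line(line) for line in lines]
for feature in features:
    print(feature)
```

Once parsed, the free-text descriptions ("HYDROXYLATION", "POSSIBLE") make it plain why a controlled vocabulary was needed: nothing in these records says which residue is modified or how reliable the annotation is.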

The 2nd phase: 2000 to 2005

• Complete overhaul and significant extension of a controlled vocabulary for PTMs;

• Creation of a PTM annotation program within the Swiss-Prot groups at SIB and EBI;

• Development of new tools (Sulfinator, DGPI) for the prediction of some PTMs;

• Massive clean up and re-annotation of many classes of PTMs.

The expanding world of PTMs

• We now have 283 different PTM descriptions (excluding processing, disulfide bonds and glycosylation events).

The new document listing post-translational modifications

It contains many items of information and is available in HTML format or by FTP in tab-delimited format.

Finally LSEs for PTMs!

• Finally «Proteoman» has arrived! And PTM information can now be obtained from results of proteomics large scale experiments (LSE);

• In the past 12 months we have added about 6’000 experimental PTMs using data originating from some of these projects.

But LSEs are not so easy to deal with

• Mundane issues in incorporating LSE PTM data:

– Quality:

• Trying to assess whether the methodology really allows the detection of in vivo modifications;

• How many false positives are expected (this information is often absent or very well hidden!);

– Accessing the data:

• Often in supplementary material tables and in a variety of formats (HTML tables, Excel spreadsheets, etc.);

• With a variety of identifiers (UniProtKB, NCBI gi, pID, etc.);

– Sanity checking:

• Making sure that the right sequence position is modified;

• Does it make sense in the biological context?

– Propagating the information to orthologs.

• So the big issue is how we will be able to scale up and deal with the expected increase in the number of such projects!

Cross-references: then

• The ‘DR’ lines were introduced in release 4 in April 1987; they first linked Swiss-Prot to EMBL, PDB and PIR;

• They were instrumental in the development of SRS by Thure Etzold in the early 90’s;

• And also for ExPASy, the first web server in the life sciences in 1993.

UniProtKB/Swiss-Prot explicit links

2D-gel databases: ANU-2DPAGE, Aarhus/Ghent-2DPAGE, COMPLUYEAST-2DPAGE, ECO2DBASE, HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE

Family and domain databases: Gene3D, HAMAP, InterPro, PANTHER, PIRSF, Pfam, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs

Organism-specific gene databases: AGD, DictyBase, EchoBASE, EcoGene, FlyBase, GeneDB_Spombe, GeneFarm, Gramene, HGNC, H-InvDB, HIV, LegioList, Leproma, ListiList, MaizeDB, MGI, MIM, MypuList, PhotoList, RGD, SagaList, SGD, StyGene, SubtiList, TAIR, TubercuList, WormBase, WormPep, ZFIN

Enzyme and pathway databases: BioCyc, Reactome

Miscellaneous: dbSNP, GO, IntAct, LinkHub, RZPD-ProtExp

Protein family/group databases: PptaseDB, GermOnline, MEROPS, REBASE, TRANSFAC

Sequence databases: EMBL, PIR, UniGene

3D structure databases: HSSP, PDB, SMR

PTM databases: GlycoSuiteDB, PhosSite

Genome annotation databases: Ensembl, GenomeReviews, TIGR

Cross-references: now

• There are now cross-references from Swiss-Prot to 74 different databases (6 more are in the pipeline);

• Almost 3 million DR lines: an average of 12 per entry;

• Many other links to external resources are also available through the OX (NCBI taxonomy), RX (PubMed, DOI), CC («Web resource» topic) and FT lines (dbSNP);

• Cross-references are not only a means to help navigate between resources; they sometimes add information to the entries.

Examples of cross-references that provide information

• The cross-references to the Gene Ontology (GO):

DR GO; GO:0005634; C:nucleus; ISS.

DR GO; GO:0005515; F:protein binding; IPI.

DR GO; GO:0007165; P:signal transduction; TAS.

• The PDB cross-references include information on the mapping of the structure on the sequence:

DR PDB; 1QQG; X-ray; A/B=4-267.

• The cross-references to domain databases include information on the name/acronyms of the domains and the number of occurrences of these domains:

DR PROSITE; PS50026; EGF_3; 2.

DR PROSITE; PS50092; TSP1; 3.

DR PROSITE; PS01208; VWFC_1; 1.
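
The extra information carried by these DR lines is easy to pull out programmatically. The sketch below assumes the semicolon-separated layout exactly as printed in the examples above (database name, primary identifier, then further fields, with a closing period); `parse_dr_line` and the variable names are hypothetical and chosen only for this illustration.

```python
def parse_dr_line(line):
    """Split a DR line into (database, primary identifier, other fields).

    Assumes the layout shown in the examples: fields separated by ';'
    and a single period closing the line.
    """
    tag, _, rest = line.strip().partition(" ")
    if tag != "DR":
        raise ValueError(f"not a cross-reference line: {line!r}")
    rest = rest.strip()
    if rest.endswith("."):
        rest = rest[:-1]  # drop only the closing period
    fields = [field.strip() for field in rest.split(";")]
    if len(fields) < 2:
        raise ValueError(f"too few fields in: {line!r}")
    return fields[0], fields[1], fields[2:]


dr_lines = [
    "DR GO; GO:0005634; C:nucleus; ISS.",
    "DR GO; GO:0005515; F:protein binding; IPI.",
    "DR GO; GO:0007165; P:signal transduction; TAS.",
    "DR PDB; 1QQG; X-ray; A/B=4-267.",
    "DR PROSITE; PS50026; EGF_3; 2.",
    "DR PROSITE; PS50092; TSP1; 3.",
    "DR PROSITE; PS01208; VWFC_1; 1.",
]

go_terms = {}        # GO identifier -> (aspect letter, term name, evidence code)
domain_counts = {}   # PROSITE entry name -> number of occurrences

for line in dr_lines:
    database, identifier, extra = parse_dr_line(line)
    if database == "GO":
        aspect, _, term = extra[0].partition(":")
        go_terms[identifier] = (aspect, term, extra[1])
    elif database == "PROSITE":
        domain_counts[extra[0]] = int(extra[1])

print(go_terms)
print(domain_counts)
```

In this form the cross-references answer questions directly, such as how many TSP1 domains the protein contains or which GO terms are backed by which evidence code, without following the link to the external database.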

From sequences to structures..and back!

• Efficient bidirectional links between UniProtKB and PDB/MSD are very important;

• Currently 10’000 Swiss-Prot entries are linked to 30’200 PDB entries;

• These links are constantly updated and verified; the converse is unfortunately not yet true;

• We have always made use of 3D structure information to help in the annotation process;

• But we are only now starting to systematically mine 3D structures to extract various information such as disulfide bonds, metal-binding sites, active sites, etc.

So what is the future of cross-references?

• Will we really need hard-coded cross-references in the future?

• Can we gradually replace some of them by computed «on the fly» links using referenceable objects?

• Will we make more use of client-server systems such as the distributed annotation system (DAS)?

• The answer is obviously dependent on standardization;

• But the Life Sciences are still living in the dark ages of the tower of Babel

CVs and ontologies

• Since the very beginning of Swiss-Prot we have been building a growing set of controlled vocabularies («ontologies»);

• Species, strains, plasmids, journals, tissues, PTMs, domain names and, of course, keywords are all «under control» (see posters SP117 and SP120);

• We are very well advanced in the process of having a CV for pathways (see the UniPathway poster; SP140);

• We are now tackling the problems of protein and gene names (see poster SP118). But this is of course not very easy!

Do we need annotations?

• Annotators spend a big part of their time capturing and synthesizing a huge amount of «functional» information;

• For example we populate Swiss-Prot with data relevant to the:

– Role and function of the proteins;

– Subcellular location;

– Interactions (binary and “complex”);

– Tissue specificity, developmental stage;

– Involvement in diseases.

• We have much «anecdotal» evidence that users find this very important and that this is one of the important hallmarks of Swiss-Prot. Yet is this really true?

Do we need annotations? – part 2

• This is a time consuming process and we will never be complete and up-to-date;

• Many users want quick answers that are easy to «summarize», yet the more detailed an entry becomes, the harder it is to transform it into a summarizable entity;

• We are often the victims of the «FASTA format syndrome»: users expect everything important about a protein to be available in the header of a FASTA format entry!

• So should we continue?

Yes we need annotation!

• Because (among many other reasons):

– Automated annotation is the only way to transfer knowledge from a model organism to a less studied one;

– To apply such techniques safely one needs template entries that are representative of the state of the knowledge;

– While literature mining tools could be conceived as a way to automatically build a summary view of the knowledge around a given protein, these techniques are not yet powerful enough to create a coherent synthetic view;

– Literature mining tools also require the existence of well annotated (corpus) entries.

From pull to push…

• For more than 20 years now we have been «pulling» information and knowledge from various sources, but mainly from the literature;

• It is now time to make sure that the next 20 years are defined by researchers «pushing» their results, and the interpretation of their results, into the knowledgebase.

• An attempt to get the community to directly submit information on the proteins that they are studying;

• Using a Wikipedia-type model/interface;

• Will first be «field-tested» in the yeast community;

• We are hopeful, yet realistic: only a small percentage of life science researchers will take the time, and are altruistic enough, to fully participate in such a scheme.

Grey grey matter counts!

• Many life scientists who know the molecular world and are computer-proficient are reaching retirement age;

• Some want to continue to play a role in the advancement of research, yet they will not be able to do lab work anymore;

• We should offer them the tools necessary for them to contribute to the annotation process.

Anabelle and Asterix

• Two important tools could contribute to the democratization of Swiss-Prot style annotation:

– Anabelle: a web based protein sequence analysis platform;

– Asterix: the new Swiss-Prot editor.

Anabelle selection module

Viewer layout:

Link to entry NiceProt view

Blast (full) entry

Blast uncharted region

Link to most similar entry NiceProt view

Align most similar entry with entry

more links!...

Links…

Link to InterPro

Link to domain original database

And here is what the user gets back

But what about the rest of the life scientists?

• We saw how we could get the parents («adopt a protein») and the grandparents («grey matter counts») involved, but what about the children…

• …the young researchers, those who are active in producing new knowledge?

Two carrots, a stick and lots of education!

• The carrots:

– Making sure that granting agencies look favorably on the involvement of researchers in the process of submitting information to databases;

– The same criteria should be considered by any hiring or promotion committee;

• The stick: getting journal editors to refuse to publish a paper if the results have not been submitted to the relevant knowledge resources;

Education!

• Everyone should feel concerned;

• Awareness of the content and usage of knowledge resources is a prerequisite for doing any type of «serious» research in the field of molecular life sciences;

• Organizations such as EMBNet, EBI, SIB, NCBI, NIG should continue and strengthen their «outreach» efforts;

• We (database providers) should do more in terms of providing tutorials (on-line and on-site).

An important issue…

• The process of developing a data resource for the Life Sciences is akin to the work of medieval copyists, Renaissance encyclopedists or the 19th-century development of the OED: it is a very tedious, manually intensive, long term job…

How to get funding for knowledge infrastructures in the life sciences?

• Funding knowledge resources is difficult:

– It’s a very long term process;

– It’s not prestigious;

– And it’s not cheap!

And it’s not only databases that are endangered!

Service groups are also at risk

Proposition for a new tax

• Each grant proposal for a high throughput data-producing project would be obliged to set aside a predefined percentage of the grant money to help cover the cost of storing and managing the produced data;

• How this money would be redistributed is not trivial to define and even less to implement;

• The priority would be to use this tax as a financial tool to help fund the data repositories.

The tax for Biomolecular data archival

The 6 observations of a « databaser »

1. Your task will be much more complex and far bigger than you ever thought it could be;

2. If your database is successful and useful to the user community, then you will have to dedicate all your efforts to develop it for a much longer period of time than you would have thought possible;

3. You will always wonder why life scientists abhor complying with nomenclature guidelines or standardization efforts that would simplify your life and theirs;

4. You will have to continually fight to obtain a minimal amount of funding;

5. As with any service effort, you will be told far more often what you do wrong than what you do right;

6. But when you see how useful your efforts are to your users, all the above drawbacks will lose their importance!!

Aiala, Alain x4, Alan x4, Alastair, Alex x2, Alexander x2, Alexandre x2, Alice, Alistair, Allyson, Alvis, Amanda, Ana Tereza, Anastasia, Andre x3, Andrea, Andreas, Andrew, Angela, Anne x4, Anne-Lise, Anthony, Antoine, Anulka, Arnaud x2, Arthur, Astrid, Athel, Barbara x2, Barend, Baris, Barry, Bart, Bastien, Bengt, Bernard x2, Bernd, Bernhard x2, Bill, Bob, Brigitte, Bruno x2, Burkhard, Carl, Carola, Carolyn, Catherine x4, Cathy x2, Cecile x2, Cecilia, Cedric, Cesare, Chantal x3, Charles x2, Chris, Chrissie, Christian x3, Christiane, Christine x2, Christoph, Christophe, Christopher, Christos, Claude, Claudia x2, Claudine, Colin, Colombe, Corinne, Cristiano, Damien, Dan, Dana, Daniel x3, Daniela, Danielle, Darcy, Darren, Dave x2, David x5, Delphine, Denis x2, Dennis, Des, Dietmar, Dolnide, Dominique, Doron, Dorothy, Doug, Duncan, Eddie, Edgar, Edouard, Eleanor, Elisabeth x2, Elmar, Elvis, Emily, Emmanuel, Eric x3, Erik, Ernest, Ernst, Esther, Eugene x2, Eva, Eve, Evelyn, Evgenia, Evgeny, Ewan, Fabrice, Fiona, Flavio, Florence x3, Fotis, Francis, Frank, François x3, Frederic, Frederique x2, Gabriel, Gabriella, Ganesh, Gaston, Geoff, Gerry, Gert, Ghislaine, Gilbert, Gill, Goran, Gottfried, Graham x2, Greg, Gregoire, Guido, Guillaume, Gunnar, Guy x2, Guy-Olivier, Hanah, Heidi, Henning, Hien, Hilde, Holger, Hongzhan, Howard, Hsing-Kuo, Ian, Iirit, Ilkka, Ioannis, Irving, Isabelle x2, Ivan x2, Ivo, Jack, Jacques x2, Jaime, Janet x2, Jean-Charles, Jean-François, Jean-Jacques, Jean-Michel, Jean-Pierre x2, Jeffrey, Jenny, Jerome, Jim, Jingchu, Joachim, Joanna, Joel, John x7, Jonas, Jonathan x2, Jorja, Jos, Juan, Juergen, Julia, Julio, Julius, Kai, Karin, Karine, Kate, Kati, Katja, Katsumi, Kay, Keiichi, Keith x3, Ken x2, Kenta, Khaled, Kirill, Kirsty, Kristian, Larry, Laure, Laurent x3, Lee, Leigh, Leon, Li, Lina, Lionel, Lisa x2, Livia, Lorenza, Lorenzo, Louise, Luca, Luciane, Lucien, Luisa x2, Luiz, Lydie x2, Ma'ayan, Madelaine, Maggie, Mahesh, Manolo, Manuel x2, Manuela, 
Marc x6, Marcia, Marco, Margaret x2, Mari Trini, Maria, Maria Esperanza, Maria-Jesus, Marie-Claude, Marilyn, Marisa, Mark x2, Martin x2, Martine, Marvin, Mary, Massimo, Matteo, Matthew, Mauricio, Michael x7, Michel x3, Michele, Michelle, Miguel, Mike x2, Minna, Minoru, Monica, Monika, Morido, Nabil, Nadeem, Nadine x2, Naruya, Nasri, Natalia, Nathalie, Neil x2, Nicky, Nicola, Nicolas x3, Nicole x3, Nicoletta, Nicolle, Nikos, Nina, Oliver, Olivier x4, Orna, Owen, Paolo, Pascal, Pat, Patricia x6, Patrick x5, Paul, Paula, Pavel, Pedro, Peer, Peter x7, Petra, Phil x2, Philip, Philippe x3, Pierre, Pierre-Alain, Pieter, Piotr, Rachael, Raffaella, Rainer, Raja, Rasko, Raton laveur, Rebecca x2, Rein, Reinhard x2, Remi, Reto, Reynaldo, Rich, Richard, Robert x2, Roberto, Robin, Rodger, Rodrigo, Roland, Ron, Rosita, Ross, Roy, Russ x2, Ruth x3, Saeid, Salvo, Samia, Samuel x2, Sandor, Sandra x2, Sandrine, Sarah, Scott, Sebastien x2, Serenella, Sergio, Severine x2, Shigehaki, Shmuel, Shoko, Shoshana, Shyamala, Silvia x2, Sineaid, Siv, Sona, Soren, Sorogini, Steffen x2, Steffi, Stephanie x2, Steve, Steven, Stuart x2, Stylianos, Sunil, Sylvain, Sylvie x2, Takashi, Tamara, Tammera, Tania x2, Temple, Terri, Terry, Thomas x3, Thure, Tim x2, Timothy, Toby, Tom, Toni, Torsten, Ujwal, Ulrich, Ursula, Valeria, Vassilios, Veronique, Vicente, Victor x2, Vincent, Vinnei, Violaine, Virginie x2, Vitaliano, Vitek, Vivien x2, Vivienne, Wanessa, Wei mun, Weimin, William, Williams, Willy, Winona, Winston, Witek, Wolfgang, Xavier x2, Yasmin, Yasuhiro, Yongxing, Yoshio, Youla, Young-Ki, Zeev, Zhang-Zhi.
