THE SWISS-PROT PROTEIN SEQUENCE DATABASE
USER MANUAL

Release 39, May 2000

Amos Bairoch
Swiss Institute of Bioinformatics (SIB)
Centre Medical Universitaire (CMU)
1, rue Michel Servet
1211 Geneva 4
Switzerland

Telephone: +41-22-702 58 60
Fax: +41-22-702 58 58
Electronic mail address: [email protected]
WWW server: http://www.expasy.ch/

Rolf Apweiler
The EMBL Outstation - The European Bioinformatics Institute (EBI)
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom

Telephone: +44-1223-494 444
Fax: +44-1223-494 468
Electronic mail address: [email protected]
WWW server: http://www.ebi.ac.uk/


Acknowledgements

This release of SWISS-PROT has been prepared by:

• Amos Bairoch, Kristian Axelsen, Marie-Claude Blatter-Garin, Brigitte Boeckmann, Silvia Braconi Quintaje, Florence Brunner, Danielle Coral, Sylvie Dethiollaz, Livia Famiglietti, Nathalie Farriol-Mathis, Serenella Ferro, Elisabeth Gasteiger, Alain Gateau, Vivienne Gerritsen, Arnaud Gos, Nadine Gruaz-Gumowski, Chantal Hulo, Nicolas Hulo, Janet James, Silvia Jimenez, Eva Jung, Corinne Lachaize, Karine Michoud, Madelaine Moinat, Bruno Pardo, Catherine Rivoire, Bernd Roechert, Claudia Sapsezian, Christian Sigrist, Shyamala Sundaram, Anne-Lise Veuthey, Julia Williams-Nef and Nadine Zangger at the Swiss Institute of Bioinformatics (SIB) and the Medical Biochemistry Department of the University of Geneva;

• Rolf Apweiler, Kirsty Bates, Sergio Contrino, Kirill Degtyarenko, Wolfgang Fleischmann, Gill Fraser, Cathy Gedman, Henning Hermjakob, Vivien Junker, Alexander Kanapin, Youla Karavidopoulou, Paul Kersey, Evguenia Kriventseva, Fiona Lang, Minna Lehvaslaiho, Michele Magrane, Maria Jesus Martin, Nicoletta Mitaritonna, Virginie Mittard, Steffen Moeller, Nicola Mulder, Claire O'Donovan, Tom Oinn, John O'Rourke, Isabelle Phan, Sandrine Pilbout, Lucia Rodriguez-Monge, Margaret Shore-Nye, Eleanor Whitfield, Allyson Williams and Evgueni Zdobnov at the European Bioinformatics Institute (EBI).

SWISS-PROT contains sequences translated from the EMBL Nucleotide Sequence Database, prepared by the European Bioinformatics Institute. For a recent reference see: Baker W., van den Broek A., Camon E., Hingamp P., Sterk P., Stoesser G. and Tuli M.A.; Nucleic Acids Res. 28:19-23(2000).

A small part of the information in SWISS-PROT was originally adapted from information contained in the Protein Sequence Database of the Protein Information Resource (PIR). For a recent reference see: Barker W.C., Garavelli J.S., McGarvey P.B., Orcutt B.C., Srinivasarao G.Y., Xiao C., Yeh L.-S.L, Ledley R.S., Janda J.F., Pfeiffer F., Mewes H.-W., Tsugita A. and Wu C.; Nucleic Acids Res. 28:41-44(2000).

Cross-references are made in SWISS-PROT to:

• The PROSITE dictionary of sites and patterns in proteins prepared by Amos Bairoch and Philipp Bucher at the Swiss Institute of Bioinformatics.
  Reference: Hofmann K., Bucher P., Falquet L. and Bairoch A.; Nucleic Acids Res. 27:215-219(1999).

• The Pfam database of protein domains prepared under the supervision of Richard Durbin and Sean Eddy.
  Reference: Bateman A., Birney E., Durbin R., Eddy S.R., Howe K.L. and Sonnhammer E.L.L.; Nucleic Acids Res. 28:263-266(2000).

• The PRINTS database of protein fingerprints prepared under the supervision of Terri Attwood at the University of Manchester.
  Reference: Attwood T.K., Croning M.D.R., Flower D.R., Lewis A.P., Mabey J.E., Scordis P., Selley J.N. and Wright W.; Nucleic Acids Res. 28:225-227(2000).

• The 3D macromolecular structure Protein Data Bank (PDB) prepared by Research Collaboratory for Structural Bioinformatics (RCSB).
  Reference: Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N. and Bourne P.E.; Nucleic Acids Res. 28:235-242(2000).

• The database of Homology-derived Secondary Structure of Proteins (HSSP) compiled at the European Bioinformatics Institute (EBI).
  Reference: Holm L. and Sander C.; Nucleic Acids Res. 27:244-247(1999).


• The DictyDb database prepared by Douglas W. Smith and Bill Loomis from the University of California, San Diego (UCSD).
  Reference: Smith D.W. and Loomis W.F.; Proceedings of the International Dictyostelium Conference '96.

• The Drosophila genome database (FlyBase) prepared under the supervision of Michael Ashburner at the Department of Genetics, University of Cambridge.
  Reference: Nucleic Acids Res. 27:85-88(1999).

• The EcoGene Escherichia coli K12 and StyGene Salmonella typhimurium LT2 genome databases, both prepared by Ken Rudd at the Department of Biochemistry and Molecular Biology of the University of Miami School of Medicine.
  Reference: Rudd K.E.; Nucleic Acids Res. 28:60-64(2000).

• The Maize genome database (MaizeDB) developed by the USDA-ARS Maize Genome Project as part of the National Agricultural Library's Plant Genome Research Program.

• The Online Mendelian Inheritance in Man database (OMIM) prepared under the supervision of Victor McKusick at Johns Hopkins University.
  Reference: Hamosh A., Scott A.F., Amberger J., Valle D. and McKusick V.A.; Hum. Mutat. 15:57-61(2000).

• The Mouse Genome Database (MGD) prepared by the Mouse Genome Informatics group at Jackson Laboratory.
  Reference: Blake J.A., Eppig J.T., Richardson J.E. and Davisson M.T.; Nucleic Acids Res. 28:108-111(2000).

• The Saccharomyces Genome Database (SGD) prepared under the supervision of Mike Cherry at Stanford.
  Reference: Ball C.A., Dolinski K., Dwight S.S., Harris M.A., Issel-Tarver L., Kasarskis A., Scafe C.R., Sherlock G., Binkley G., Jin H., Kaloper M., Orr S.D., Schroeder M., Weng S., Zhu Y., Botstein D. and Cherry J.M.; Nucleic Acids Res. 28:77-80(2000).

• The SubtiList relational database for the Bacillus subtilis 168 genome prepared under the supervision of Ivan Moszer at the Pasteur Institute.
  Reference: Moszer I.; FEBS Lett. 430:28-36(1998).

• The TubercuList relational database for the Mycobacterium tuberculosis H37Rv genome prepared under the supervision of Stewart Cole at the Pasteur Institute.

• The WormPep database prepared by Richard Durbin and Erik Sonnhammer from the MRC Laboratory of Molecular Biology and Sanger Center at Hinxton Hall, Cambridge.
  Reference: Sonnhammer E.L.L. and Durbin R.; Genomics 46:200-216(1997).

• The Zebrafish Information Network (ZFIN) database prepared by the Institute of Neuroscience of the University of Oregon.
  Reference: Westerfield M., Doerry E., Kirkpatrick A.E. and Douglas S.A.; Meth. Cell Biol. 60:339-355(1999).


• The G-protein-coupled receptor database (GCRDb) prepared by Lee Frank Kolakowski at the Department of Pharmacology of the University of Texas, San Antonio.
  Reference: Kolakowski L.F. Jr.; Receptors Channels 2:1-7(1994).

• The restriction enzymes database (REBASE) prepared by Richard Roberts and Dana Macelis at New England BioLabs.
  Reference: Roberts R.J. and Macelis D.; Nucleic Acids Res. 28:306-307(2000).

• The transcription factor database (TRANSFAC) developed under the supervision of Edgar Wingender from the Gesellschaft fuer Biotechnologische Forschung mbH in Braunschweig.
  Reference: Wingender E., Chen X., Hehl R., Karas H., Liebich I., Matys V., Meinhardt T., Pruess M., Reuter I. and Schacherer F.; Nucleic Acids Res. 28:316-319(2000).

• The Encyclopedia of Escherichia coli genes and metabolism (EcoCyc) prepared under the supervision of Peter Karp at Pangea Systems and Monica Riley at MBL.
  Reference: Karp P.D., Riley M., Saier M., Paulsen I.T., Paley S.M. and Pellegrini-Toole A.; Nucleic Acids Res. 28:56-59(2000).

• The 2D-gel protein database (SWISS-2DPAGE) prepared under the responsibility of Denis Hochstrasser and Ron Appel at the Swiss Institute of Bioinformatics.
  Reference: Hoogland C., Sanchez J.-C., Tonella L., Binz P.-A., Bairoch A., Hochstrasser D.F. and Appel R.D.; Nucleic Acids Res. 28:286-288(2000).

• The gene-protein database of Escherichia coli K12 (2D-gel spots) (ECO2DBASE) prepared under the supervision of Ruth VanBogelen.
  Reference: VanBogelen R.A., Abshire K.Z., Moldover B., Olson E.R. and Neidhardt F.C.; Electrophoresis 18:1243-1251(1997).

• The Harefield Hospital 2D gel protein databases prepared under the supervision of Mike Dunn.
  Reference: Corbett J.M., Wheeler C.H., Baker C.S., Yacoub M.H. and Dunn M.J.; Electrophoresis 15:1459-1465(1994).

• The human keratinocyte 2D-gel protein database from the universities of Aarhus and Ghent.
  Reference: Celis J.E., Rasmussen H.H., Olsen E., Madsen P., Leffers H., Honore B., Dejgaard K., Gromov P., Hoffmann H.J., Nielsen M., Vassiliev A., Vintermyr O., Hao J., Celis A., Basse B., Lauridsen J.B., Ratz G.P., Andersen A.H., Walbum E., Kjaergaard I., Puype M., Van Damme J. and Vandekerckhove J.; Electrophoresis 14:1091-1198(1993).

• The Yeast Electrophoresis Protein Database (YEPD) prepared under the supervision of Jim Garrels from Proteome Inc.
  Reference: Payne W.E. and Garrels J.I.; Nucleic Acids Res. 25:57-62(1997).

• The Human Retroviruses and AIDS compilation of nucleic and amino acid sequences (HIV Sequence Database) edited by G. Myers, A.B. Rabson, S.F. Josephs, T.F. Smith, J.A. Berzofsky, F. Wong-Staal; published by the Theoretical Biology and Biophysics Group T-10 at Los Alamos National Laboratory; and funded by the AIDS program of the National Institute of Allergy and Infectious Diseases through an interagency agreement with the United States Department of Energy.


Copyright notice

SWISS-PROT is copyright. It is produced through a collaboration between the Swiss Institute of Bioinformatics and the EMBL Outstation - the European Bioinformatics Institute. There are no restrictions on its use by non-profit institutions as long as its content is in no way modified. Usage by and for commercial entities requires a license agreement. For information about the licensing scheme see: http://www.isb-sib.ch/announce/ or send an email to [email protected].

The above copyright notice also applies to this user manual as well as to any other SWISS-PROT documents.

How to submit data or updates/corrections to SWISS-PROT

To submit new sequence data to SWISS-PROT, and for all queries regarding submissions to SWISS-PROT, one should contact:

SWISS-PROT
The EMBL Outstation - The European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom

Telephone: (+44 1223) 494 462
Telefax: (+44 1223) 494 468
E-mail: [email protected] (for submissions); [email protected] (for queries)

To submit updates and/or corrections to SWISS-PROT you can either use the E-mail address: [email protected] or the WWW address:

http://www.expasy.ch/sprot/sp_update_form.html

Citation

If you want to cite SWISS-PROT in a publication, please use the following reference:

Bairoch A. and Apweiler R.
The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000.
Nucleic Acids Res. 28:45-48(2000).


Table of contents

1) What is SWISS-PROT?
2) Conventions used in the database
   2.1 General structure of the database
   2.2 Classes of data
   2.3 Structure of a sequence entry
3) The different line types
   3.1 The ID line
   3.2 The AC line
   3.3 The DT line
   3.4 The DE line
   3.5 The GN line
   3.6 The OS line
   3.7 The OG line
   3.8 The OC line
   3.9 The reference (RN, RP, RC, RX, RA, RT, RL) lines
   3.10 The CC line
   3.11 The DR line
   3.12 The KW line
   3.13 The FT line
   3.14 The SQ line
   3.15 The sequence data line
   3.16 The // line
Appendix A: Feature table keys
   A.1 Change indicators
   A.2 Amino acid modifications
   A.3 Regions
   A.4 Secondary structure
   A.5 Others
Appendix B: Amino acid codes
Appendix C: Format differences between the SWISS-PROT and EMBL databases
   C.1 Generalities
   C.2 Differences in line types present in both databases
   C.3 Line types defined by SWISS-PROT but currently not used by EMBL
   C.4 Line types defined by EMBL but currently not used by SWISS-PROT


(1). What is SWISS-PROT?

SWISS-PROT is an annotated protein sequence database. It was established in 1986 and has been maintained collaboratively, since 1987, by the group of Amos Bairoch, first at the Department of Medical Biochemistry of the University of Geneva and now at the Swiss Institute of Bioinformatics (SIB), and the EMBL Data Library (now the EMBL Outstation - The European Bioinformatics Institute (EBI)). The SWISS-PROT protein sequence database consists of sequence entries. Sequence entries are composed of different line types, each with their own format. For standardization purposes the format of SWISS-PROT follows as closely as possible that of the EMBL Nucleotide Sequence Database.

The SWISS-PROT database distinguishes itself from other protein sequence databases by four distinct criteria:

a) Annotation

In SWISS-PROT, as in most other sequence databases, two classes of data can be distinguished: the core data and the annotation.

For each sequence entry the core data consists of:

• The sequence data;
• The citation information (bibliographical references);
• The taxonomic data (description of the biological source of the protein).

The annotation consists of the description of the following items:

• Function(s) of the protein;
• Post-translational modification(s). For example carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.;
• Domains and sites. For example calcium binding regions, ATP-binding sites, zinc fingers, homeoboxes, SH2 and SH3 domains, kringle, etc.;
• Secondary structure. For example alpha helix, beta sheet, etc.;
• Quaternary structure. For example homodimer, heterotrimer, etc.;
• Similarities to other proteins;
• Disease(s) associated with deficiency(ies) in the protein;
• Sequence conflicts, variants, etc.

We try to include as much annotation information as possible in SWISS-PROT. To obtain this information we use, in addition to the publications that report new sequence data, review articles to periodically update the annotations of families or groups of proteins. We also make use of external experts, who have been recruited to send us their comments and updates concerning specific groups of proteins.

We believe that our having systematic recourse both to publications other than those reporting the core data and to subject referees represents a unique and beneficial feature of SWISS-PROT.


In SWISS-PROT, annotation is mainly found in the comment lines (CC), in the feature table (FT) and in the keyword lines (KW). Most comments are classified by ‘topics’; this approach permits the easy retrieval of specific categories of data from the database.

b) Minimal redundancy

Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In SWISS-PROT we try as much as possible to merge all these data so as to minimize the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.

c) Integration with other databases

It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures) as well as with specialized data collections. SWISS-PROT is currently cross-referenced with about 30 different databases. Cross-references are provided in the form of pointers to information related to SWISS-PROT entries and found in data collections other than SWISS-PROT. This extensive network of cross-references allows SWISS-PROT to play a major role as a focal point of biomolecular database interconnectivity.

d) Documentation

SWISS-PROT is distributed with a large number of index files and specialized documentation files. Some of these files have been available for a long time (this user manual, the release notes, the various indices for authors, citations, keywords, etc.), but many have been created recently and we are continuously adding new files. The release notes contain an up-to-date descriptive list of all distributed document files.


(2). Conventions used in the database

The following sections describe the general conventions used in SWISS-PROT to achieve uniformity of presentation. Experienced users of the EMBL Database can skip these sections and directly refer to Appendix C, which lists the minor differences in format between the two data collections.

(2.1). General structure of the database

The SWISS-PROT protein sequence database is composed of sequence entries. Each entry corresponds to a single contiguous sequence as contributed to the bank or reported in the literature. In some cases, entries have been assembled from several papers that report overlapping sequence regions. Conversely, a single paper can provide data for several entries, e.g. when related sequences from different organisms are reported.

References to positions within a sequence are made using sequential numbering, beginning with 1 at the N-terminal end of the sequence.
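As an illustration of this numbering convention, the following minimal Python sketch extracts a region from a sequence string given 1-based, inclusive positions. The function name `subsequence` is hypothetical and not part of any SWISS-PROT software; the residues used in the usage note are the first ten of the sample entry shown in section 2.3.

```python
def subsequence(sequence: str, start: int, end: int) -> str:
    """Return residues start..end (both inclusive), counted from 1.

    Position 1 is the N-terminal residue, as in SWISS-PROT entries.
    Python strings are 0-based with an exclusive end, hence the shift
    of the start index only.
    """
    if not 1 <= start <= end <= len(sequence):
        raise ValueError(f"invalid range {start}-{end} for length {len(sequence)}")
    return sequence[start - 1:end]
```

For example, `subsequence("MRNSYRFLAS", 1, 3)` selects the first three residues, and a range whose end exceeds the sequence length raises `ValueError` instead of being silently truncated.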

Except for initiator N-terminal methionine residues, which are not included in a sequence when their absence from the mature sequence has been proven, the sequence data correspond to the precursor form of a protein before post-translational modifications and processing.

(2.2). Classes of data

To make data available to users as quickly as possible after publication, SWISS-PROT is now distributed with a supplement called TrEMBL, where entries are released before all their details are finalized. To distinguish between fully annotated entries and those in TrEMBL, the ‘class’ of each entry is indicated on the first (ID) line of the entry. The two defined classes are:

STANDARD      Data which are complete to the standards laid down by the SWISS-PROT database.

PRELIMINARY   Sequence entries which have not yet been annotated by the SWISS-PROT staff up to the standards laid down by SWISS-PROT. These entries are exclusively found in TrEMBL.

(2.3). Structure of a sequence entry

The entries in the SWISS-PROT database are structured so as to be usable by human readers as well as by computer programs. The explanations, descriptions, classifications and other comments are in ordinary English. Wherever possible, symbols familiar to biochemists, protein chemists and molecular biologists are used.


Each sequence entry is composed of lines. Different types of lines, each with their own format, are used to record the various data that make up the entry. A sample sequence entry is shown below.

ID   GRAA_HUMAN     STANDARD;      PRT;   262 AA.
AC   P12544;
DT   01-OCT-1989 (Rel. 12, Created)
DT   01-OCT-1989 (Rel. 12, Last sequence update)
DT   15-DEC-1998 (Rel. 37, Last annotation update)
DE   GRANZYME A PRECURSOR (EC 3.4.21.78) (CYTOTOXIC T-LYMPHOCYTE PROTEINASE
DE   1) (HANUKKAH FACTOR) (H FACTOR) (HF) (GRANZYME 1) (CTL TRYPTASE)
DE   (FRAGMENTIN 1).
GN   GZMA OR CTLA3 OR HFSP.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   TISSUE=T-cell;
RX   MEDLINE; 88125000.
RA   Gershenfeld H.K., Hershberger R.J., Shows T.B., Weissman I.L.;
RT   "Cloning and chromosomal assignment of a human cDNA encoding a T
RT   cell- and natural killer cell-specific trypsin-like serine
RT   protease.";
RL   Proc. Natl. Acad. Sci. U.S.A. 85:1184-1188(1988).
RN   [2]
RP   SEQUENCE OF 29-53.
RX   MEDLINE; 88330824.
RA   Poe M., Bennett C.D., Biddison W.E., Blake J.T., Norton G.P.,
RA   Rodkey J.A., Sigal N.H., Turner R.V., Wu J.K., Zweerink H.J.;
RT   "Human cytotoxic lymphocyte tryptase. Its purification from granules
RT   and the characterization of inhibitor and substrate specificity.";
RL   J. Biol. Chem. 263:13215-13222(1988).
RN   [3]
RP   SEQUENCE OF 29-40, AND CHARACTERIZATION.
RX   MEDLINE; 89009866.
RA   Hameed A., Lowrey D.M., Lichtenheld M., Podack E.R.;
RT   "Characterization of three serine esterases isolated from human IL-2
RT   activated killer cells.";
RL   J. Immunol. 141:3142-3147(1988).
RN   [4]
RP   SEQUENCE OF 29-39, AND CHARACTERIZATION.
RX   MEDLINE; 89035468.
RA   Kraehenbuhl O., Rey C., Jenne D.E., Lanzavecchia A., Groscurth P.,
RA   Carrel S., Tschopp J.;
RT   "Characterization of granzymes A and B isolated from granules of
RT   cloned human cytotoxic T lymphocytes.";
RL   J. Immunol. 141:3471-3477(1988).
RN   [5]
RP   3D-STRUCTURE MODELING.
RX   MEDLINE; 89184501.
RA   Murphy M.E.P., Moult J., Bleackley R.C., Gershenfeld H.,
RA   Weissman I.L., James M.N.G.;
RT   "Comparative molecular model building of two serine proteinases from
RT   cytotoxic T lymphocytes.";
RL   Proteins 4:190-204(1988).
CC   -!- FUNCTION: THIS ENZYME IS NECESSARY FOR TARGET CELL LYSIS IN CELL-
CC       MEDIATED IMMUNE RESPONSES. IT CLEAVES AFTER LYS OR ARG. MAY BE
CC       INVOLVED IN APOPTOSIS.
CC   -!- CATALYTIC ACTIVITY: HYDROLYSIS OF PROTEINS, INCLUDING FIBRONECTIN,
CC       TYPE IV COLLAGEN AND NUCLEOLIN. PREFERENTIAL CLEAVAGE: ARG-|-XAA,
CC       LYS-|-XAA >> PHE-|-XAA IN SMALL MOLECULE SUBSTRATES.
CC   -!- SUBUNIT: HOMODIMER, DISULFIDE-LINKED.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC GRANULES.
CC   -!- SIMILARITY: BELONGS TO PEPTIDASE FAMILY S1; ALSO KNOWN AS THE
CC       TRYPSIN FAMILY. STRONGEST TO OTHER GRANZYMES AND TO MAST CELL
CC       PROTEASES.
CC   --------------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
CC   between the Swiss Institute of Bioinformatics and the EMBL outstation -
CC   the European Bioinformatics Institute. There are no restrictions on its
CC   use by non-profit institutions as long as its content is in no way
CC   modified and this statement is not removed. Usage by and for commercial
CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC   or send an email to [email protected]).
CC   --------------------------------------------------------------------------
DR   EMBL; M18737; AAA52647.1; -.
DR   PIR; A28943; A28943.
DR   PIR; A30525; A30525.
DR   PIR; A30526; A30526.
DR   PIR; A31372; A31372.
DR   PDB; 1HF1; 15-OCT-94.
DR   MIM; 140050; -.
DR   INTERPRO; IPR001254; -.
DR   PFAM; PF00089; trypsin; 1.
DR   PROSITE; PS00134; TRYPSIN_HIS; 1.
DR   PROSITE; PS00135; TRYPSIN_SER; 1.
KW   Hydrolase; Serine protease; Zymogen; Signal; T-cell; Cytolysis;
KW   Apoptosis; 3D-structure.
FT   SIGNAL        1     26
FT   PROPEP       27     28       ACTIVATION PEPTIDE.
FT   CHAIN        29    262       GRANZYME A.
FT   ACT_SITE     69     69       CHARGE RELAY SYSTEM (BY SIMILARITY).
FT   ACT_SITE    114    114       CHARGE RELAY SYSTEM (BY SIMILARITY).
FT   ACT_SITE    212    212       CHARGE RELAY SYSTEM (BY SIMILARITY).
FT   DISULFID     54     70       BY SIMILARITY.
FT   DISULFID    148    218       BY SIMILARITY.
FT   DISULFID    179    197       BY SIMILARITY.
FT   DISULFID    208    234       BY SIMILARITY.
FT   CARBOHYD    170    170       N-LINKED (GLCNAC...) (POTENTIAL).
SQ   SEQUENCE   262 AA;  28968 MW;  DA87363A0D92BAF4 CRC64;
     MRNSYRFLAS SLSVVVSLLL IPEDVCEKII GGNEVTPHSR PYMVLLSLDR KTICAGALIA
     KDWVLTAAHC NLNKRSQVIL GAHSITREEP TKQIMLVKKE FPYPCYDPAT REGDLKLLQL
     TEKAKINKYV TILHLPKKGD DVKPGTMCQV AGWGRTHNSA SWSDTLREVN ITIIDRKVCN
     DRNHYNFNPV IGMNMVCAGS LRGGRDSCNG DSGSPLLCEG VFRGVTSFGL ENKCGDPRGP
     GVYILLSKKH LNWIIMTIKG AV
//


Each line begins with a two-character line code, which indicates the type of data contained in the line. The current line types and line codes and the order in which they appear in an entry, are shown in the table below.

Line code  Content                       Occurrence in an entry
ID         Identification                Once; starts the entry
AC         Accession number(s)           One or more
DT         Date                          Three times
DE         Description                   One or more
GN         Gene name(s)                  Optional
OS         Organism species              One or more
OG         Organelle                     Optional
OC         Organism classification       One or more
RN         Reference number              One or more
RP         Reference position            One or more
RC         Reference comment(s)          Optional
RX         Reference cross-reference(s)  Optional
RA         Reference authors             One or more
RT         Reference title               Optional
RL         Reference location            One or more
CC         Comments or notes             Optional
DR         Database cross-references     Optional
KW         Keywords                      Optional
FT         Feature table data            Optional
SQ         Sequence header               Once
(blanks)   sequence data                 One or more
//         Termination line              Once; ends the entry

As shown in the above table, some line types are found in all entries, others are optional. Some line types occur many times in a single entry. Each entry must begin with an identification line (ID) and end with a terminator line (//).
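The occurrence rules in the table lend themselves to a simple consistency check. The Python sketch below is only an illustration: the names `REQUIRED` and `check_line_codes` are hypothetical, the limits are transcribed from the table above, and optional line types are deliberately left unchecked. It takes the sequence of line codes of one entry (with `"  "` standing for sequence data lines) and returns a list of problems found.

```python
from collections import Counter
from typing import Iterable, List

# (minimum, maximum) occurrences per line code, transcribed from the table;
# None means "no upper limit". Optional line types are not listed.
REQUIRED = {
    "ID": (1, 1), "AC": (1, None), "DT": (3, 3), "DE": (1, None),
    "OS": (1, None), "OC": (1, None), "RN": (1, None), "RP": (1, None),
    "RA": (1, None), "RL": (1, None), "SQ": (1, 1),
}


def check_line_codes(codes: Iterable[str]) -> List[str]:
    """Return human-readable problems for one entry's line codes."""
    codes = list(codes)
    problems = []
    if not codes or codes[0] != "ID":
        problems.append("entry does not start with an ID line")
    if not codes or codes[-1] != "//":
        problems.append("entry does not end with a // line")
    counts = Counter(codes)
    for code, (low, high) in REQUIRED.items():
        found = counts[code]
        if found < low or (high is not None and found > high):
            problems.append(f"{code}: found {found} line(s)")
    return problems
```

An empty list means that the mandatory line types are present in acceptable numbers; the check does not verify the order of the lines in between, nor the content of any line.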

A detailed description of each line type is given in the next section of this document.

It must be noted that, with the exception of GN, all SWISS-PROT line types exist in the EMBL Database. A description of the format differences between the SWISS-PROT and EMBL databases is given in Appendix C of this document.

The two-character line type code that begins each line is always followed by three blanks, so that the actual information begins with the sixth character. Information is not extended beyond character position 75, with one exception: CC lines that contain the ‘DATABASE’ topic (see section 3.10).
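The fixed layout described above (two-character code, three blanks, data from the sixth character) can be read with plain string slicing. The following Python sketch is an assumption-laden illustration rather than a reference parser: `split_line`, `iter_entries` and `SAMPLE_LINES` are hypothetical names, and the sample is a heavily abridged version of the GRAA_HUMAN entry shown earlier. Sequence data lines come out with a code of two blanks, and an unterminated trailing entry is silently dropped.

```python
from typing import Iterable, Iterator, List, Tuple

Line = Tuple[str, str]


def split_line(line: str) -> Line:
    """Split one flat-file line into (line code, data).

    Characters 1-2 hold the code, characters 3-5 are blank, and the
    data start at character 6 (index 5 in Python).
    """
    line = line.rstrip("\r\n")
    return line[:2], line[5:]


def iter_entries(lines: Iterable[str]) -> Iterator[List[Line]]:
    """Group lines into entries, each one ending with its // line."""
    entry: List[Line] = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between entries
        code, data = split_line(line)
        entry.append((code, data))
        if code == "//":
            yield entry
            entry = []


# Abridged sample: only a few lines of the entry shown in this manual.
SAMPLE_LINES = [
    "ID   GRAA_HUMAN     STANDARD;      PRT;   262 AA.",
    "AC   P12544;",
    "SQ   SEQUENCE   262 AA;  28968 MW;  DA87363A0D92BAF4 CRC64;",
    "     MRNSYRFLAS SLSVVVSLLL IPEDVCEKII",
    "//",
]
```

Calling `list(iter_entries(SAMPLE_LINES))` yields a single entry whose first element is the ID line and whose last element is the terminator. Working with (code, data) pairs keeps later steps, such as joining continuation lines of the same type, independent of column positions.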


(3). The different line types

(3.1). The ID line

The ID (IDentification) line is always the first line of an entry. The general form of the ID line is:

ID   ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.

(3.1.1). Entry Name

The first item on the ID line is the entry name of the sequence. This name is a useful means of identifying a sequence. The entry name consists of up to ten uppercase alphanumeric characters.

SWISS-PROT uses a general purpose naming convention that can be symbolized as X_Y, where:

• X is a mnemonic code of at most 4 alphanumeric characters representing the protein name. Examples: B2MG is for Beta-2-microglobulin, HBA is for Hemoglobin alpha chain and INS is for Insulin;

• The ‘_’ sign serves as a separator;

• Y is a mnemonic species identification code of at most 5 alphanumeric characters representing the biological source of the protein. This code is generally made of the first three letters of the genus and the first two letters of the species. Examples: PSEPU is for Pseudomonas putida and NAJNI is for Naja nivea.

However, for species most commonly encountered in the database, self-explanatory codes are used. There are 16 of those codes. They are: BOVIN for Bovine, CHICK for Chicken, ECOLI for Escherichia coli, HORSE for Horse, HUMAN for Human, MAIZE for Maize (Zea mays), MOUSE for Mouse, PEA for Garden pea (Pisum sativum), PIG for Pig, RABIT for Rabbit, RAT for Rat, SHEEP for Sheep, SOYBN for Soybean (Glycine max), TOBAC for Common tobacco (Nicotiana tabacum), WHEAT for Wheat (Triticum aestivum), YEAST for Baker's yeast (Saccharomyces cerevisiae).

As it was not possible to apply the above rules to viruses, they were given arbitrary, but generally easy to remember, identification codes. In some cases it was not possible to assign a definitive code to a species. In these cases a temporary code was chosen.

Examples of complete protein sequence entry names are: RL1_ECOLI for ribosomal protein L1 from Escherichia coli, FER_HALHA for ferredoxin from Halobacterium halobium.

The names of all the presently defined species identification codes are listed in the SWISS-PROT document file SPECLIST.TXT.
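The X_Y convention can be expressed as a short pattern. The Python sketch below is illustrative only: `split_entry_name` and `default_species_code` are hypothetical helpers, and the second one applies just the general rule (three letters of the genus plus two of the species). It knows nothing about the 16 self-explanatory codes, the virus codes or temporary codes, for which SPECLIST.TXT remains the authoritative source.

```python
import re
from typing import Tuple

# X: 1-4 uppercase alphanumerics, "_" separator, Y: 1-5 uppercase alphanumerics.
ENTRY_NAME = re.compile(r"([A-Z0-9]{1,4})_([A-Z0-9]{1,5})")


def split_entry_name(name: str) -> Tuple[str, str]:
    """Split an entry name into (protein code, species code)."""
    match = ENTRY_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid entry name: {name!r}")
    return match.group(1), match.group(2)


def default_species_code(genus: str, species: str) -> str:
    """General rule only: first 3 letters of genus + first 2 of species."""
    return (genus[:3] + species[:2]).upper()
```

For instance, `split_entry_name("RL1_ECOLI")` separates the protein mnemonic from the species code, and `default_species_code("Naja", "nivea")` reproduces the NAJNI example given above.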

(3.1.2). Data class

The second item on the ID line indicates the data class of the entry (see section 2.2).


(3.1.3). Molecule type

The third item on the ID line is a three-letter code that indicates the type of molecule of the entry: in SWISS-PROT it is ‘PRT’ (for PRoTein).

(3.1.4). Length of the molecule

The fourth and last item of the ID line is the length of the molecule, which is the total number of amino acids in the sequence. This number includes the positions reported to be present but which have not been determined (coded as ‘X’). The length is followed by the letter code ‘AA’ (Amino Acids).

(3.1.5). Examples of identification lines

Two examples of ID lines are shown below:

ID   CYC_BOVIN      STANDARD;      PRT;   104 AA.
ID   GIA2_GIALA     STANDARD;      PRT;   296 AA.
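The four items of the ID line can be pulled apart with a regular expression. This Python sketch is a hedged illustration: `IdLine` and `parse_id_line` are hypothetical names, and the pattern assumes only what is stated above (the ‘ID’ code followed by three blanks, semicolons after the data class and molecule type, and a length followed by ‘AA.’), while accepting any amount of white space between items.

```python
import re
from typing import NamedTuple


class IdLine(NamedTuple):
    entry_name: str
    data_class: str
    molecule_type: str
    length: int


ID_LINE = re.compile(r"ID   (\S+)\s+(\w+);\s+(\w+);\s+(\d+) AA\.")


def parse_id_line(line: str) -> IdLine:
    """Parse an ID line into its four items."""
    match = ID_LINE.fullmatch(line.rstrip())
    if match is None:
        raise ValueError(f"not a valid ID line: {line!r}")
    name, data_class, molecule, length = match.groups()
    return IdLine(name, data_class, molecule, int(length))
```

Applied to the first example line above, the function returns the entry name CYC_BOVIN, the data class STANDARD, the molecule type PRT and the length 104 as an integer.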

(3.2). The AC line

The AC (ACcession number) line lists the accession number(s) associated with an entry.

The format of the AC line is:

AC   AC_number_1;[ AC_number_2;]...[ AC_number_N;]

An example of an accession number line is shown below:

AC   P00321; P05348;

Semicolons separate the accession numbers and a semicolon terminates the list. If necessary, more than one AC line can be used. Example:

AC   Q16653; Q14855; Q13054; Q13055; Q92891; Q92892; Q92893; Q92894;
AC   Q92895; Q93053; Q99605; O00713; O00714; O00715;

The purpose of accession numbers is to provide a stable way of identifying entries from release to release. It is sometimes necessary for reasons of consistency to change the names of the entries, for example, to ensure that related entries have similar names. However, an accession number is always conserved, and therefore allows unambiguous citation of SWISS-PROT entries.

Researchers who wish to cite entries in their publications should always cite the first accession number. This is commonly referred to as the ‘primary accession number’.
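Collecting the accession numbers of an entry amounts to splitting the data part of every AC line on semicolons. The sketch below is a minimal illustration in Python; `parse_ac_lines` is a hypothetical name, and the function assumes it is given only the AC lines of a single entry, in their original order.

```python
from typing import Iterable, List


def parse_ac_lines(lines: Iterable[str]) -> List[str]:
    """Return all accession numbers of an entry, primary one first."""
    accessions: List[str] = []
    for line in lines:
        if not line.startswith("AC   "):
            raise ValueError(f"not an AC line: {line!r}")
        # Data start at the sixth character; a semicolon ends every number.
        for token in line[5:].split(";"):
            token = token.strip()
            if token:
                accessions.append(token)
    return accessions
```

The first element of the returned list is the primary accession number; for the two-line example above that is Q16653, followed by thirteen secondary numbers.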

Entries will have more than one accession number if they have been merged or split. For example, when two entries are merged into one, the accession numbers from both entries are stored in the AC line(s).

If an existing entry is split into two or more entries (a rare occurrence), the original accession numbers are retained in all the derived entries and a new primary accession number is added to all the entries.

An accession number is dropped only when the data to which it was assigned have been completely removed from the database. Deleted accession numbers are listed in the SWISS-PROT document file DELETEAC.TXT.

Accession numbers consist of 6 alphanumerical characters in the following format:

1          2        3             4             5             6
[O,P,Q]    [0-9]    [A-Z, 0-9]    [A-Z, 0-9]    [A-Z, 0-9]    [0-9]

Here are some examples of valid accession numbers: P12345, Q1AAA9, O456A1 and P4A123.
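The six-position format above translates directly into a regular expression. The following Python sketch (the names `ACCESSION_RE` and `is_valid_accession` are ours) checks only the syntax of an accession number, not whether it exists in the database.

```python
import re

# Position 1: O, P or Q; position 2: digit; positions 3-5: letter or digit;
# position 6: digit.
ACCESSION_RE = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def is_valid_accession(text: str) -> bool:
    """Return True if text has the syntax of a SWISS-PROT accession number."""
    return ACCESSION_RE.fullmatch(text) is not None
```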

(3.3). The DT line

The DT (DaTe) lines show the date of creation and last modification of the database entry.

The format of the DT line is:

DT DD-MMM-YYYY (Rel. XX, Comment)

Where ‘DD’ is the day, ‘MMM’ the month, ‘YYYY’ the year, and ‘XX’ the SWISS-PROT release number. The comment portion of the line indicates the action taken on that date. There are always three DT lines in each entry, each associated with a specific comment:

• The first DT line indicates when the entry first appeared in the database. The associated comment is ‘Created’;

• The second DT line indicates when the sequence data was last modified. The associated comment is ‘Last sequence update’;

• The third DT line indicates when data (see the note below) other than the sequence was last modified. The associated comment is ‘Last annotation update’.

Example of a block of DT lines:

DT   01-AUG-1988 (Rel. 08, Created)
DT   01-JAN-1990 (Rel. 13, Last sequence update)
DT   15-APR-1999 (Rel. 38, Last annotation update)
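A DT line can be decomposed into its date, release number and comment with a regular expression. This is a minimal sketch under the assumption that lines follow the format shown above exactly; `MONTHS`, `DT_RE` and `parse_dt_line` are illustrative names, and the month table is spelled out to avoid depending on the locale.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

DT_RE = re.compile(r"DT\s+(\d{2})-([A-Z]{3})-(\d{4}) \(Rel\. (\d+), (.+)\)")


def parse_dt_line(line: str):
    """Return (date, release_number, comment) for one DT line."""
    match = DT_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a DT line: {line!r}")
    day, month, year, release, comment = match.groups()
    return date(int(year), MONTHS[month], int(day)), int(release), comment
```

Applied to the three lines of the example block, the comments returned are ‘Created’, ‘Last sequence update’ and ‘Last annotation update’, in that order.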

Concerning the third DT line, one should note that such a line is not updated when an entry is the target of what we call a ‘global change’. A global change is defined as any operation which involves changes in all or most SWISS-PROT entries. These changes are announced in the release notes and are usually linked to formatting issues. As such global changes take place at almost every release, we strongly advise users of the database to completely reload SWISS-PROT at each release cycle.

(3.4). The DE line

The DE (DEscription) lines contain general descriptive information about the sequence stored. This information is generally sufficient to identify the protein precisely.

The format of the DE line is:

DE   DESCRIPTION.

The description is given in ordinary English (using US spelling) and is free-format.

In cases where more than one DE line is required, the text is only divided between words and only the last DE line is terminated by a period.

The description always starts with the proposed ‘official name’ of the protein. Synonyms are indicated between parentheses. Example:

DE   ANNEXIN V (LIPOCORTIN V) (ENDONEXIN II) (CALPHOBINDIN I) (CBP-I)
DE   (PLACENTAL ANTICOAGULANT PROTEIN I) (PAP-I) (PP4) (THROMBOPLASTIN
DE   INHIBITOR) (VASCULAR ANTICOAGULANT-ALPHA) (VAC-ALPHA) (ANCHORIN CII).

When a protein is known to be cleaved into multiple functional components, the description will start with the name of the precursor protein, followed by a section delimited by ‘[CONTAINS: ...]’. All the individual components are listed in that section and are separated by semicolons (‘;’). Synonyms are allowed at the level of the precursor and for each individual component. Example:

DE   CORTICOTROPIN-LIPOTROPIN PRECURSOR (PRO-OPIOMELANOCORTIN) (POMC)
DE   [CONTAINS: NPP; MELANOTROPIN GAMMA (GAMMA-MSH); CORTICOTROPIN
DE   (ADRENOCORTICOTROPIC HORMONE) (ACTH); MELANOTROPIN ALPHA (ALPHA-MSH);
DE   CORTICOTROPIN-LIKE INTERMEDIARY PEPTIDE (CLIP); LIPOTROPIN BETA (BETA-
DE   LPH); LIPOTROPIN GAMMA (GAMMA-LPH); MELANOTROPIN BETA (BETA-MSH);
DE   BETA-ENDORPHIN; MET-ENKEPHALIN].

When a protein is known to include multiple domains, each of which is described by a different name, the description will start with the name of the overall protein, followed by a section delimited by ‘[INCLUDES: ...]’. All the domains are listed in that section and are separated by semicolons (‘;’). Synonyms are allowed at the level of the protein and for each individual domain. Example:

DE   CAD PROTEIN (RUDIMENTARY PROTEIN) [INCLUDES: GLUTAMINE-DEPENDENT
DE   CARBAMOYL-PHOSPHATE SYNTHASE (EC 6.3.5.5); ASPARTATE
DE   CARBAMOYLTRANSFERASE (EC 2.1.3.2); DIHYDROOROTASE (EC 3.5.2.3)].

When the complete sequence was not determined, the last information given on the DE lines will be ‘(FRAGMENT)’ or ‘(FRAGMENTS)’. Example:

DE   LYSOPINE DEHYDROGENASE (EC 1.5.1.16) (OCTOPINE SYNTHASE)
DE   (LYSOPINE SYNTHASE) (FRAGMENT).
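The conventions of the DE line can be sketched in code as follows. This is only an illustration with hypothetical names (`parse_de_lines` and its dictionary keys): it joins the DE lines, detects the fragment flag, sets aside any ‘[CONTAINS: ...]’ or ‘[INCLUDES: ...]’ section, and returns the first name together with the parenthesized items that follow it. It assumes that parentheses are not nested, and it makes no attempt to distinguish synonyms from EC numbers.

```python
import re


def parse_de_lines(lines):
    """Return the name, parenthesized items and fragment flag of a description."""
    text = " ".join(line[2:].strip() for line in lines if line.startswith("DE"))
    text = text.rstrip(".")  # only the last DE line ends with a period
    is_fragment = re.search(r"\(FRAGMENTS?\)$", text) is not None
    text = re.sub(r"\s*\(FRAGMENTS?\)$", "", text)
    # Ignore the section listing cleavage products or domains, if any.
    main = re.sub(r"\s*\[(?:CONTAINS|INCLUDES):.*?\]", "", text)
    name, _, rest = main.partition(" (")
    # partition() consumed the first '(', so it is restored before matching.
    items = re.findall(r"\(([^()]*)\)", "(" + rest) if rest else []
    return {"name": name, "synonyms": items, "fragment": is_fragment}
```

For the lysopine dehydrogenase example above, the sketch returns the name ‘LYSOPINE DEHYDROGENASE’, three parenthesized items, and a fragment flag set to true.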

(3.5). The GN line

The GN (Gene Name) line contains the name(s) of the gene(s) that code for the stored protein sequence.

The format of the GN line is:

GN   NAME1[ AND|OR NAME2...].

Examples:

GN   ALB.
GN   REX-1.

It often occurs that more than one gene name has been assigned to an individual locus. In that case all the synonyms will be listed. The word ‘OR’ separates the different designations. The first name in the list is assumed to be the most correct (or most current) designation. Example:

GN   HNS OR DRDX OR OSMZ OR BGLY.

In a few cases, multiple genes code for an identical protein sequence. In that case all the different gene names will be listed. The word ‘AND’ separates the designations. Example:

GN   CECA1 AND CECA2.

In very rare cases ‘AND’ and ‘OR’ can both be present. In that case parentheses are used as shown in the following example:

GN   GVPA AND (GVPB OR GVPA2).
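The ‘AND’/‘OR’ conventions can be turned into a simple nested list: one inner list per gene, containing that gene's synonyms. The sketch below uses a hypothetical function name and assumes that ‘OR’ groups are nested inside ‘AND’, as in all the examples above; other nestings, or gene names that themselves contain parentheses, would need a real parser.

```python
def parse_gn_line(line: str):
    """Return a list of genes; each gene is a list of synonymous names."""
    text = line[2:].strip().rstrip(".")
    genes = []
    for group in text.split(" AND "):
        # Parentheses only delimit an OR group, so they can be dropped here.
        group = group.strip().strip("()")
        genes.append([name.strip() for name in group.split(" OR ")])
    return genes
```

With this representation, the first name of each inner list is the designation assumed to be the most current.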

(3.6). The OS line

The OS (Organism Species) line specifies the organism(s) which was the source of the stored sequence. In the rare case where all the species information will not fit on a single line, more than one OS line is used. The last OS line is terminated by a period.

The species designation consists, in most cases, of the Latin genus and species designation followed by the English name (in parentheses). For viruses, only the common English name is given. In cases where a protein sequence is identical in more than one species, the OS line(s) will list the names of all those species.

Examples of OS lines are shown here:

OS   Escherichia coli.
OS   Homo sapiens (Human).
OS   Acer spicatum (Moose maple) (Mountain maple).
OS   Rous sarcoma virus (strain Schmidt-Ruppin).

If a SWISS-PROT entry reports the sequence of a protein identical in a number of species, the names of these species will all be listed in the OS lines of that entry. The species names are separated by commas, the last species name being preceded by the word ‘and’. Species names are never cut across two lines. Here are examples of the OS lines for entries representing multiple species:

OS   Oncorhynchus nerka (Sockeye salmon), and
OS   Oncorhynchus masou (Cherry salmon) (Masu salmon).

OS   Mus musculus (Mouse), Rattus norvegicus (Rat), and
OS   Bos taurus (Bovine).
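The multi-species convention can be illustrated with the following Python sketch. The function name `parse_os_lines` is hypothetical, and the code assumes that a comma followed by a blank only ever separates two species (a name or a parenthesized remark containing such a comma would be split wrongly). The parenthesized items are returned as found, whether they are English names or, for viruses, strain remarks.

```python
import re


def parse_os_lines(lines):
    """Return a list of (species_name, [parenthesized items]) tuples."""
    text = " ".join(line[2:].strip() for line in lines if line.startswith("OS"))
    text = text.rstrip(".")  # the last OS line ends with a period
    species = []
    for part in text.split(", "):
        if part.startswith("and "):
            part = part[len("and "):]  # the last species is preceded by 'and'
        name = part.split(" (")[0].strip()
        extras = re.findall(r"\(([^()]*)\)", part)
        species.append((name, extras))
    return species
```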

(3.7). The OG line

The OG (OrGanelle) line indicates if the gene coding for a protein originates from the mitochondria, the chloroplast, a cyanelle, or a plasmid.

The format of the OG line is:

OG   Chloroplast.
OG   Cyanelle.
OG   Mitochondrion.
OG   Plasmid name.

Where ‘name’ is the name of the plasmid.

If a SWISS-PROT entry reports the sequence of a protein identical in a number of plasmids, the names of these plasmids will all be listed in the OG lines of that entry. The plasmid names are separated by commas, the last plasmid name being preceded by the word ‘and’. Plasmid names are never cut across two lines. Examples:

OG   Plasmid IncFIV R124, and Plasmid IncFI ColV3-K30.

OG   Plasmid R6-5, Plasmid IncFII NR1, and
OG   Plasmid IncFII R1-19 (R1 drd-19).

(3.8). The OC line

The OC (Organism Classification) lines contain the taxonomic classification of the source organism. The taxonomic classification used in SWISS-PROT is that maintained at the NCBI (see http://www.ncbi.nlm.nih.gov/Taxonomy/) and used by the nucleotide sequence databases (EMBL/GenBank/DDBJ). The NCBI’s taxonomy reflects current phylogenetic knowledge. It is sequence-based as far as possible and relies on published authorities wherever possible. Because of the inherent ambiguity of evolutionary classification and the specific needs of database users (e.g., trying to track down the phylogenetic history of a group of organisms or to elucidate the evolution of a molecule), this taxonomy strives to accurately reflect current phylogenetic knowledge. The NCBI’s taxonomy is intended to be informative and helpful; no claim is made that it is necessarily the best or most exact.

The classification is listed top-down as nodes in a taxonomic tree in which the most general grouping is given first. The classification may be distributed over several OC lines, but nodes are not split or hyphenated between lines. Semicolons separate the individual items and the list is terminated by a period.

The format of the OC line is:

OC   Node[; Node...].

For example the classification lines for a human sequence would be:

OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.

If a protein is identical in more than one species, all the species names will be listed (see 3.6) but the OC lines will only contain the classification for the first species listed.
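Because nodes are never split between lines, the OC lines of an entry can simply be joined and split on semicolons. A minimal sketch follows; the function name `parse_oc_lines` is our own.

```python
def parse_oc_lines(lines):
    """Return the taxonomic nodes of an entry, most general grouping first."""
    text = " ".join(line[2:].strip() for line in lines if line.startswith("OC"))
    # The list is terminated by a period and the items are separated by
    # semicolons; empty pieces are discarded.
    return [node.strip() for node in text.rstrip(".").split(";") if node.strip()]


human_lineage = parse_oc_lines([
    "OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;",
    "OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.",
])
```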

(3.9). The reference (RN, RP, RC, RX, RA, RT, RL) lines

These lines comprise the literature citations within SWISS-PROT. The citations indicate the sources from which the data has been abstracted. The reference lines for a given citation occur in a block, and are always in the order RN, RP, RC, RX, RA, RT and RL. Within each such reference block the RN and RP lines occur once, the RC, RX and RT lines occur zero or more times, and the RA and RL lines each occur one or more times. If several references are given, there will be a reference block for each.

An example of a complete reference is:

RN   [1]
RP   SEQUENCE FROM N.A., AND SEQUENCE OF 1-15.
RC   STRAIN=Sprague-Dawley; TISSUE=Liver;
RX   MEDLINE; 91002678.
RA   Chan Y.-L., Paz V., Olvera J., Wool I.G.;
RT   "The primary structure of rat ribosomal proteins: the amino acid
RT   sequences of L27a and L28 and corrections in the sequences of S4 and
RT   S12.";
RL   Biochim. Biophys. Acta 1050:69-73(1990).
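Since every reference block starts with an RN line, the reference lines of an entry can be grouped block by block. The sketch below is one possible way of doing so; `REFERENCE_CODES` and `split_reference_blocks` are illustrative names, and each block is represented as a dictionary mapping a line code to the list of its line contents.

```python
REFERENCE_CODES = ("RN", "RP", "RC", "RX", "RA", "RT", "RL")


def split_reference_blocks(lines):
    """Group reference lines into blocks; a new block starts at each RN line."""
    blocks = []
    for line in lines:
        code = line[:2]
        if code not in REFERENCE_CODES:
            continue  # not a reference line
        if code == "RN":
            blocks.append({})
        if not blocks:
            raise ValueError("reference line found before the first RN line")
        blocks[-1].setdefault(code, []).append(line[2:].strip())
    return blocks
```

For the complete reference shown above, the result is a single block whose ‘RT’ entry holds three lines and whose ‘RL’ entry holds one.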

The formats of the individual lines are explained below.

(3.9.1). The RN line

The RN (Reference Number) line gives a sequential number to each reference citation in an entry. This number is used to indicate the reference in comments and feature table notes. The format of the RN line is:

RN   [N]

where ‘N’ denotes the nth reference for this entry. The reference number is always enclosed in square brackets.

(3.9.2). The RP line

The RP (Reference Position) line describes the extent of the work carried out by the authors of the reference cited. The format of the RP line is:

RP   COMMENT.

Typical examples of RP lines are shown below:

RP   SEQUENCE FROM N.A.
RP   SEQUENCE FROM N.A., AND SEQUENCE OF 12-35.
RP   SEQUENCE OF 34-56; 67-73 AND 123-345, AND DISULFIDE BONDS.
RP   REVISIONS TO 67-89.
RP   STRUCTURE BY NMR.
RP   X-RAY CRYSTALLOGRAPHY (1.8 ANGSTROMS).
RP   CHARACTERIZATION.
RP   MUTAGENESIS OF TYR-56.
RP   REVIEW.
RP   VARIANT ALA-58.
RP   VARIANTS XLI LEU-341; ARG-372 AND TYR-446.

(3.9.3). The RC line

The RC (Reference Comment) lines are optional lines which are used to store comments relevant to the reference cited. The format of the RC line is:

RC   TOKEN1=Text; TOKEN2=Text; .....

Where the currently defined tokens are:

PLASMID
SPECIES
STRAIN
TISSUE
TRANSPOSON

Examples of RC lines:

RC   STRAIN=Sprague-Dawley; TISSUE=Liver;
RC   STRAIN=Holstein; TISSUE=Mammary gland, and Lymph node;
RC   SPECIES=Rat; STRAIN=Wistar;
RC   SPECIES=A.thaliana; STRAIN=cv. Columbia;
RC   PLASMID=IncFII R100;

The ‘SPECIES’ token is only used when an entry describes a sequence that is identical in more than one species; similarly the ‘PLASMID’ token is only used if an entry describes a sequence identical in more than one plasmid.

The SWISS-PROT document TISSLIST.TXT lists all the tissues that are used in the database in the context of the ‘TISSUE’ token.
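The ‘TOKEN=Text;’ structure of the RC line lends itself to a dictionary. The following sketch uses a hypothetical function name and assumes that each token occurs at most once per line and that the text of a token never contains a semicolon.

```python
def parse_rc_line(line: str):
    """Return the tokens of one RC line as a {TOKEN: text} dictionary."""
    tokens = {}
    for item in line[2:].split(";"):
        item = item.strip()
        if not item:
            continue  # piece after the terminating semicolon
        key, _, value = item.partition("=")
        tokens[key.strip()] = value.strip()
    return tokens
```

Note that a comma inside the text, as in ‘Mammary gland, and Lymph node’, is kept as part of the value.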

(3.9.4). The RX line

The RX (Reference cross-reference) line is an optional line which is used to indicate the identifier assigned to a specific reference in a bibliographic database. The format of the RX line is:

RX   BIBLIOGRAPHIC_DATABASE_NAME; IDENTIFIER.

Where the valid bibliographic database names and their associated identifier are:

Name:        MEDLINE
Database:    Medline from the National Library of Medicine (NLM)
Identifier:  Eight-digit Medline Unique Identifier (UID)

Example of RX line:

RX   MEDLINE; 91002678.

(3.9.5). The RA line

The RA (Reference Author) lines list the authors of the paper (or other work) cited. All of the authors are included, and are listed in the order given in the paper. The names are listed surname first followed by a blank, followed by initial(s) with periods. The authors' names are separated by commas and terminated by a semicolon. Author names are not split between lines. An example of the use of RA lines is shown below:

RA   Coffman B.L., Tephly T.R., Irshaid Y.M., Green M.D., Smith C.,
RA   Jackson M.R., Wooster R., Burchell B.;

As many RA lines as necessary are included for each reference.

An author’s initials can be followed by an abbreviation such as ‘Jr’ (for Junior), ‘Sr’ (Senior), ‘II’, ‘III’ or ‘IV’ (2nd, 3rd and 4th). Example:

RA   Smith H. Jr., von Braun M.T. III;
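The author list can be recovered by joining the RA lines of a reference and splitting on commas. This is a minimal sketch with an illustrative function name; it assumes the input holds the RA lines of a single reference.

```python
def parse_ra_lines(lines):
    """Return the authors of one reference, in the order given in the paper."""
    text = " ".join(line[2:].strip() for line in lines if line.startswith("RA"))
    text = text.rstrip(";")  # the author list is terminated by a semicolon
    return [author.strip() for author in text.split(",") if author.strip()]
```

Suffixes such as ‘Jr.’ or ‘III’ stay attached to the author they follow, because only commas separate the names.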

(3.9.6). The RT line

The RT (Reference Title) lines give the title of the paper (or other work) cited as exactly as possible given the limitations of the computer character set. The format of the RT line is:

RT   "Title";

Example of a set of RT lines:

RT   "New insulin-like proteins with atypical disulfide bond pattern
RT   characterized in Caenorhabditis elegans by comparative sequence
RT   analysis and homology modeling.";

It should be noted that the format of the title is not always identical to that displayed at the top of the published work:

• Major title words are not capitalized;
• The text of a title ends with either a period ‘.’, a question mark ‘?’ or an exclamation mark ‘!’;
• Double quotation marks ‘"’ in the text of the title are replaced by single quotation marks;
• Titles of articles published in a language other than English have been translated into English;
• Greek letters are spelled out (alpha, beta, etc.).

(3.9.7). The RL line

The RL (Reference Location) lines contain the conventional citation information for the reference. In general, the RL lines alone are sufficient to find the paper in question.

a) Journal citations

The RL line for a journal citation includes the journal abbreviation, the volume number, the page range, and the year. The format for such an RL line is:

RL   Journal_abbrev Volume:First_page-Last_page(YYYY).

Journal names are abbreviated according to the conventions used by the National Library of Medicine (NLM) and are based on the existing ISO and ANSI standards. A list of the abbreviations currently in use is given in the SWISS-PROT document file JOURLIST.TXT.

An example of an RL line is:

RL   J. Mol. Biol. 168:321-331(1983).

When a reference is made to a paper which is ‘in press’ at the time when the database is released, the page range, and possibly the volume number, are indicated as ‘0’ (zero). An example of such an RL line is shown here:

RL   Nucleic Acids Res. 27:0-0(1999).
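The journal form of the RL line can be matched with a regular expression. The sketch below is an illustration only: the names `JOURNAL_RL_RE` and `parse_journal_rl` are ours, page numbers are assumed to be purely numeric, and the ‘in press’ test looks only at the page range. RL lines of the other kinds described in this section do not match and yield `None`.

```python
import re

JOURNAL_RL_RE = re.compile(
    r"RL\s+(?P<journal>.+) (?P<volume>[^\s:]+):"
    r"(?P<first>\d+)-(?P<last>\d+)\((?P<year>\d{4})\)\."
)


def parse_journal_rl(line: str):
    """Parse a journal citation RL line; return None for other RL forms."""
    match = JOURNAL_RL_RE.fullmatch(line.strip())
    if match is None:
        return None
    first, last = int(match["first"]), int(match["last"])
    return {
        "journal": match["journal"],
        "volume": match["volume"],
        "pages": (first, last),
        "year": int(match["year"]),
        # Papers in press carry a page range of 0-0.
        "in_press": first == 0 and last == 0,
    }
```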

b) Book citations

A variation of the RL line format is used for papers found in books or other similar publications, which are cited using the following format:

RL   (In) Editor_1 I.[, Editor_2 I., Editor_X I.] (eds.);
RL   Book_name, pp.[Volume:]First_page-Last_page, Publisher, City (YYYY).

Examples:

RL   (In) Boyer P.D. (eds.);
RL   The enzymes (3rd ed.), pp.11:397-547, Academic Press, New York (1975).

RL   (In) Rich D.H., Gross E. (eds.);
RL   Proceedings of the 7th american peptide symposium, pp.69-72,
RL   Pierce Chemical Co., Rockford Il. (1981).

RL   (In) Magnusson S., Ottesen M., Foltmann B., Dano K.,
RL   Neurath H. (eds.);
RL   Regulatory proteolytic enzymes and their inhibitors, pp.163-172,
RL   Pergamon Press, New York (1978).

c) Plant Gene Register and Worm Breeder’s Gazette citations

The ‘(In)’ prefix used for books (see above) is also used for references to the electronic Plant Gene Register (http://www.tarweed.com/pgr/) as well as to the Worm Breeder's Gazette (http://elegans.swmed.edu/wli/). Examples:

RL   (In) Plant Gene Register PGR98-023.
RL   (In) Worm Breeder's Gazette 15(3):34(1998).

d) Unpublished results

RL lines for unpublished results follow the format shown in the next example:

RL   Unpublished results, cited by:
RL   Shelnutt J.A., Rousseau D.L., Dethmers J.K., Margoliash E.;
RL   Biochemistry 20:6485-6497(1981).

e) Unpublished observations

For unpublished observations the format of the RL line is:

RL   Unpublished observations (MMM-YYYY).

Where ‘MMM’ is the month and ‘YYYY’ is the year.

We use the ‘unpublished observations’ RL line to cite communications by scientists to SWISS-PROT of unpublished information concerning various aspects of a sequence entry.

f) Thesis

For Ph.D. theses the format of the RL line is:

RL   Thesis (Year), Institution_name, Country.

An example of such a line is given here:

RL   Thesis (1972), University of Geneva, Switzerland.

g) Patent applications

For patent applications the format of the RL line is:

RL   Patent number Pat_num, DD-MMM-YYYY.

Where ‘Pat_num’ is the international publication number of the patent, ‘DD’ is the day, ‘MMM’ is the month and ‘YYYY’ is the year. Example:

RL   Patent number WO9010703, 20-SEP-1990.

h) Submissions

The final form that an RL line can take is that used for submissions. The format of such an RL line is:

RL   Submitted (MMM-YYYY) to the Database_name.

Where ‘MMM’ is the month, ‘YYYY’ is the year and ‘Database_name’ is one of the following:

EMBL/GenBank/DDBJ databases
SWISS-PROT data bank
HIV data bank
PDB data bank
PIR data bank

Two examples of submission RL lines are given here:

RL   Submitted (APR-1994) to the EMBL/GenBank/DDBJ databases.
RL   Submitted (FEB-1999) to the SWISS-PROT data bank.

(3.10). The CC line

The CC lines are free text comments on the entry, and may be used to convey any useful information. The comments always appear below the last reference line and are grouped together in comment blocks, a block being made of one or more comment lines. The start of the first line of a block is marked with the characters ‘-!-’.

The format of a comment block is:

CC   -!- TOPIC: FIRST LINE OF A COMMENT BLOCK;
CC       SECOND AND SUBSEQUENT LINES OF A COMMENT BLOCK.

The comment blocks are arranged according to what we designate as ‘topics’. The current topics and their definitions are listed in the table below.

Topic                   Description
ALTERNATIVE PRODUCTS    Description of the existence of related protein sequence(s) produced by alternative splicing of the same gene or by the use of alternative initiation codons
CATALYTIC ACTIVITY      Description of the reaction(s) catalyzed by an enzyme [1]
CAUTION                 This topic warns you about possible errors and/or grounds for confusion
COFACTOR                Description of an enzyme cofactor
DATABASE                Description of a cross-reference to a network database/resource for a specific protein [2]
DEVELOPMENTAL STAGE     Description of the developmental specific expression of a protein
DISEASE                 Description of the disease(s) associated with a deficiency of a protein
DOMAIN                  Description of the domain structure of a protein
ENZYME REGULATION       Description of an enzyme regulatory mechanism
FUNCTION                General description of the function(s) of a protein
INDUCTION               Description of the compound(s) which stimulate the synthesis of a protein
MASS SPECTROMETRY       Reports the exact molecular weight of a protein or part of a protein as determined by mass spectrometric methods [3]
MISCELLANEOUS           Any comment which does not belong to any of the other defined topics
PATHWAY                 Description of the metabolic pathway(s) to which a protein is associated
PHARMACEUTICAL          Description of the use of a specific protein as a pharmaceutical drug
POLYMORPHISM            Description of polymorphism(s)
PTM                     Description of a post-translational modification
SIMILARITY              Description of the similarity(ies) (sequence or structural) of a protein with other proteins
SUBCELLULAR LOCATION    Description of the subcellular location of the mature protein
SUBUNIT                 Description of the quaternary structure of a protein
TISSUE SPECIFICITY      Description of the tissue specificity of a protein

Notes:

[1] For the ‘CATALYTIC ACTIVITY’ topic: whenever possible, we have described the catalytic activity of an enzyme using the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB) as published in Enzyme Nomenclature, NC-IUBMB, Academic Press, New York, (1992).

[2] The syntax of the ‘DATABASE’ topic is:

CC   -!- DATABASE: NAME=Text[; NOTE=Text][; WWW="Address"][; FTP="Address"].

Where:

• ‘NAME’ is the name of the database;
• ‘NOTE’ (optional) is a free text note;
• ‘WWW’ (optional) is the WWW address (URL) of the database;
• ‘FTP’ (optional) is the anonymous FTP address (including the directory name) where the database file(s) are stored.

Note: this is currently the only part of the database where lines longer than 75 characters can be found, as long URL or FTP addresses are not reformatted into multiple lines.

[3] The syntax of the ‘MASS SPECTROMETRY’ topic is:

CC   -!- MASS SPECTROMETRY: MW=XXX[; MW_ERR=XX][; METHOD=XX][; RANGE=XX-XX].

Where:

• ‘MW=XXX’ is the determined molecular weight (MW);
• ‘MW_ERR=XX’ (optional) is the accuracy or error range of the MW measurement;
• ‘METHOD=XX’ (optional) is the mass spectrometric method;
• ‘RANGE=XX-XX’ (optional) is used to indicate what part of the protein sequence entry corresponds to the molecular weight. If this qualifier is not present, the MW value corresponds to the full length of the protein sequence.
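The qualifier syntax of the ‘MASS SPECTROMETRY’ topic can be decoded as in the following sketch. The function name and the keys of the returned dictionary are hypothetical; the input is assumed to be the text of the comment block after the topic name, and the numeric qualifiers are assumed to be plain decimal numbers.

```python
def parse_mass_spectrometry(body: str):
    """Decode 'MW=...; MW_ERR=...; METHOD=...; RANGE=a-b.' into a dictionary."""
    fields = {}
    for item in body.strip().rstrip(".").split(";"):
        key, _, value = item.strip().partition("=")
        if key:
            fields[key] = value.strip()
    residue_range = None  # None means the full length of the sequence
    if "RANGE" in fields:
        first, last = fields["RANGE"].split("-")
        residue_range = (int(first), int(last))
    return {
        "mw": float(fields["MW"]),  # MW is the only mandatory qualifier
        "mw_err": float(fields["MW_ERR"]) if "MW_ERR" in fields else None,
        "method": fields.get("METHOD"),
        "range": residue_range,
    }
```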

Each SWISS-PROT entry will contain a variable number of CC line topics. Most topics can be present more than once in a given entry. The only topics that can occur only once in an entry are: ALTERNATIVE PRODUCTS, COFACTOR, DEVELOPMENTAL STAGE, ENZYME REGULATION, INDUCTION, SUBCELLULAR LOCATION, SUBUNIT and TISSUE SPECIFICITY.
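Because each comment block starts with ‘-!-’, the CC lines of an entry can be regrouped into (topic, text) pairs. The sketch below uses an illustrative function name and assumes that every CC line passed to it belongs to a comment block in the format described above.

```python
def parse_cc_blocks(lines):
    """Return the comment blocks of an entry as a list of (topic, text)."""
    blocks = []
    for line in lines:
        if not line.startswith("CC"):
            continue
        text = line[2:].strip()
        if text.startswith("-!-"):
            # First line of a block: the topic ends at the first colon.
            topic, _, rest = text[3:].strip().partition(":")
            blocks.append([topic.strip(), rest.strip()])
        elif blocks:
            # Continuation line: append to the current block.
            blocks[-1][1] = (blocks[-1][1] + " " + text).strip()
    return [(topic, body) for topic, body in blocks]
```

Counting the topics of the returned pairs is then enough to check, for instance, that a topic such as SUBUNIT appears at most once in an entry.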

We show here, for each of the defined topics, two examples of their usage:

CC   -!- ALTERNATIVE PRODUCTS: AT LEAST THREE ISOFORMS; AIRE-1 (SHOWN
CC       HERE), AIRE-2 AND AIRE-3; SEEM TO BE PRODUCED BY ALTERNATIVE
CC       SPLICING. AIRE-2 AND AIRE-3 SEEMS TO BE LESS FREQUENTLY
CC       EXPRESSED THAN AIRE-1, IF AT ALL.
CC   -!- ALTERNATIVE PRODUCTS: USING ALTERNATIVE INITIATION CODONS IN
CC       THE SAME READING FRAME, THE GENE TRANSLATES INTO THREE
CC       ISOZYMES: ALPHA, BETA AND BETA'.
CC   -!- CATALYTIC ACTIVITY: ATP + L-GLUTAMATE + NH(3) = ADP +
CC       GLUTAMINE + PHOSPHATE.
CC   -!- CATALYTIC ACTIVITY: (R)-2,3-DIHYDROXY-3-METHYLBUTANOATE +
CC       NADP(+) = (S)-2-HYDROXY-2-METHYL-3-OXOBUTANOATE + NADPH.
CC   -!- CAUTION: REF.2 SEQUENCE DIFFERS FROM THAT SHOWN IN POSITIONS
CC       92 TO 165 DUE TO A FRAMESHIFT.
CC   -!- CAUTION: IT IS UNCERTAIN WHETHER MET-1 OR MET-3 IS THE
CC       INITIATOR.
CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- COFACTOR: FAD AND NONHEME IRON.
CC   -!- DATABASE: NAME=CD40Lbase;
CC       NOTE=European CD40L defect database (mutation db);
CC       WWW="http://www.expasy.ch/cd40lbase/";
CC       FTP="ftp://www.expasy.ch/databases/cd40lbase".
CC   -!- DATABASE: NAME=PROW; NOTE=CD guide CD80 entry;
CC       WWW="http://www.ncbi.nlm.nih.gov/prow/cd/cd80.htm".
CC   -!- DEVELOPMENTAL STAGE: EXPRESSED EARLY DURING CONIDIAL (DORMANT
CC       SPORES) DIFFERENTIATION.
CC   -!- DEVELOPMENTAL STAGE: EXPRESSED IN EMBRYONIC AND EARLY LARVAL
CC       STAGES.
CC   -!- DISEASE: DEFECTS IN PHKA1 ARE LINKED TO X-LINKED MUSCLE
CC       GLYCOGENOSIS, A DISEASE CHARACTERIZED BY SLOWLY PROGRESSIVE,
CC       PREDOMINANTLY DISTAL MUSCLE WEAKNESS AND ATROPHY.
CC   -!- DISEASE: DEFECTS IN ALD ARE THE CAUSE OF X-LINKED
CC       ADRENOLEUKODYSTROPHY, A PEROXISOMAL DISORDER CHARACTERIZED BY
CC       PROGRESSIVE DEMYLEINATION OF THE CNS AND ADRENAL
CC       INSUFFICIENCY.
CC   -!- DOMAIN: CONTAINS A COILED-COIL DOMAIN ESSENTIAL FOR VESICULAR
CC       TRANSPORT AND A DISPENSABLE C-TERMINAL REGION.
CC   -!- DOMAIN: THE B CHAIN IS COMPOSED OF TWO DOMAINS, EACH DOMAIN
CC       CONSISTS OF 3 HOMOLOGOUS SUBDOMAINS (ALPHA, BETA, GAMMA).
CC   -!- ENZYME REGULATION: THE ACTIVITY OF THIS ENZYME IS CONTROLLED
CC       BY ADENYLATION. THE FULLY ADENYLATED ENZYME IS INACTIVE.
CC   -!- ENZYME REGULATION: ACTIVATED BY GRAM-NEGATIVE BACTERIAL
CC       LIPOPOLYSACCHARIDES AND CHYMOTRYPSIN.

CC   -!- FUNCTION: PROFILIN PREVENTS THE POLYMERIZATION OF ACTIN.
CC   -!- FUNCTION: INHIBITOR OF FUNGAL POLYGALACTURONASE. IT IS AN
CC       IMPORTANT FACTOR FOR PLANT RESISTANCE TO PHYTOPATHOGENIC
CC       FUNGI.
CC   -!- INDUCTION: BY SALT STRESS AND BY ABSCISIC ACID (ABA).
CC   -!- INDUCTION: BY INFECTION, PLANT WOUNDING, OR ELICITOR
CC       TREATMENT OF CELL CULTURES.
CC   -!- MASS SPECTROMETRY: MW=71890; MW_ERR=7; METHOD=MALDI.
CC   -!- MASS SPECTROMETRY: MW=8597.5; METHOD=ELECTROSPRAY;
CC       RANGE=40-119.
CC   -!- MISCELLANEOUS: BINDS TO BACITRACIN.
CC   -!- MISCELLANEOUS: JUVENILE HORMONE SUPPRESSES TRANSFERRIN LEVELS
CC       DRASTICALLY IN THE ADULT FEMALE COCKROACH.
CC   -!- PATHWAY: FIRST STEP IN PROLINE BIOSYNTHESIS PATHWAY.
CC   -!- PATHWAY: LAST STEP IN PROTOHEME BIOSYNTHESIS. IN ERYTHROID
CC       CELLS, FERROCHELATASE APPEARS TO BE THE RATE-LIMITING ENZYME.
CC   -!- PHARMACEUTICAL: Available under the names Avonex (Biogen),
CC       Betaseron (Berlex) and Rebif (Serono). Used in the treatment of
CC       multiple sclerosis (MS). Betaseron is a slightly modified form
CC       of IFNB1 with two residue substitutions.
CC   -!- PHARMACEUTICAL: Available under the name Proleukin (Chiron). Used
CC       in patients with renal cell carcinoma or metastatic melanoma.
CC   -!- POLYMORPHISM: THE ALLELIC FORM OF THE ENZYME WITH GLN-191
CC       HYDROLYZES PARAOXON WITH A LOW TURNOVER NUMBER AND THE ONE
CC       WITH ARG-191 WITH A HIGH TURNOVER NUMBER.
CC   -!- POLYMORPHISM: THE TWO MAIN ALLELES OF HP ARE CALLED HP1F
CC       (FAST) AND HP1S (SLOW). THE SEQUENCE SHOWN HERE IS THAT OF THE
CC       HP1S FORM.
CC   -!- PTM: O-GLYCOSYLATED; AN UNUSUAL FEATURE AMONG VIRAL
CC       GLYCOPROTEINS.
CC   -!- PTM: THE SOLUBLE FORM DERIVES FROM THE MEMBRANE FORM BY
CC       PROTEOLYTIC PROCESSING.
CC   -!- SIMILARITY: BELONGS TO THE ANNEXIN FAMILY.
CC   -!- SIMILARITY: CONTAINS 12 EGF-LIKE DOMAINS.
CC   -!- SUBCELLULAR LOCATION: MITOCHONDRIAL MATRIX.
CC   -!- SUBCELLULAR LOCATION: INTEGRAL MEMBRANE PROTEIN. INNER
CC       MEMBRANE.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SUBUNIT: HETERODIMER OF A LIGHT CHAIN AND A HEAVY CHAIN LINKED
CC       BY A DISULFIDE BOND.

CC   -!- TISSUE SPECIFICITY: KIDNEY, SUBMAXILLARY GLAND, AND URINE.
CC   -!- TISSUE SPECIFICITY: SHOOTS, ROOTS, AND COTYLEDON FROM
CC       DEHYDRATING SEEDLINGS.

(3.11). The DR line

(3.11.1). Definition

The DR (Database cross-Reference) lines are used as pointers to information related to SWISS-PROT entries and found in data collections other than SWISS-PROT. The full list of all databases to which SWISS-PROT is cross-referenced can be found in the document file DBXREF.TXT.

For example, if the X-ray crystallographic atomic coordinates of a sequence are stored in the Protein Data Bank (PDB) there will be DR line(s) pointing to the corresponding entry(ies) in that database. For a sequence translated from a nucleotide sequence there will be DR line(s) pointing to the relevant entry(ies) in the EMBL/GenBank/DDBJ database which correspond to the DNA or RNA sequence(s) from which it was translated.

The format of the DR line is:

DR   DATABASE_IDENTIFIER; PRIMARY_IDENTIFIER; SECONDARY_IDENTIFIER.

Exceptions are cross-references to the EMBL/GenBank/DDBJ nucleotide sequence database and to the PROSITE and Pfam databases. The specific formats for these cross-references are described in sections 3.11.5 and 3.11.6.

(3.11.2). Database identifier

The first item on the DR line, the ‘DATABASE_IDENTIFIER’, is the abbreviated name of the data collection to which reference is made. The currently defined database identifiers are listed below.

Identifier             Database description
EMBL                   Nucleotide sequence database of EMBL (EBI) (see 3.11.5)
CARBBANK               Complex carbohydrate structure database (CCSD) from CarbBank
DICTYDB                Dictyostelium discoideum genome database
ECO2DBASE              Escherichia coli gene-protein database (2D gel spots) (ECO2DBASE)
ECOGENE                Escherichia coli K12 genome database (EcoGene)
FLYBASE                Drosophila genome database (FlyBase)
GCRDB                  G protein-coupled receptor database (GCRDb)
HIV                    HIV sequence database
HSC-2DPAGE             Harefield hospital 2D gel protein databases (HSC-2DPAGE)
HSSP                   Homology-derived secondary structure of proteins database (HSSP)
INTERPRO               Integrated resource of protein families, domains and functional sites (InterPro)
MAIZEDB                Maize genome database (MaizeDB)
MAIZE-2DPAGE           Maize genome 2D Electrophoresis database (Maize-2DPAGE)
MENDEL                 Plant gene nomenclature database (Mendel)
MGD                    Mouse genome database (MGD)
MIM                    Mendelian Inheritance in Man Database (MIM)
PDB                    3D-macromolecular structure Protein Data Bank (PDB)
PFAM                   Pfam protein domain database (see 3.11.6)
PIR                    Protein sequence database of the Protein Information Resource (PIR)
PRINTS                 Protein Fingerprint database (PRINTS)
PROSITE                PROSITE protein domains and families database (see 3.11.6)
REBASE                 Restriction enzymes and methylases database (REBASE)
AARHUS/GHENT-2DPAGE    Human keratinocyte 2D gel protein database from Aarhus and Ghent universities
SGD                    Saccharomyces Genome Database (SGD)
STYGENE                Salmonella typhimurium LT2 genome database (StyGene)
SUBTILIST              Bacillus subtilis 168 genome database (SubtiList)
SWISS-2DPAGE           2D-PAGE database from the Geneva University Hospital (SWISS-2DPAGE)
TIGR                   The bacterial database(s) of ‘The Institute for Genomic Research’ (TIGR)
TRANSFAC               Transcription factor database (TRANSFAC)
TUBERCULIST            Mycobacterium tuberculosis H37Rv genome database (TubercuList)
WORMPEP                Caenorhabditis elegans genome sequencing project protein database (WormPep)
YEPD                   Yeast electrophoresis protein database (YEPD)
ZFIN                   Zebrafish Information Network genome database (ZFIN)

(3.11.3). The primary identifier

The second item on the DR line, the 'PRIMARY_IDENTIFIER', is an unambiguous pointer to the information entry in the database to which reference is being made.

• For a CarbBank, DictyDb, EcoGene, FlyBase, GCRDb, HIV, HSC-2DPAGE, InterPro, MAIZE-2DPAGE, Mendel, MGD, MIM, PIR, PRINTS, REBASE, SGD, StyGene, SubtiList, SWISS-2DPAGE, TRANSFAC, TubercuList or ZFIN reference the primary identifier is the first accession number (also called the Unique Identifier in some databases) of the entry to which reference is being made.

• For a PDB reference the primary identifier is the entry name.

• For an AARHUS/GHENT-2DPAGE, ECO2DBASE or YEPD reference the primary identifier is the protein spot alphanumeric designation.

• For a WormPep reference the primary identifier is the cosmid-derived name given to that protein by the C.elegans genome-sequencing project.

• For a MaizeDB reference the primary identifier is the 'Gene-product' accession ID.

• For a TIGR reference the primary identifier is the genome Open Reading Frame (ORF) code.

• For an HSSP reference the primary identifier is the accession number of a SWISS-PROT entry cross-referenced to a PDB entry whose structure is expected to be similar to that of the entry in which the HSSP cross-reference is present.

(3.11.4). The secondary identifier

The third item on the DR line, the 'SECONDARY_IDENTIFIER', is generally used to complement the information given by the primary identifier.

• For an HIV, PIR, PRINTS or REBASE reference the secondary identifier is the entry's name.

• For a PDB reference the secondary identifier is the most recent date on which PDB revised the entry (last 'REVDAT' record).

• For a DictyDb, EcoGene, FlyBase, Mendel, MGD, SGD, StyGene, SubtiList or ZFIN reference the secondary identifier is the gene designation. If the gene designation is not available, a dash ('-') is used.

• For an ECO2DBASE reference the secondary identifier is the latest release number or edition of the database that has been used to derive the cross-reference.

• For a SWISS-2DPAGE, HSC-2DPAGE or MAIZE-2DPAGE reference the secondary identifier is the species or tissue of origin.

• For an AARHUS/GHENT-2DPAGE reference the secondary identifier is either 'IEF' (for isoelectric focusing) or 'NEPHGE' (for non-equilibrium pH gradient electrophoresis).

• For a WormPep reference the secondary identifier is a number attributed by the C.elegans genome-sequencing project to that protein.

• For a CarbBank, GCRDb, InterPro, MaizeDB, MIM, TIGR, TRANSFAC, TubercuList or YEPD reference the secondary identifier is not defined and a dash ('-') is stored in that field.

• For an HSSP reference the secondary identifier is the entry name of the PDB structure related to that of the entry in which the HSSP cross-reference is present.

Examples of complete DR lines are shown here:

DR   AARHUS/GHENT-2DPAGE; 8006; IEF.
DR   CARBBANK; CCSD:27494; -.
DR   DICTYDB; DD01047; myoA.
DR   ECO2DBASE; G052.0; 6TH EDITION.
DR   ECOGENE; EG10054; araC.
DR   FLYBASE; FBgn0000055; Adh.
DR   GCRDB; GCR_0087; -.
DR   HIV; K02013; NEF$BRU.
DR   HSC-2DPAGE; P47985; HUMAN.
DR   HSSP; P00438; 1DOB.
DR   INTERPRO; IPR001254; -.
DR   MAIZEDB; 25342; -.
DR   MAIZE-2DPAGE; P80607; COLEOPTILE.
DR   MENDEL; 2596; AMAhy;psbA;1.
DR   MGD; MGI:87920; Adfp.
DR   MIM; 249900; -.
DR   PDB; 3ADK; 16-APR-88.
DR   PIR; A02768; R5EC7.
DR   PRINTS; PR00237; GPCRRHODOPSN.
DR   REBASE; RB00993; EcoRI.
DR   SGD; L0000008; AAR2.
DR   STYGENE; SG10312; proV.
DR   SUBTILIST; BG10774; oppD.
DR   SWISS-2DPAGE; P10599; HUMAN.
DR   TIGR; MJ0125; -.
DR   TRANSFAC; T00141; -.
DR   TUBERCULIST; Rv0001; -.
DR   WORMPEP; ZK637.7; CE00437.
DR   YEPD; 4270; -.
DR   ZFIN; ZDB-GENE-980526-290; hoxa1.
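The generic three-item layout can be read mechanically. The following Python sketch is an illustration only and is not part of the SWISS-PROT distribution; the name parse_dr_line is hypothetical. It assumes that items are separated by a semicolon followed by a blank, which keeps a secondary identifier such as 'AMAhy;psbA;1' intact, and it rejects EMBL, PROSITE and PFAM lines, whose four-item formats are described in sections 3.11.5 and 3.11.6.

```python
# Hypothetical helper (not part of SWISS-PROT): split a generic DR line.
FOUR_ITEM_DATABASES = {"EMBL", "PROSITE", "PFAM"}


def parse_dr_line(line):
    """Return (database, primary, secondary) for a generic DR line."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line: %r" % line)
    body = line[2:].strip()
    if not body.endswith("."):
        raise ValueError("DR line must end with a period: %r" % line)
    # Assumption: items are separated by '; ' (semicolon plus blank).
    # Splitting at most twice preserves semicolons that belong to the
    # secondary identifier, as in the MENDEL example.
    fields = body[:-1].split("; ", 2)
    if len(fields) != 3:
        raise ValueError("expected three items: %r" % line)
    database, primary, secondary = (field.strip() for field in fields)
    if database in FOUR_ITEM_DATABASES:
        raise ValueError("%s lines use a specific four-item format" % database)
    return database, primary, secondary
```

Only the final period is removed, so a primary identifier such as 'G052.0' or a secondary identifier such as '6TH EDITION' is returned unchanged.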

(3.11.5). Cross-references to the nucleotide sequence database

The specific format for cross-references to the EMBL/GenBank/DDBJ nucleotide sequence database is:

DR EMBL; ACCESSION_NUMBER; PROTEIN_ID; STATUS_IDENTIFIER.

Where 'PROTEIN_ID' stands for the 'Protein Sequence Identifier'. It is a string which is stored, in nucleotide sequence entries, in a qualifier called '/protein_id' which is tagged to every CDS in the nucleotide database. Example:

FT   CDS             302..2674
FT                   /protein_id="CAA03857.1"
FT                   /db_xref="SWISS-PROT:P26345"
FT                   /gene="recA"
FT                   /product="RecA protein"

The Protein Sequence Identifier (Protein_ID) consists of a stable ID portion (8 characters: 3 letters followed by 5 digits) plus a version number after a decimal point. The version number only changes when the protein sequence coded by the CDS changes, while the stable part remains unchanged. The Protein_ID effectively replaces what was previously known as the 'PID'.
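The structure of a Protein_ID lends itself to a simple pattern check. The Python sketch below is illustrative; the names split_protein_id and same_cds are hypothetical, and the restriction to upper-case letters is an assumption based on the examples shown in this section.

```python
import re

# Stable part: 3 letters followed by 5 digits; version: digits after the point.
# Assumption: the letters are upper case, as in the examples of this section.
PROTEIN_ID_RE = re.compile(r"^([A-Z]{3}[0-9]{5})\.([0-9]+)$")


def split_protein_id(protein_id):
    """Return (stable_id, version) or raise ValueError."""
    match = PROTEIN_ID_RE.match(protein_id)
    if match is None:
        raise ValueError("not a Protein_ID: %r" % protein_id)
    return match.group(1), int(match.group(2))


def same_cds(id_a, id_b):
    """True if both identifiers share the same stable part."""
    return split_protein_id(id_a)[0] == split_protein_id(id_b)[0]
```

Two identifiers with the same stable part but different version numbers refer to the same CDS before and after a change in its protein sequence.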

The 'STATUS_IDENTIFIER' provides information about the relationship between the sequence in the SWISS-PROT entry and the CDS in the corresponding EMBL entry.

a) In most cases the translation of the EMBL nucleotide sequence CDS results in the same sequence as shown in the corresponding SWISS-PROT entry or the differences are mentioned in the SWISS-PROT feature (FT) lines as CONFLICT, VARIANT or VARSPLIC and in the RP lines. In these cases the status identifier shows a dash ('-').

Example:

DR EMBL; Y00312; CAA68412.1; -.

b) In some cases the translation of the EMBL nucleotide sequence CDS results in a sequence different from the sequence shown in the corresponding SWISS-PROT entry. When the differences are either not mentioned in the SWISS-PROT feature (FT) lines as CONFLICT, VARIANT or VARSPLIC (see Appendix A) and in the RP lines, or simply do not meet the criteria for such situations, the differences are indicated as follows:

1) If the difference is due to a different start of the sequence (e.g. SWISS-PROT believes that the start of the sequence is upstream or downstream of the site annotated as the start of the sequence in the EMBL database), the status identifier shows the comment 'ALT_INIT'. Example:

DR   EMBL; L29151; AAA99430.1; ALT_INIT.

2) If the difference is due to a different termination of the sequence (e.g. SWISS-PROT believes that the termination of the sequence is upstream or downstream of the site annotated as the end of the sequence in the EMBL database), the status identifier shows the comment 'ALT_TERM'. Example:

DR   EMBL; L20562; AAA26884.1; ALT_TERM.

3) If the difference is due to frameshifts in the EMBL sequence, the status identifier shows the comment 'ALT_FRAME'. Example:

DR   EMBL; X56420; CAA39814.1; ALT_FRAME.

4) If the difference is not due to any of the cases mentioned above (e.g. wrong intron-exon boundaries given in the EMBL entry) or to a mixture of the cases mentioned above, the status identifier shows the comment 'ALT_SEQ'. Example:

DR EMBL; M28482; AAA26378.1; ALT_SEQ.

c) In some cases the nucleotide sequence of a complete CDS is divided into exons present in different EMBL entries. We point to the exon-containing EMBL entries by citing the Protein_ID as secondary identifier and adding the comment 'JOINED' in the status identifier. These EMBL entries do not contain a CDS feature; they contain exons joined to a CDS feature which is labeled with the given Protein_ID.

Example:

DR   EMBL; M63397; AAA51662.1; -.
DR   EMBL; M63395; AAA51662.1; JOINED.
DR   EMBL; M63396; AAA51662.1; JOINED.

In the above example the SWISS-PROT sequence is derived from the CDS labeled with the Protein_ID AAA51662. This CDS feature can be found in the EMBL entry M63397. Exons belonging to this CDS are not only found in EMBL entry M63397, but also in the EMBL entries M63395 and M63396.

d) In some cases there is no CDS feature key annotating a protein translation in an EMBL entry and thus no Protein_ID for that CDS. Therefore it is not possible for us to point to a Protein_ID as a secondary identifier. In these cases we point to the relevant EMBL entries by including a dash ('-') in the position of the missing Protein_ID and 'NOT_ANNOTATED_CDS' in the status identifier.

Example:

DR EMBL; J04126; -; NOT_ANNOTATED_CDS.
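The cases a) to d) above can be summarised as a small lookup table. The following Python sketch is illustrative and not an official parser; the names EMBL_STATUSES and parse_embl_dr are hypothetical, and the one-line descriptions paraphrase this section.

```python
# Status identifiers described in cases a) to d), with paraphrased meanings.
EMBL_STATUSES = {
    "-": "translation agrees, or differences are documented in the entry",
    "ALT_INIT": "different start of the sequence",
    "ALT_TERM": "different termination of the sequence",
    "ALT_FRAME": "frameshift(s) in the EMBL sequence",
    "ALT_SEQ": "other difference, or a mixture of the cases above",
    "JOINED": "entry holds exons joined to a CDS labeled with the Protein_ID",
    "NOT_ANNOTATED_CDS": "no CDS feature, hence no Protein_ID",
}


def parse_embl_dr(line):
    """Parse 'DR EMBL; ACCESSION_NUMBER; PROTEIN_ID; STATUS_IDENTIFIER.'"""
    if not line.startswith("DR"):
        raise ValueError("not a DR line: %r" % line)
    body = line[2:].strip()
    if not body.endswith("."):
        raise ValueError("DR line must end with a period: %r" % line)
    fields = [field.strip() for field in body[:-1].split(";")]
    if len(fields) != 4 or fields[0] != "EMBL":
        raise ValueError("not an EMBL cross-reference: %r" % line)
    _, accession, protein_id, status = fields
    if status not in EMBL_STATUSES:
        raise ValueError("unknown status identifier: %r" % status)
    return {
        "accession": accession,
        # A dash in the Protein_ID position means that no CDS is annotated.
        "protein_id": None if protein_id == "-" else protein_id,
        "status": status,
    }
```

Grouping the parsed lines by their 'protein_id' value brings together the entry that carries the CDS feature and the entries marked 'JOINED' that hold its remaining exons.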

(3.11.6). Cross-references to the PROSITE and Pfam databases

The specific format for cross-references to the PROSITE and Pfam protein domain and family databases is:

DR   PROSITE | PFAM; ACCESSION_NUMBER; ENTRY_NAME; STATUS.

Where 'ACCESSION_NUMBER' stands for the accession number of the PROSITE or Pfam pattern, profile or HMM-profile entry; 'ENTRY_NAME' is the name of the entry and 'STATUS' is one of the following:

n
FALSE_NEG
PARTIAL
UNKNOWN_n

Where 'n' is the number of hits of the pattern or profile in that particular protein sequence. The 'FALSE_NEG' status indicates that while the pattern or profile did not detect the protein sequence, it is a member of that particular family or domain. The 'PARTIAL' status indicates that the pattern or profile did not detect the sequence because that sequence is not complete and lacks the region on which the pattern/profile is based. Finally the 'UNKNOWN' status indicates uncertainty as to whether the sequence is a member of the family or domain described by the pattern/profile. Pfam cross-references do not make use of the 'FALSE_NEG' and 'UNKNOWN' status.

Examples of PROSITE and Pfam cross-references:

DR   PROSITE; PS00107; PROTEIN_KINASE_ATP; 1.
DR   PROSITE; PS00028; ZINC_FINGER_C2H2; 6.
DR   PROSITE; PS00237; G_PROTEIN_RECEPTOR; FALSE_NEG.
DR   PROSITE; PS01128; SHIKIMATE_KINASE; PARTIAL.
DR   PROSITE; PS00383; TYR_PHOSPHATASE_1; UNKNOWN_1.

DR   PFAM; PF00017; SH2; 1.
DR   PFAM; PF00008; EGF; 8.
DR   PFAM; PF00595; PDZ; PARTIAL.
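The four possible forms of the 'STATUS' item can be distinguished with a few string tests. The Python sketch below is illustrative; the name parse_domain_dr and the labels 'HIT' and 'UNKNOWN' used in the returned dictionary are assumptions made for this example.

```python
def parse_domain_dr(line):
    """Parse a PROSITE or PFAM DR line into a dictionary."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line: %r" % line)
    body = line[2:].strip()
    if not body.endswith("."):
        raise ValueError("DR line must end with a period: %r" % line)
    fields = [field.strip() for field in body[:-1].split(";")]
    if len(fields) != 4 or fields[0] not in ("PROSITE", "PFAM"):
        raise ValueError("not a PROSITE or PFAM cross-reference: %r" % line)
    database, accession, entry_name, status = fields
    if status.isdigit():
        kind, hits = "HIT", int(status)            # 'n'
    elif status in ("FALSE_NEG", "PARTIAL"):
        kind, hits = status, None
    elif status.startswith("UNKNOWN_") and status[8:].isdigit():
        kind, hits = "UNKNOWN", int(status[8:])    # 'UNKNOWN_n'
    else:
        raise ValueError("unknown status: %r" % status)
    # Pfam cross-references do not use FALSE_NEG or UNKNOWN.
    if database == "PFAM" and kind in ("FALSE_NEG", "UNKNOWN"):
        raise ValueError("status %r is not used for Pfam" % status)
    return {"database": database, "accession": accession,
            "entry_name": entry_name, "status": kind, "hits": hits}
```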

(3.12). The KW line

The KW (KeyWord) lines provide information that can be used to generate indexes of the sequence entries based on functional, structural, or other categories. The keywords chosen for each entry serve as a subject reference for the sequence. The SWISS-PROT document KEYWLIST.TXT lists all the keywords that are used in the database. Often several KW lines are necessary for a single entry.

The format of the KW line is:

KW Keyword[; Keyword...].

More than one keyword may be listed on each KW line; semicolons separate the keywords, and the last keyword is followed by a period. Keywords may consist of more than one word (they may contain blanks), but are never split between lines. An example of a KW line is:

KW Oxidoreductase; Acetylation.

The order of the keywords is not significant. The above example could also have been written:

KW Acetylation; Oxidoreductase.
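Since the order of the keywords carries no meaning, a program comparing two entries may treat them as a set. The Python sketch below is illustrative; the name parse_kw_lines is hypothetical, and the handling of a non-final KW line that ends with a semicolon is an assumption about how multi-line lists are continued.

```python
def parse_kw_lines(lines):
    """Collect the keywords found on one or more KW lines."""
    keywords = []
    for line in lines:
        if not line.startswith("KW"):
            continue
        body = line[2:].strip()
        # The last keyword is followed by a period. Assumption: a KW line
        # that is continued on the next line ends with a semicolon instead.
        if body.endswith(".") or body.endswith(";"):
            body = body[:-1]
        # Keywords may contain blanks, so only ';' is used as a separator.
        keywords.extend(k.strip() for k in body.split(";") if k.strip())
    return keywords
```

With this helper the two example lines above yield the same set of keywords, whichever order they are written in.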

(3.13). The FT line

The FT (Feature Table) lines provide a precise but simple means for the annotation of the sequence data. The table describes regions or sites of interest in the sequence. In general the feature table lists post-translational modifications, binding sites, enzyme active sites, local secondary structure or other characteristics reported in the cited references. Sequence conflicts between references are also included in the feature table. The feature table is updated when more becomes known about a given sequence.

The FT lines have a fixed format. The column numbers allocated to each of the data items within each FT line are shown in the following table (column numbers not referred to in the table are always occupied by blanks).

Columns    Data item
 1- 2      FT
 6-13      Key name
15-20      'FROM' endpoint
22-27      'TO' endpoint
35-75      Description

The key name and the endpoints are always on a single line, but the description may require continuation. For this purpose, the next line contains blanks in the key, the 'FROM', and the 'TO' column positions, and the description is continued in its normal position. Thus a blank key always denotes a continuation of the previous description.

An example of a feature table is shown below:

FT   NON_TER       1      1
FT   SIGNAL       <1      8       BY SIMILARITY.
FT   CHAIN        10    107       UNICORNIN 2.
FT   PROPEP      108    147       REMOVED BY PEPTIDASE XRT-2.
FT   MOD_RES       9      9       AMIDATION (G-10 PROVIDE AMIDE GROUP)
FT                                (BY SIMILARITY).
FT   DISULFID     56     67
FT   CARBOHYD    114    114       N-LINKED (GLCNAC...) (POTENTIAL).
FT   CONFLICT    102    102       D -> S (IN REF. 2).
FT   CONFLICT    105    105       MISSING (IN REF. 3).

The first item on each FT line is the key name, which is a fixed abbreviation (up to 8 characters) with a defined meaning. A list of the currently defined key names can be found in Appendix A of this document.

Following the key name are the 'FROM' and 'TO' endpoint specifications. These fields designate (inclusively) the endpoints of the feature named in the key field. In general, these fields simply contain residue numbers indicating positions in the sequence as listed. Note that these positions are always specified assuming a numbering of the listed sequence from 1 to n; this numbering is not necessarily the same as that used in the original reference(s). The following should be noted in interpreting these endpoints:

• If the 'FROM' and 'TO' specifications are equal, the feature indicated consists of the single amino acid at that position;

• When a feature is known to extend beyond the end(s) of the sequenced region, the endpoint specification will be preceded by '<' for features which continue to the left end (N-terminal direction) or by '>' for features which continue to the right end (C-terminal direction);

• Unknown endpoints are denoted by ‘?'.

See also the notes concerning each of the key names in Appendix A.

The remaining portion of the FT line is a description that contains additional information about the feature. For example, for a post-translationally modified residue (key MOD_RES) the chemical nature of that modification is given, while for a sequence variation (key VARIANT) the nature of the variation is indicated. This portion of the line is generally in free form, and may be continued on additional lines when necessary.
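Because the FT line has a fixed format, it can be cut by column position rather than split on blanks. The Python sketch below is an illustration under that reading of the column table in this section; all function names are hypothetical. The helper format_ft_line is included only to build correctly aligned sample lines.

```python
def format_ft_line(key, start, end, description=""):
    """Build an FT line: key in columns 6-13, endpoints right-justified
    in columns 15-20 and 22-27, description from column 35."""
    line = "FT   %-8s %6s %6s       %s" % (key, start, end, description)
    return line.rstrip()


def parse_ft_line(line):
    """Cut one FT line into (key, from, to, description) by column."""
    if not line.startswith("FT"):
        raise ValueError("not an FT line: %r" % line)
    padded = line.rstrip("\n").ljust(75)
    return (padded[5:13].strip(),    # columns 6-13: key name
            padded[14:20].strip(),   # columns 15-20: 'FROM' endpoint
            padded[21:27].strip(),   # columns 22-27: 'TO' endpoint
            padded[34:75].strip())   # columns 35-75: description


def parse_endpoint(text):
    """Return (position, flag), where flag is '<', '>', '?' or ''."""
    if text == "?":
        return None, "?"
    if text[:1] in ("<", ">"):
        return int(text[1:]), text[0]
    return int(text), ""


def parse_feature_table(lines):
    """Group FT lines into features, merging continuation lines."""
    features = []
    for line in lines:
        key, start, end, description = parse_ft_line(line)
        if key == "":
            # A blank key denotes a continuation of the previous description.
            if not features:
                raise ValueError("continuation line without a feature")
            previous = features[-1]
            previous["description"] = (
                previous["description"] + " " + description).strip()
        else:
            features.append({"key": key, "from": start, "to": end,
                             "description": description})
    return features
```

Endpoints are kept as text in the feature dictionaries so that the '<', '>' and '?' markers are not lost; parse_endpoint converts them when a numeric position is needed.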

(3.14). The SQ line

The SQ (SeQuence header) line marks the beginning of the sequence data and gives a quick summary of its content.

The format of the SQ line is:

SQ SEQUENCE XXXX AA; XXXXX MW; XXXXXXXXXXXXXXXX CRC64;

The line contains the length of the sequence in amino acids ('AA') followed by the molecular weight ('MW') rounded to the nearest mass unit (Dalton) and the sequence 64-bit CRC (Cyclic Redundancy Check) value ('CRC64'). The algorithm to compute the CRC64 is described in the ISO 3309 standard. It should be noted that, while in theory, two different sequences could have the same CRC64 value, the likelihood that this would happen is quite low.

An example of a SQ line is shown here:

SQ SEQUENCE 233 AA; 25630 MW; 146A1B48A1475C86 CRC64;

The information in the SQ line can be used as a check on accuracy or for statistical purposes. The word 'SEQUENCE' is present solely for readability.
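One way to use the SQ line as a check on accuracy is to compare the stated length with the number of residues actually read. The Python sketch below is illustrative; the names parse_sq_line and length_matches are hypothetical, the pattern assumes that the CRC64 value is written as 16 upper-case hexadecimal digits as in the example above, and the CRC64 itself is not recomputed.

```python
import re

# Assumption: 16 upper-case hexadecimal digits for the CRC64 value.
SQ_RE = re.compile(
    r"^SQ\s+SEQUENCE\s+(\d+)\s+AA;\s+(\d+)\s+MW;\s+([0-9A-F]{16})\s+CRC64;\s*$")


def parse_sq_line(line):
    """Return the length, molecular weight and CRC64 given on an SQ line."""
    match = SQ_RE.match(line)
    if match is None:
        raise ValueError("not an SQ line: %r" % line)
    return {"length": int(match.group(1)),
            "molecular_weight": int(match.group(2)),
            "crc64": match.group(3)}


def length_matches(sq, sequence):
    """True if the sequence has the number of residues stated on the SQ line."""
    return sq["length"] == len(sequence)
```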

(3.15). The sequence data line

The sequence data line has a line code consisting of two blanks rather than the two-letter codes used up until now. The sequence is written 60 amino acids per line, in groups of 10 amino acids, beginning in position 6 of the line.

The characters used for the amino acids are the standard IUPAC one letter codes (see Appendix B).

An example of sequence data lines is shown here:

     GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA ANKNKGIIWG
     EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK ATNE
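The layout of the sequence data lines is easy to reproduce and to undo. The Python sketch below is illustrative, with hypothetical function names: read_sequence removes all blanks to recover the plain sequence, and format_sequence writes it back 60 residues per line in groups of 10, starting in position 6.

```python
def read_sequence(lines):
    """Join sequence data lines into one string without blanks."""
    return "".join("".join(line.split()) for line in lines)


def format_sequence(sequence):
    """Write a sequence as data lines: 60 residues per line, in groups
    of 10, beginning in position 6 (that is, after five blanks)."""
    lines = []
    for i in range(0, len(sequence), 60):
        chunk = sequence[i:i + 60]
        groups = [chunk[j:j + 10] for j in range(0, len(chunk), 10)]
        lines.append("     " + " ".join(groups))
    return lines
```

Reading the lines produced by format_sequence returns the original sequence, which makes the pair convenient for round-trip checks.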

(3.16). The // line

The // (terminator) line contains no data or comments. It designates the end of an entry.

Appendix A: Feature table keys

The definition of each of the key names used in the feature table is explained here. It is probable that new key names will be progressively added to this list. For each key a number of examples are presented.

(A.1). Change indicators

CONFLICT - Different papers report differing sequences.

Examples of CONFLICT key feature lines:

FT   CONFLICT     33     33       MISSING (IN REF. 2).
FT   CONFLICT     60     60       P -> A (IN REF. 3 AND 4).
FT   CONFLICT     81     84       ASTQ -> GWT (IN REF. 3).

VARIANT - Authors report that sequence variants exist.

Examples of VARIANT key feature lines:

FT   VARIANT       3      3       V -> I.
FT   VARIANT      87     87       L -> T (IN STRAIN 2.3.1).
FT   VARIANT       1      2       MISSING (IN 25% OF THE CHAINS).

VARSPLIC - Description of sequence variants produced by alternative splicing.

Examples of VARSPLIC key feature lines:

FT   VARSPLIC    194    196       GRP -> DVR (IN SHORT FORM).
FT   VARSPLIC    197    211       MISSING (IN SHORT FORM).
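The descriptions shown in the CONFLICT, VARIANT and VARSPLIC examples follow a regular pattern: either 'MISSING' or 'original -> replacement', optionally followed by a comment between parentheses. The Python sketch below is illustrative and is derived only from those examples; the names CHANGE_RE and parse_change are hypothetical, and descriptions written differently are simply reported as unrecognised.

```python
import re

# Pattern inferred from the examples above (an assumption, not a rule
# stated by the manual): 'MISSING' or 'XXX -> YYY', optional '(comment)',
# and a final period.
CHANGE_RE = re.compile(
    r"^(?:(MISSING)|([A-Z]+) -> ([A-Z]+))(?: \((.*)\))?\.$")


def parse_change(description):
    """Return a dictionary describing the change, or None if unrecognised."""
    match = CHANGE_RE.match(description)
    if match is None:
        return None
    missing, original, replacement, comment = match.groups()
    return {"missing": missing is not None,
            "original": original,
            "replacement": replacement,
            "comment": comment}
```

The MUTAGEN key uses a different notation (for example 'H->F:' without blanks) and is deliberately not covered by this pattern.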

MUTAGEN - Site which has been experimentally altered.

Examples of MUTAGEN key feature lines:

FT   MUTAGEN      65     65       H->F: 100% LOSS OF ACTIVITY.
FT   MUTAGEN     123    123       G->R,L,M: DNA BINDING LOST.

(A.2). Amino-acid modifications

MOD_RES - Post-translational modification of a residue.

The chemical nature of the modification is given in the description. The general format of the MOD_RES description field is:

FT   MOD_RES     xxx    xxx       MODIFICATION (COMMENT).

The most frequently occurring modifications are listed below.

Modification                 Description
ACETYLATION                  N-terminal or other
AMIDATION                    Generally at the C-terminal of a mature active peptide
BLOCKED                      Undetermined N- or C-terminal blocking group
FORMYLATION                  Of the N-terminal methionine
GAMMA-CARBOXYGLUTAMIC ACID   Of glutamate
HYDROXYLATION                Of asparagine, aspartic acid, proline or lysine
METHYLATION                  Generally of lysine or arginine
PHOSPHORYLATION              Of serine, threonine, tyrosine, aspartic acid or histidine
PYRROLIDONE CARBOXYLIC ACID  N-terminal glutamate which has formed an internal cyclic lactam
SULFATATION                  Generally of tyrosine

Examples of MOD_RES key feature lines:

FT   MOD_RES       1      1       ACETYLATION.
FT   MOD_RES      11     11       PHOSPHORYLATION (BY PKC).
FT   MOD_RES       2      2       SULFATATION (BY SIMILARITY).
FT   MOD_RES       8      8       AMIDATION (G-9 PROVIDE AMIDE GROUP).
FT   MOD_RES       9      9       METHYLATION (MONO-, DI- & TRI-).

LIPID - Covalent binding of a lipidic moiety

The chemical nature of the bound lipid moiety is given in the description. The general format of the LIPID description field is:

FT   LIPID       xxx    xxx       NAME OF THE ATTACHED GROUP (COMMENT).

The attached groups that are currently defined are listed below.

Attached group      Description
MYRISTATE           Myristate group attached through an amide bond to the N-terminal glycine
                    residue of the mature form of a protein [1,2] or to an internal lysine residue
PALMITATE           Palmitate group attached through a thioether bond to a cysteine residue or
                    through an ester bond to a serine or threonine residue [1,2]
FARNESYL            Farnesyl group attached through a thioether bond to a cysteine residue [3,4]
GERANYL-GERANYL     Geranyl-geranyl group attached through a thioether bond to a cysteine
                    residue [3,4]
GPI-ANCHOR          Glycosyl-phosphatidylinositol (GPI) group linked to the alpha-carboxyl group
                    of the C-terminal residue of the mature form of a protein [5,6]
N-ACYL DIGLYCERIDE  N-terminal cysteine of the mature form of a prokaryotic lipoprotein with an
                    amide-linked fatty acid and a glyceryl group to which two fatty acids are
                    linked by ester linkages [7]

References:

[1] Grand R.J.A.
    Biochem. J. 258:626-638(1989).

[2] McIlhinney R.A.J.
    Trends Biochem. Sci. 15:387-391(1990).

[3] Glomset J.A., Gelb M.H., Farnsworth C.C.
    Trends Biochem. Sci. 15:139-142(1990).

[4] Sinensky M., Lutz R.J.
    BioEssays 14:25-31(1992).

[5] Low M.G.
    FASEB J. 3:1600-1608(1989).

[6] Low M.G.
    Biochim. Biophys. Acta 988:427-454(1989).

[7] Hayashi S., Wu H.C.
    J. Bioenerg. Biomembr. 22:451-471(1990).

Examples of LIPID key feature lines:

FT   LIPID         1      1       MYRISTATE.
FT   LIPID        65     65       PALMITATE (BY SIMILARITY).
FT   LIPID       354    354       GPI-ANCHOR.

DISULFID - Disulfide bond.

The 'FROM' and 'TO' endpoints represent the two residues which are linked by an intra-chain disulfide bond. If the 'FROM' and 'TO' endpoints are identical, the disulfide bond is an interchain one and the description field indicates the nature of the cross-link. Examples of DISULFID key feature lines:

FT   DISULFID     27     44       PROBABLE.
FT   DISULFID     14     14       INTERCHAIN (WITH A LIGHT CHAIN).

THIOLEST - Thiolester bond.

The ‘FROM' and ‘TO' endpoints represent the two residues which are linked by the thiolester bond.

THIOETH - Thioether bond.

The ‘FROM' and ‘TO' endpoints represent the two residues which are linked by the thioether bond.

CARBOHYD - Glycosylation site.

This key describes the occurrence of the attachment of a glycan (mono- or polysaccharide) to a residue of the protein:

• The type of linkage (C-, N- or O-linked) to the protein is indicated.

• If the nature of the reducing terminal sugar is known, its abbreviation is shown between parentheses. If three dots '...' follow the abbreviation this indicates extension of the carbohydrate chain. Conversely the absence of the dots indicates that a single monosaccharide is linked.

Examples of CARBOHYD key feature lines:

FT   CARBOHYD     52     52       N-LINKED (GLCNAC...) (POTENTIAL).
FT   CARBOHYD    162    162       O-LINKED (GLCNAC).
FT   CARBOHYD     10     10       O-LINKED (GALNAC...) (BY SIMILARITY).
FT   CARBOHYD     34     34       C-LINKED (MAN).

SE_CYS - Selenocysteine.

This key describes the occurrence of a selenocysteine in the sequence record. Examples:

FT   SE_CYS       58     58
FT   SE_CYS       12     12       POTENTIAL.

METAL - Binding site for a metal ion.

The description field indicates the nature of the metal. Examples of METAL key feature lines:

FT   METAL        18     18       IRON (HEME AXIAL LIGAND).
FT   METAL        87     87       COPPER (POTENTIAL).

BINDING - Binding site for any chemical group (co-enzyme, prosthetic group, etc.).

The chemical nature of the group is given in the description field. Examples of BINDING key feature lines:

FT   BINDING      14     14       HEME (COVALENT).
FT   BINDING     250    250       PYRIDOXAL PHOSPHATE.

(A.3). Regions

SIGNAL - Extent of a signal sequence (prepeptide).

TRANSIT - Extent of a transit peptide (mitochondrial, chloroplastic, cyanelle or for a microbody).

Examples of TRANSIT key feature lines:

FT   TRANSIT       1     42       CHLOROPLAST.
FT   TRANSIT       1     34       CYANELLE (BY SIMILARITY).
FT   TRANSIT       1     25       MITOCHONDRION.
FT   TRANSIT       1     23       MICROBODY (POTENTIAL).

PROPEP - Extent of a propeptide.

Examples of PROPEP key feature lines:

FT   PROPEP       27     28       ACTIVATION PEPTIDE.
FT   PROPEP      550    574       REMOVED IN MATURE FORM.

CHAIN - Extent of a polypeptide chain in the mature protein.

Examples of CHAIN key feature lines:

FT   CHAIN        21    119       BETA-2 MICROGLOBULIN.
FT   CHAIN        37    >42       FACTOR XIIIA.

PEPTIDE - Extent of a released active peptide.

Examples of PEPTIDE key feature lines:

FT   PEPTIDE      13    107       NEUROPHYSIN 2.
FT   PEPTIDE     235    239       MET-ENKEPHALIN.

DOMAIN - Extent of a domain of interest on the sequence.

The nature of that domain is given in the description field. Examples of DOMAIN key feature lines:

FT   DOMAIN       22    788       EXTRACELLULAR (POTENTIAL).
FT   DOMAIN      140    152       ANCESTRAL CALCIUM SITE.

CA_BIND - Extent of a calcium-binding region.

DNA_BIND - Extent of a DNA-binding region.

The nature of the DNA-binding region is given in the description field. Examples of DNA_BIND key feature lines:

FT   DNA_BIND    335    415       ETS-DOMAIN.
FT   DNA_BIND    224    283       HOMEOBOX.
FT   DNA_BIND     16     67       MYB.
FT   DNA_BIND    135    200       TEA-DOMAIN.

NP_BIND - Extent of a nucleotide phosphate binding region.

The nature of the nucleotide phosphate is indicated in the description field. Examples of NP_BIND key feature lines:

FT   NP_BIND      13     25       ATP.
FT   NP_BIND      45     49       GTP (POTENTIAL).
FT   NP_BIND       8     34       FAD (ADP PART).

TRANSMEM - Extent of a transmembrane region.

ZN_FING - Extent of a zinc finger region.

The zinc finger ‘category’ is indicated in the description field. Examples of ZN_FING key feature lines:

FT   ZN_FING     110    134       GATA-TYPE.
FT   ZN_FING     559    579       C4-TYPE.

SIMILAR - Extent of a similarity with another protein sequence.

Precise information relative to that sequence is given in the description field. Examples of SIMILAR key feature lines:

FT   SIMILAR     351    456       STRONG, WITH KAPPA CHAIN V REGIONS.
FT   SIMILAR     580   1182       WITH ERBB TRANSFORMING PROTEIN.

REPEAT - Extent of an internal sequence repetition.

Examples of REPEAT key feature lines:

FT   REPEAT       75     85       1.
FT   REPEAT       86     96       2.
FT   REPEAT       97    107       3 (APPROXIMATE).

(A.4). Secondary structure

The feature table of sequence entries of proteins whose tertiary structure is known experimentally contains the secondary structure information corresponding to that protein. The secondary structure assignment is made according to DSSP (see Kabsch W., Sander C.; Biopolymers, 22:2577-2637(1983)) and the information is extracted from the coordinate data sets of the Protein Data Bank (PDB).

In the feature table only three types of secondary structure are specified: helices (key HELIX), beta-strands (key STRAND) and turns (key TURN). Residues not specified in one of these classes are in a 'loop' or 'random-coil' structure. Because the DSSP assignment has more than the three common secondary structure classes, we have converted the different DSSP assignments to HELIX, STRAND, and TURN as shown in the table below.

DSSP code  DSSP definition                                  SWISS-PROT assignment
H          Alpha-helix                                      HELIX
G          3(10) helix                                      HELIX
I          Pi-helix                                         HELIX
E          Hydrogen bonded beta-strand (extended strand)    STRAND
B          Residue in an isolated beta-bridge               STRAND
T          H-bonded turn (3-turn, 4-turn or 5-turn)         TURN
S          Bend (five-residue bend centered at residue i)   Not specified
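The conversion table above can be applied to a per-residue string of DSSP codes. The Python sketch below is illustrative; the names DSSP_TO_SWISSPROT and dssp_to_features are hypothetical. It assumes one DSSP character per residue, treats any character absent from the mapping (including 'S') as not specified, and merges adjacent residues with the same assignment into one segment, which is a simplification that may not match how the database itself delimits segments.

```python
from itertools import groupby

# Mapping taken from the conversion table; 'S' (bend) is not specified.
DSSP_TO_SWISSPROT = {
    "H": "HELIX", "G": "HELIX", "I": "HELIX",
    "E": "STRAND", "B": "STRAND",
    "T": "TURN",
}


def dssp_to_features(dssp):
    """Return (key, from, to) tuples with 1-based inclusive endpoints."""
    features = []
    position = 1
    # Consecutive residues with the same assignment form one segment.
    for key, run in groupby(dssp, key=DSSP_TO_SWISSPROT.get):
        length = len(list(run))
        if key is not None:
            features.append((key, position, position + length - 1))
        position += length
    return features
```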

One should be aware of the following facts:

a) Segment length. For helices (alpha and 3-10), the residue just before and just after the helix as given by DSSP participates in the helical hydrogen bonding pattern with a single H-bond. For some practical purposes, one can therefore extend the HELIX range by one residue on each side, e.g. HELIX 25-35 instead of HELIX 26-34. Also, the ends of secondary structure segments are less well defined for lower-resolution structures. A fluctuation of +/- one residue is common.

b) Missing segments. In low-resolution structures, badly formed helices or strands may be omitted in the DSSP definition.

c) Special helices and strands. Helices of length three are 3-10 helices, those of length four and longer are either alpha-helices or 3-10 helices (pi helices are extremely rare). A strand of length one corresponds to a residue in an isolated beta-bridge. Such bridges can be structurally important.

d) Missing secondary structure. No secondary structure is currently given in the feature table in the following cases:

• No sequence data in the PDB entry;

• Structure for which only C-alpha coordinates are in PDB;

• NMR structure with more than one coordinate data set;

• Model (i.e. theoretical) structure.

Examples:

FT   HELIX         3     14
FT   TURN         15     15
FT   TURN         20     21
FT   STRAND       23     23
FT   HELIX        25     35

(A.5). Others

ACT_SITE - Amino acid(s) involved in the activity of an enzyme.

Examples of ACT_SITE key feature lines:

FT   ACT_SITE    193    193       ACCEPTS A PROTON DURING CATALYSIS.
FT   ACT_SITE     99     99       CHARGE RELAY SYSTEM.

SITE - Any other interesting site on the sequence.

Examples of SITE key feature lines:

FT   SITE        285    288       PREVENTS SECRETION FROM ER.
FT   SITE        241    242       CLEAVAGE (BY ANIMAL COLLAGENASES).

INIT_MET - Initiator methionine.

This feature key is mostly associated with a zero value in the 'FROM' and 'TO' fields to indicate that the initiator methionine has been cleaved off and is not shown in the sequence:

FT   INIT_MET      0      0

It is not used for cases where the initiator methionine is not cleaved off except to indicate internal alternative initiation sites. Example:

FT   INIT_MET     44     44       FOR CYTOPLASMIC ISOFORM.

NON_TER - The residue at an extremity of the sequence is not the terminal residue.

If applied to position 1, this signifies that the first position is not the N-terminus of the complete molecule. If applied to the last position, it signifies that this position is not the C-terminus of the complete molecule. There is no description field for this key. Examples of NON_TER key feature lines:

FT   NON_TER       1      1
FT   NON_TER     150    150

NON_CONS - Non-consecutive residues.

Indicates that two residues in a sequence are not consecutive and that there are a number of unsequenced residues between them. Examples of NON_CONS key feature lines:

FT   NON_CONS   1036   1037
FT   NON_CONS     33     34       N-TERMINAL / C-TERMINAL.

UNSURE - Uncertainties in the sequence.

Used to describe region(s) of a sequence for which the authors are unsure about the sequence assignment.

Appendix B: Amino acid codes

The one-letter and three-letter codes for amino acids used in SWISS-PROT are those adopted by the Commission on Biochemical Nomenclature of the IUPAC-IUB (see the reference listed below).

One-letter code  Three-letter code  Amino-acid name
A                Ala                Alanine
R                Arg                Arginine
N                Asn                Asparagine
D                Asp                Aspartic acid
C                Cys                Cysteine
Q                Gln                Glutamine
E                Glu                Glutamic acid
G                Gly                Glycine
H                His                Histidine
I                Ile                Isoleucine
L                Leu                Leucine
K                Lys                Lysine
M                Met                Methionine
F                Phe                Phenylalanine
P                Pro                Proline
S                Ser                Serine
T                Thr                Threonine
W                Trp                Tryptophan
Y                Tyr                Tyrosine
V                Val                Valine
B                Asx                Aspartic acid or Asparagine
Z                Glx                Glutamic acid or Glutamine
X                Xaa                Any amino acid

Reference

IUPAC-IUB Joint Commission on Biochemical Nomenclature (JCBN).
Nomenclature and Symbolism for Amino Acids and Peptides. Recommendations 1983.
Eur. J. Biochem. 138:9-37(1984).

See also: http://www.chem.qmw.ac.uk/iupac/AminoAcid/
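For programs that need to convert between the two notations, the code table of this appendix can be held in a simple lookup structure. The Python sketch below is one possible way to do so. The names ONE_TO_THREE and to_three_letter, and the use of a hyphen as separator in the output, are illustrative choices and not part of SWISS-PROT.

```python
# One-letter to three-letter codes, copied from the table in this appendix.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "B": "Asx",  # Aspartic acid or Asparagine
    "Z": "Glx",  # Glutamic acid or Glutamine
    "X": "Xaa",  # Any amino acid
}


def to_three_letter(sequence):
    """Translate a one-letter sequence into hyphen-separated three-letter codes."""
    # A KeyError is raised for any character that is not in the table.
    return "-".join(ONE_TO_THREE[residue] for residue in sequence.upper())


print(to_three_letter("MKX"))
```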

Appendix C: Format differences between the SWISS-PROT and EMBL databases

(C.1). Generalities

The format of SWISS-PROT follows as closely as possible that of the EMBL database. The general structure of an entry is identical in both databases. The data classes used in both databases are the same except that SWISS-PROT does not make use of the 'BACKBONE', 'UNREVIEWED' and 'UNANNOTATED' data classes. One line type used in SWISS-PROT does not exist in the EMBL database (see section C.3); conversely, SWISS-PROT does not currently make use of every EMBL line type (see section C.4).

(C.2). Differences in line types present in both databases

(C.2.1). The ID line (IDentification)

Differences with the EMBL database ID line format are:

• The entry name can be up to 10 characters long (instead of 9 in EMBL) and can begin with a numerical character;

• EMBL entry ID lines have an additional three-letter taxonomic division 'token' inserted between the data class and the molecule type;

• The molecule type is listed as 'PRT' rather than 'DNA' or 'RNA';

• The length of the molecule is followed by 'AA' (Amino Acid) instead of 'BP' (Base Pairs).

(C.2.2). The AC line (ACcession number)

The format of this line type completely follows that defined by the EMBL database. SWISS-PROT accession numbers do not overlap with those used in the EMBL/GenBank/DDBJ nucleotide sequence database. However, it should be noted that there are differences in the format of the accession numbers themselves. In SWISS-PROT, accession numbers consist of 6 alphanumerical characters in the following format:

1        2      3           4           5           6
[O,P,Q]  [0-9]  [A-Z, 0-9]  [A-Z, 0-9]  [A-Z, 0-9]  [0-9]

Examples: P01234; Q1AA12.

In EMBL, two different types of accession numbers co-exist:

1. Accession numbers with 6 alphanumerical characters, where the first character is any letter with the exception of O, P or Q and the five other characters are numbers (example: M23765);

2. Accession numbers with 8 alphanumerical characters, where the first two characters are letters and the following six characters are numbers (example: AB001084).
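The accession number formats described in this section translate directly into regular expressions. The Python sketch below is one possible rendering. The pattern names and the function classify_accession are illustrative, and letters are assumed to be upper case.

```python
import re

# SWISS-PROT: [O,P,Q] [0-9] [A-Z,0-9] [A-Z,0-9] [A-Z,0-9] [0-9]
SWISSPROT_AC = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")
# EMBL, type 1: any letter except O, P or Q, followed by five digits.
EMBL_AC_6 = re.compile(r"[A-NR-Z][0-9]{5}")
# EMBL, type 2: two letters followed by six digits.
EMBL_AC_8 = re.compile(r"[A-Z]{2}[0-9]{6}")


def classify_accession(accession):
    """Return 'SWISS-PROT', 'EMBL' or 'unknown' for an accession number."""
    if SWISSPROT_AC.fullmatch(accession):
        return "SWISS-PROT"
    if EMBL_AC_6.fullmatch(accession) or EMBL_AC_8.fullmatch(accession):
        return "EMBL"
    return "unknown"


for ac in ("P01234", "Q1AA12", "M23765", "AB001084"):
    print(ac, classify_accession(ac))
```

Because the first character of a 6-character EMBL accession number is never O, P or Q, no string can match both the SWISS-PROT pattern and an EMBL pattern.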

(C.2.3). The DT line (DaTe)

Differences with the EMBL database DT line format are:

• In EMBL there are two DT lines per entry instead of three in SWISS-PROT;

• In EMBL the format of the DT line that indicates when an entry was created is identical to that defined in SWISS-PROT; but the two DT lines that convey information relevant to the updating of an entry are replaced by a single line in EMBL. This is shown in the example below.

DT lines in a SWISS-PROT entry:

DT   21-JUL-1986 (Rel. 01, Created)
DT   23-OCT-1986 (Rel. 02, Last sequence update)
DT   01-APR-1990 (Rel. 14, Last annotation update)

DT lines in an EMBL database entry:

DT   10-MAR-1990 (Rel. 22, Created)
DT   12-APR-1990 (Rel. 23, Last updated, Version 3)
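Both flavours of DT line shown in the examples above can be read with a single pattern, since they differ only in the text that follows the release number. The Python sketch below illustrates this. The names DT_PATTERN and parse_dt_line are hypothetical, and the pattern is inferred from the example lines of this section only.

```python
import re
from datetime import date

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

# 'DT', a DD-MMM-YYYY date, then '(Rel. NN, free text)'.
DT_PATTERN = re.compile(
    r"DT\s+(\d{2})-([A-Z]{3})-(\d{4})\s+\(Rel\. (\d+), ([^)]+)\)"
)


def parse_dt_line(line):
    """Split a DT line into its date, release number and event text."""
    match = DT_PATTERN.fullmatch(line.strip())
    if match is None:
        raise ValueError("not a DT line: %r" % line)
    day, month, year, release, event = match.groups()
    return {
        "date": date(int(year), MONTHS.index(month) + 1, int(day)),
        "release": int(release),
        "event": event,
    }


print(parse_dt_line("DT   23-OCT-1986 (Rel. 02, Last sequence update)"))
```

The month names are looked up in an explicit list rather than parsed with a locale-dependent date routine, so the result does not depend on the language settings of the computer.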

(C.2.4). The DE line (DEscription)

• In SWISS-PROT the species of origin is not included in the description;

• In EMBL the last DE line is not terminated by a period.

(C.2.5). The OS line (Organism Species)

• In some cases the SWISS-PROT OS line includes more than one organism name (when the relevant sequence is completely conserved in different species);

• In EMBL the last OS line is not terminated by a period.

(C.2.6). The OG line (OrGanelle)

• EMBL makes a distinction between 'Mitochondrion' and 'Kinetoplast', while SWISS-PROT does not use the latter designation;

• EMBL makes a distinction between 'Chloroplast' and 'Plastid', while SWISS-PROT does not use the latter designation;

• In EMBL the OG line is not terminated by a period.

(C.2.7). The RP and RC lines

• In EMBL, contrary to SWISS-PROT, the RC line precedes the RP line;

• In EMBL the RC line is in free format and is generally not used.

(C.2.8). The RT line (Reference Title)

• In EMBL the reference title is not terminated by a period, a question mark or an exclamation mark.

(C.2.9). The FT line (Feature Table)

The format of this line is totally different from that currently defined for the EMBL database. The format used in SWISS-PROT is similar to that which was used in older versions of the EMBL database, prior to the introduction of the common EMBL/GenBank/DDBJ feature table.

(C.2.10). The CC line (Comment)

The comment lines, which are free text and can appear anywhere in an EMBL entry, are grouped together in the SWISS-PROT database. They are always listed below the last reference line, and follow a precise syntax (see section 3.10).

(C.2.11). The SQ line (SeQuence header)

Although the general format and purpose of this line type are conserved, its exact content differs from that of the EMBL database. The numerical length of the sequence is listed, followed by 'AA' (Amino Acid) instead of 'BP' (Base Pairs). To replace the sequence composition which, for protein sequences, would not fit in a single line, the molecular weight and the 64-bit CRC (Cyclic Redundancy Check) value of the sequence are indicated.

(C.3). Line types defined by SWISS-PROT but currently not used by EMBL

Presently, there is only one line type which exists in SWISS-PROT and which is not used in the EMBL database; it is the GN line.

(C.4). Line types defined by EMBL but currently not used by SWISS-PROT

There are three line types which exist in the EMBL database and which are not, presently, used in SWISS-PROT:

• FH and XX. The FH and XX lines contain no data and are present in EMBL only to improve readability of an entry when it is printed or displayed on a terminal screen. These lines are not included in SWISS-PROT so as to keep it as compact as possible and thereby facilitate its use on small computer systems.

• SV. The SV (Sequence Version) line contains an identifier specific to nucleic acid sequences. It has no meaning in the context of SWISS-PROT.