Vol. 5, No. 1, January-June 2012, pp. 19-23, Published by Serials Publications, ISSN: 0973-7413

Representative Protein Sequence and Structure Database

P.J. Magesh¹, Kasturi Brindha K.² and Abdul Usman W.³

¹ Department of Bioinformatics, Hindustan College of Arts and Science, Old Mahabalipuram Road, Padur, Kelambakkam, Chennai-603013, India. E-mail: [email protected]; [email protected]

² Department of Bioinformatics, Hindustan College of Arts and Science, Old Mahabalipuram Road, Padur, Kelambakkam, Chennai-603013, India. E-mail: [email protected]

³ IBAB, ITPB, White Field, Bangalore-560066, India. E-mail: [email protected], [email protected]

ABSTRACT: The Representative Protein Sequence and Structure Database (RPSSD) provides information about a non-redundant protein dataset (1573 proteins) obtained from the Protein Data Bank (PDB). For each protein, the information includes the PDB ID, length of the protein, resolution, PDB secondary structure, PDB secondary structure summary, PHD secondary structure prediction, PHD secondary structure prediction summary, and sequence. We further revised the PDB secondary structure summary by reducing the eight-state assignments to three states. The database provides additional information about the revised PDB secondary structure summary and the revised PHD secondary structure prediction summary, which is often useful for further improvement of prediction methods. Proteins of unknown 3D structure and function were searched against the RPSSD, and their secondary structure patterns were found to be similar to those of proteins of known structure in our database. This secondary structure information is furthermore useful in recognizing novel folds in proteins of unknown 3D structure. We are in the process of developing an interactive web interface to the non-redundant protein dataset to enhance the prediction methods and to establish this approach as a novel fold recognition method.

Keywords: Bioinformatics; Protein structure prediction; Biocomputing; Protein database; Fold recognition

Proteins occupy a central position in the architecture and functioning of living matter. They are intimately connected with all the phases of chemical and physical activity that constitute the life of the cell. The amino acid sequence of a protein contains interesting information in and of itself. A protein sequence can be compared and contrasted with the sequences of other proteins to establish its relationship, if any, to known protein families, and to provide information about the evolution of biochemical function. However, for the purpose of understanding protein function, the 3D structure of a protein is far more useful than its sequence. In brief, sequence determines structure, which in turn determines function.

One of the most significant problems in biomedical research today is the prediction of protein structure from knowledge of the primary amino acid sequence. Large-scale genome sequencing efforts have made this problem even more significant, since the growth in the number of sequences has easily outpaced the elucidation of protein structures. Therefore, theoretical approaches are needed to determine the structure of sequences for which experimental data are not yet available. A number of real-world applications benefit from knowledge of protein structure, including the discovery of mechanisms of action and structure-based drug design. It is difficult to predict the three-dimensional structure of a protein from sequence alone. We can predict the 3D structure of a protein sequence by homology modeling when it has significant sequence identity (>25%) to a known 3D structure in the PDB. For the remaining sequences, which do not have significant sequence identity, the prediction problem has to be simplified using secondary structure predictions for each residue.

Moreover, the number of protein sequences is growing much faster than our ability to solve their structures experimentally (e.g. using X-ray crystallography), creating an ever-widening sequence-structure gap and increasing the pressure to solve the protein folding problem. Proteins consist of secondary structure elements (helices, sheets, coils, etc.). Predicting these elements may help us understand more about the function of a protein without determining its three-dimensional structure. Further, prediction of secondary structure is believed to be a step towards the prediction of the three-dimensional structure of a protein [2]. Structure prediction provides functional insight into the underlying concepts of protein three-dimensional structure. The thrust of structure prediction lies in recognizing novel folds in proteins of unknown function. Fold recognition is a relatively new branch of protein structure prediction in which the amino acid sequence of the protein of interest is compared directly with all known structures.

In drug discovery, fold recognition is useful in two ways. First, it is a powerful tool in target identification and validation because it allows the transfer of knowledge from one protein to another, even if the two proteins appear to be unrelated. Showing that a potential drug target can adopt a structure similar to that of another protein whose biochemical function has been verified experimentally provides vital information about the structure and function of the target. Alternatively, a protein could become a potential drug target when it is shown to be related to a known drug target.

Second, fold recognition is an essential part of comparative modeling. Comparative modeling occurs after target identification and validation, and plays an important role in lead generation and optimization. In this later stage of drug development, a model of the three-dimensional (3D) structure of the target is built on the structure of the homologue, which provides a basis for structure-based drug design.

1. COMPARATIVE MODELLING

If an empirically determined 3D structure is available for a sufficiently similar protein (50% or better sequence identity would be good), you can use software that arranges the backbone of your sequence identically to this template. This is called “comparative modeling” or “homology modeling”. It is, at best, moderately accurate for the positions of alpha carbons in the 3D structure, in regions where the sequence identity is high [7]. It is inaccurate for the details of side chain positions and for inserted loops with no matching sequence in the solved structure.

A comparative modeling routine needs three items of input:

1. The sequence of the protein with unknown 3D structure, the “target sequence”.

2. A 3D template, chosen by virtue of having the highest sequence identity with the target sequence. The 3D structure of the template must be determined by reliable empirical methods such as crystallography or NMR, and is typically a published atomic coordinate “PDB” file from the Protein Data Bank.

3. An alignment between the target sequence and the template sequence.
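The template choice in item 2 can be illustrated with a short sketch. It assumes the candidate templates have already been aligned to the target (item 3) and that identity is counted over columns without gaps, which is only one of several conventions in use. The function names and the toy sequences are hypothetical and are not taken from any modeling package.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over the gap-free columns of a pairwise alignment."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def pick_template(alignments: dict[str, tuple[str, str]]) -> tuple[str, float]:
    """Return (template_id, identity) for the template with the highest identity.

    `alignments` maps a template id to (aligned target, aligned template).
    """
    scored = {tid: percent_identity(t, s) for tid, (t, s) in alignments.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Toy data (hypothetical sequences and template ids).
candidates = {
    "tmplA": ("MKT-LLV", "MKTALLV"),
    "tmplB": ("MKTLLV", "MRTLIV"),
}
print(pick_template(candidates))
```

In practice the identity thresholds mentioned in the text (>25% for homology modeling, 50% or better for a good model) would be checked against the returned value before a template is accepted.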

1.1 Useful Online Resources for Comparative Modeling

EVA (Evaluation of automatic structure prediction servers) [7]

http://maple.bioc.columbia.edu/eva/doc/intro_cm.html

3D-JIGSAW

http://www.bmm.icnet.uk/servers/3djigsaw/

CPH Models 2.0 server

http://www.cbs.dtu.dk/services/CPHmodels/

ESypred 3D web server 1.0

http://www.fundp.ac.be/urbm/bioinfo/esypred/

2. METHODS AND RESOURCES

2.1 NPS@ Web Server

1. NPS@ stands for Network Protein Sequence Analysis.

2. NPS@ is an interactive web server dedicated to protein analysis and available to the biological community at http://npsapbil.ibcp.fr/cgibin/npsa_automat.pl?page=/NPSA/npsa_server.html.

3. NPS@ is the “protein part” of the “Pôle Bio-Informatique Lyonnais” (PBIL).

The server performs secondary structure prediction with 12 different methods and provides a consensus prediction of those methods. The available methods are SOPM, SOPMA, HNN, MLRC, DPM, DSC, GOR I, GOR II, GOR IV, PHD, PREDATOR and SIMPA96.
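The idea of a consensus prediction can be sketched as a per-residue majority vote. This is an illustration only: the text does not describe the rule that NPS@ actually applies, and both the function name and the choice of coil (C) for ties are assumptions made for this example.

```python
from collections import Counter


def consensus(predictions: list[str], tie: str = "C") -> str:
    """Per-residue majority vote over equal-length secondary structure strings.

    Assumption: a tie between the most frequent states is reported as `tie`.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    result = []
    for column in zip(*predictions):
        ranked = Counter(column).most_common()
        top_state, top_count = ranked[0]
        if len(ranked) > 1 and ranked[1][1] == top_count:
            result.append(tie)  # no single majority state at this residue
        else:
            result.append(top_state)
    return "".join(result)


# Three hypothetical method outputs for a four-residue fragment.
print(consensus(["HHEC", "HHCC", "HEEC"]))
```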

2.2 PHD (Profile Network from Heidelberg)

The accuracy of protein secondary structure prediction has been improved significantly by using the evolutionary information contained in multiple sequence alignments [5]. An important aspect of PHD's design is that it was trained on a carefully selected dataset in which all pairs of proteins have low pairwise sequence identity (<25%). This is necessary because homology alignment predicts secondary structure better than any other method, and the set is used for careful cross-validation studies.

2.3 Protein Data Bank

PDB is the single worldwide repository for the processing and distribution of 3D structure data of large biological molecules such as proteins and nucleic acids. PDB is available at

URL: http://www.rcsb.org/pdb

2.4 Sequence Details

Each residue in the sequence is reported as a single-letter code. Secondary structure is calculated and described according to an implementation of the method of Kabsch and Sander (1983), Biopolymers 22, 2577-2637. The assignments are: H = helix; B = residue in isolated beta bridge; E = extended beta strand; G = 3₁₀ helix; I = pi helix; T = hydrogen bonded turn; S = bend.

2.5 Dataset Construction

For the dataset construction we performed the following steps for 1573 proteins, which represent the non-redundant protein dataset collected from the Protein Data Bank.

• Collecting the PDB secondary structure information alone from the sequence details using Perl programming. Ex: 1AVOA (11S regulator)

CCCCHHHHHHHHHHHHHHHHHHHHHHHTHHHHHHHHHHHHHHTTTTCCSCTTSSCCCCCC

• Representing the secondary structure information in the form of a summary, in which each run of identical states is collapsed to a single letter.

CHTHTCSCTSC

• Collecting the PHD secondary structure prediction for the proteins of interest.

Ex: 1AVOA

CCCCHHHHHHHHHHHHHHHHHHHHHHHHcChHHHHHHHHHHcCccchHhhHHHhCCCCCC

• Representing the secondary structure prediction information in the form of a summary.

CHcChHcCchHhHhC

• Finally, entering all the data into the database.
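The summary step above can be reproduced with a few lines of code. The sketch below is written in Python rather than the Perl used by the authors, and the function name is hypothetical. It assumes that a summary is obtained by collapsing each run of identical characters (case-sensitive) to one character, which is consistent with both 1AVOA examples given above.

```python
from itertools import groupby


def summarize(ss: str) -> str:
    """Collapse each run of identical states to a single character."""
    return "".join(state for state, _run in groupby(ss))


# 1AVOA strings as given in the text.
pdb_ss = "CCCCHHHHHHHHHHHHHHHHHHHHHHHTHHHHHHHHHHHHHHTTTTCCSCTTSSCCCCCC"
phd_ss = "CCCCHHHHHHHHHHHHHHHHHHHHHHHHcChHHHHHHHHHHcCccchHhhHHHhCCCCCC"

print(summarize(pdb_ss))  # CHTHTCSCTSC
print(summarize(phd_ss))  # CHcChHcCchHhHhC
```

Because the collapse is case-sensitive, the upper- and lower-case letters of the PHD output are kept as distinct states in the summary.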

Figure 1: RPSSD Data Flow-overview (RPSSD prediction flow chart)

Software used: Oracle 9i, Apache Tomcat, SQL*Loader

3. RDBMS (RELATIONAL DATABASE MANAGEMENT SYSTEMS)

ADVANTAGES:

• Complex data storage and retrieval systems can be designed with ease (and without conventional programming).

• Support for very large databases.

• Automatic optimization of searching (when possible).

Figure 2: RPSSD Database Model

4. RESULTS AND DISCUSSION

To assess the prediction accuracy of PHD, we compared the PHD secondary structure prediction (PHDSS) summary with the PDB secondary structure (PDBSS) summary and obtained the following results.

• Out of 1573 proteins, 17 have PHDSS and PDBSS summaries of equal length, of which 4 have identical summaries.

After reducing the 8-state assignments of the PDB secondary structure summary (H, B, G, T, C, I, S, E) to 3 states (H, C, E):

• Out of 1573 proteins, 30 have PHDSS and PDBSS summaries of equal length, of which 17 have exactly identical summaries, which shows an improvement in the accuracy.

The comparison of the PHDSS summaries of 60 proteins of unknown function from Mycobacterium tuberculosis H37Rv (lab strain) against the RPSSD resulted in a few significant hits.

• NT02MT2765 (host cell receptor binding protein-related) obtained 28 hits from the PDBSS summary and 1 hit from the PHDSS summary.

Of the 28 hits, two from the PDBSS summary were found to be exactly identical to that of the query (NT02MT2765):

1) 1TIA_: Lipase (triacylglycerol acylhydrolase)

2) 1JTDB: tem-1 beta-lactamase

5. SEQUENCE COMPARISON

5.1 ClustalW Alignment (Query Vs Hits)

Even though the query shows low sequence similarity (<30%) with the hits, it shares 100% identity with them in the PDBSS summary.

No. of PSI-BLAST hits for the query (NT02MT2765) against PDB: 2

PDBID | DESCRIPTION | IDENTITIES | EVALUE | SCORE
1DEQA | Chain A, The Crystal Structure Of Modified Bovine Fibrinogen | 30% | 2.8 | 64
1KHZA | Chain A, structure of pyrophosphate dependent phosphofructokinase from the Lyme disease spirochete Borrelia burgdorferi | 39% | 6.2 | 61


The PSI-BLAST hits obtained for the query (NT02MT2765) share very low sequence similarity with it, with E-values worse than the threshold. At this level of similarity, the better alternative is to compare secondary structure summaries.

The hits obtained from the database provide better information about secondary structure similarity for recognizing folds, which in turn is applicable to protein three-dimensional structure prediction.
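A search of this kind can be sketched as a lookup of the query summary against the stored summaries. The example below is an assumption-laden illustration, not the RPSSD implementation: the database is a plain dictionary rather than an Oracle table, the entry ids and summaries are invented, and hits are simply split into identical summaries and summaries of equal length.

```python
def search_summaries(query: str, database: dict[str, str]) -> dict[str, list[str]]:
    """Group database entries by how their summary relates to the query summary."""
    identical = sorted(pid for pid, s in database.items() if s == query)
    same_length = sorted(
        pid for pid, s in database.items()
        if len(s) == len(query) and s != query
    )
    return {"identical": identical, "same_length": same_length}


# Hypothetical entries; real RPSSD records are keyed by PDB ID and chain.
toy_db = {
    "demo1": "CHCEC",
    "demo2": "CECHC",
    "demo3": "CHC",
}
print(search_summaries("CHCEC", toy_db))
```

Hits that are similar in pattern but differ in length, as reported for some queries below, would need an alignment-based comparison of the summaries instead of the exact match used here.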

• NT02MT3818 (Whi B2) obtained 12 hits with the PDBSS summary and 21 hits with the PHDSS summary.

Of the 21 hits, seven from the PHDSS summary were found to be exactly identical.

The 12 hits from the PDBSS summary are similar in pattern but differ in length.

Figure 3: Multiple Sequence Alignment for 1TIA_, 1JTD_B, NT02MT2765

No significant similarity was found for the query (NT02MT3818: Whi B) protein when it was searched with BLAST against the PDB.

The results obtained for the query (NT02MT3818) against our non-redundant dataset reveal similarity in the secondary structure pattern between the query and the hits.

• NT02MT1724 (nitrogen fixation protein, nif U) obtained 1 hit from the PDBSS summary and 1 hit from the PHDSS summary.

PDBSS summary hit for NT02MT1724

1I4WA: mitochondrial replication protein mtf1

PHDSS summary hit for NT02MT1724

1LQTA: fpra

• NT02MT1724 (thiol-specific antioxidant protein, putative) obtained 1 hit from the PDBSS summary.

1NC5A: hypothetical protein yter

6. SUMMARY

The database was created from a non-redundant protein dataset (1573 proteins). Secondary structures were predicted for the representative protein set in order to compare the prediction results with the experimental results. The secondary structures obtained from PDB (experimental data) were further refined by reducing the 8-state assignments (H, B, G, I, T, C, E, S) to 3 states (H, C, E) to improve the accuracy.
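The reduction from eight states to three can be sketched as a simple letter mapping. The text does not state which mapping was used, so the table below is an assumption: it follows a widely used convention (H, G, I to H; E, B to E; T, S, C to C). The names `EIGHT_TO_THREE`, `reduce_to_three_states` and `summarize` are hypothetical.

```python
from itertools import groupby

# Assumed mapping (a common convention, not confirmed by the paper).
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",  # helices
    "E": "E", "B": "E",            # strands and bridges
    "T": "C", "S": "C", "C": "C",  # turns, bends and coil
}


def reduce_to_three_states(ss: str) -> str:
    """Map an 8-state secondary structure string to 3 states (H, E, C)."""
    return "".join(EIGHT_TO_THREE[state] for state in ss)


def summarize(ss: str) -> str:
    """Collapse each run of identical states to a single character."""
    return "".join(state for state, _run in groupby(ss))


# The 8-state PDBSS summary of 1AVOA from the dataset construction example.
print(summarize(reduce_to_three_states("CHTHTCSCTSC")))  # CHCHC
```

Under this assumed mapping, turns and bends merge into coil, so the reduced summary is shorter and less sensitive to small differences in the 8-state assignment. The lower-case letters of the PHD output are not covered by this table and would need their own rule.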

Reducing the 8-state assignments to 3 states provides a better basis for comparing the PHD summary of the query sequence with the summaries in the database. The hits obtained for a query reflect similarity in secondary structure rather than in sequence. We expect that the folds of proteins of unknown structure can be recognized with this approach.

7. CONCLUSION

The library of complete genomes and experimentally determined protein structures will continue to expand and, combined with novel approaches for structure prediction, fold recognition methods are likely to improve and become an essential tool in applications including drug discovery. With the increase in sequencing speed and in the number of target species, the average achievable accuracy of prediction still has potential for improvement. This database of non-redundant protein datasets will facilitate the future development of structure prediction methods.

Acknowledgements

I wish to express my sincere gratitude to the almighty.

I am extremely thankful to the Directors, Hindustan College of Arts and Science, for their continuous support and encouragement.

References

[1] Rost, B., Sander, C. (1993). “Prediction of Protein Secondary Structure at Better than 70% Accuracy”. J. Mol. Biol., 232, 584-599.

[2] Rost, B., Sander, C., Schneider, R. (1994). “Redefining the Goals of Protein Secondary Structure Prediction”. J. Mol. Biol., 235, 13-26.

[3] Rost, B., Sander, C. (1993). “Improved Prediction of Protein Secondary Structure by Use of Sequence Profiles and Neural Networks”. Proc. Natl. Acad. Sci. U.S.A., 90, 7558-7562.

[4] Rost, B., Sander, C., Schneider, R. (1994). “PHD - An Automatic Server for Protein Secondary Structure Prediction”. CABIOS, 10, 53-60.

[5] Rost, B., Sander, C. (1994). “Combining Evolutionary Information and Neural Networks to Predict Protein Secondary Structure”. Proteins, 19, 55-72.

[6] Rost, B., Sander, C., Schneider, R. (1994). “Evolution and Neural Networks - Protein Secondary Structure Prediction Above 71% Accuracy”.

[7] Koh, I.Y. et al. (2003). “EVA: Evaluation of Protein Structure Prediction Servers”. Nucleic Acids Res., 31, 3311-3315.

[8] Rost, B., Sander, C. (1995). “PHD: Predicting 1D Protein Structure by Profile based Neural Networks”. In: Doolittle, R. (ed.) Computer Methods for Macromolecular Sequence Analysis, Methods in Enzymology, 266, 525-539 (1996).

[9] Burkhard Rost (1998). “Protein Structure Prediction in 1D, 2D, and 3D”. Encyclopedia of Computational Chemistry, in press.

[10] William T. Katz. “Protein Secondary Structure Prediction, A Review of the Problems and Methods”. Biochem 218.

[11] Dmitrij Frishman, Patrick Argos (1997). “The Future of Secondary Structure Prediction Accuracy”. Folding & Design, 2, 159-162.

[12] Kenji Mizuguchi (2004). “Fold Recognition for Drug Discovery”. DDT: Targets, 3, 18-23.

[13] Burkhard Rost (1999). “Neural Networks for Protein Structure Prediction: Hype or Hit”.

[14] Hobohm, U., Scharf, M., Schneider, R., Sander, C. (1992). “Selection of Representative Protein Data Sets”. Protein Sci., 1, 409-417.