BIOLOGICAL AND MEDICAL PHYSICS, BIOMEDICAL ENGINEERING

Editor-in-Chief: Elias Greenbaum, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA

Volumes Published in This Series:

The Physics of Cerebrovascular Diseases, Hademenos, G.J., and Massoud, T.F., 1997

Lipid Bilayers: Structure and Interactions, Katsaras, J., 1999

Physics with Illustrative Examples from Medicine and Biology: Mechanics, Second Edition, Benedek, G.B., and Villars, F.M.H., 2000

Physics with Illustrative Examples from Medicine and Biology: Statistical Physics, Second Edition, Benedek, G.B., and Villars, F.M.H., 2000

Physics with Illustrative Examples from Medicine and Biology: Electricity and Magnetism, Second Edition, Benedek, G.B., and Villars, F.M.H., 2000

Physics of Pulsatile Flow, Zamir, M., 2000

Molecular Engineering of Nanosystems, Rietman, E.A., 2001

Biological Systems Under Extreme Conditions: Structure and Function, Taniguchi, Y. et al., 2001

Intermediate Physics for Medicine and Biology, Third Edition, Hobbie, R.K., 2001

Epilepsy as a Dynamic Disease, Milton, J., and Jung, P. (Eds.), 2002

Photonics of Biopolymers, Vekshin, N.L., 2002

Photocatalysis: Science and Technology, Kaneko, M., and Okura, I., 2002

E. coli in Motion, Berg, H.C., 2004

Biochips: Technology and Applications, Xing, W.-L., and Cheng, J. (Eds.), 2003

Laser-Tissue Interactions: Fundamentals and Applications, Niemz, M., 2003

Medical Applications of Nuclear Physics, Bethge, K., 2004

Biological Imaging and Sensing, Furukawa, T. (Ed.), 2004

Biomaterials and Tissue Engineering, Shi, D., 2004

Biomedical Devices and Their Applications, Shi, D., 2004

Microarray Technology and Its Applications, Muller, U.R., and Nicolau, D.V. (Eds.), 2004

Emergent Computation: Emphasizing Bioinformatics, Simon, M., 2005

Molecular and Cellular Signaling, Beckerman, M., March 22, 2005

The Physics of Coronary Blood Flow, Zamir, M., May 2005

The Physics of Birdsong, Mindlin, G.B., and Laje, R., August 2005

Radiation Physics for Medical Physicists, Podgorsak, E.B., September 2005

Neutron Scattering in Biology: Techniques and Applications, Fitter, J., Gutberlet, T., and Katsaras, J. (Eds.), January 2006

Forthcoming Titles

Topology in Molecular Biology: DNA and Proteins, Monastyrsky, M.I. (Ed.), 2006

Optical Polarization in Biomedical Applications, Tuchin, V.V., Wang, L. (et al.), 2006

Continued After Index

Ying Xu, Dong Xu, and Jie Liang (Eds.)

Computational Methods for Protein Structure Prediction and Modeling
Volume 2: Structure Prediction

Springer

Ying Xu
Department of Biochemistry and Molecular Biology
University of Georgia
120 Green Street
Athens, GA 30602
USA
email: [email protected]

Dong Xu
Department of Computer Science
Digital Biology Laboratory
University of Missouri-Columbia
201 Engineering Building West
Columbia, MO 65211
USA
email: [email protected]

Jie Liang
Department of Bioengineering
Center for Bioinformatics
University of Illinois at Chicago
851 S. Morgan Street
Chicago, IL 60607-063
USA
email: [email protected]

Library of Congress Control Number: 2006929615

ISBN 10: 0-387-33321-5
ISBN 13: 978-0387-33321-2

Printed on acid-free paper.

© 2007 Springer Science+Business Media, LLC
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

9 8 7 6 5 4 3 2 1

springer.com

Preface

An ultimate goal of modern biology is to understand how the genetic blueprint of cells (genotype) determines the structure, function, and behavior of a living organism (phenotype). At the center of this scientific endeavor is characterizing the biochemical and cellular roles of proteins, the working molecules of the machinery of life. A key to understanding functional proteins is knowledge of their folded structures in a cell, as the structures provide the basis for studying proteins' functions and functional mechanisms at the molecular level.

Researchers working on structure determination have traditionally selected individual proteins for their functional importance in a biological process or pathway of particular interest. Major research organizations often have their own protein X-ray crystallography and/or nuclear magnetic resonance (NMR) facilities for structure determination, which has proceeded at a rate of a few to dozens of structures a year. Realizing the widening gap between the rate of protein identification (through DNA sequencing and identification of potential genes through bioinformatics analysis) and the rate of protein structure determination, government funding agencies in the United States, Europe, and Japan have launched a number of large scientific initiatives in the past few years with the intention of solving protein structures en masse, an effort called structural genomics. A number of structural genomics centers (factory-like facilities) have been established that promise to produce solved protein structures in a fashion similar to DNA sequencing. These efforts, together with the growth of the community and the substantial increase in the ease of structure determination, powered by a new generation of technologies such as synchrotron radiation sources and high-resolution NMR, have accelerated the rate of protein structure determination over the past decade. As of January 2006, the protein structure database PDB contained ~34,500 protein structures.

The role of structure in the biological sciences has grown considerably since the advent of systems biology and the increased emphasis on understanding molecular mechanisms from basic biology to clinical medicine. Just as every geneticist or cell biologist in the 1990s needed to obtain the sequence of the gene whose product or function they were studying, biologists in this century will increasingly need to know the structure of the gene product for their research programs. One can anticipate that the rate of structure determination will continue to grow. However, the large expenses and technical details of structure determination mean that it will remain difficult to obtain experimental structures for more than a small fraction of the proteins of interest to biologists. In contrast, DNA sequence determination has doubled routinely in output for a couple of decades. The genome projects have led to the production of 100 gigabytes of DNA data in GenBank, and as the cost of sequencing continues to drop and the rate continues to accelerate, the scientific community anticipates a day when every individual has the genes of their interest and the genomes of all related major organisms sequenced.

Structure determination of proteins began before nucleic acids could be sequenced, which now appears almost ironic. As microchemistry technologies continued to mature, ever more powerful DNA sequencing instruments, new methods for preparing suitable quantities of DNA, and cheaper, higher-throughput sequencing enabled a revolution in the biological and biomedical sciences but also left structure determination far behind. As sequencing capacity matured in the last few decades of the twentieth century, DNA sequences exceeded protein structures by 10-fold, then 100-fold, and now there is a 1000-fold difference between the number of genes in GenBank and the number of structures in the PDB. The order-of-magnitude difference is about to jump again in the era of metagenomics, as the analyses of communities of largely unculturable organisms in their natural states come to dominate sequence production. The J. Craig Venter Institute's Sargasso Sea experiment and other early metagenomics experiments at least doubled the number of known open reading frames (ORFs) and potential genes, but the more recent ocean voyage data (or GOS) multiplied the number by roughly another 10-fold, probably more. The rate of discovery of novel genes and correspondingly novel proteins has not leveled off, since nearly half of new microbial genomes turn out to be novel. Furthermore, in the metagenomics data, new families of proteins are discovered in direct proportion to the rate of gene (ORF) discovery.

The bottom line is quite simple. Despite the severalfold reduction in the cost of structure determination due to the structural genomics projects (the NIH Protein Structure Initiative and comparable initiatives around the world) and the steady increase in the rate of protein structure determination, the number of proteins with unknown structures will continue to grow vastly faster. At an early structural genomics meeting in Avalon, New Jersey, the experimental community voted in favor of experimentally solving 100,000 structures of proteins with less than 30% sequence identity to proteins with known structures. This seemed to some theoreticians at the time to amount to solving "the protein structure problem" and removing the need for theory, simulation, and prediction. Now, while it appears that this goal is too high for the initiative alone, the structural community will certainly have 100,000 structures in the PDB not long after the end of this decade, and probably sooner than expected as costs continue to go down and technologies continue to advance. Yet those 100,000 structures will be significantly less than 1% of the known ORFs/genes! The problem, therefore, is not about having structures to predict, but about having methods robust enough to make predictions that are useful at deep levels in biology, from helping us infer function and directing experimental efforts to providing insight into ligand binding, molecular recognition, drug discovery, and so on. The kind of success in terms of "reasonable" accuracy for "most" targets has been the grand success of the CASP competition (see Chapter 1) but is completely inadequate for the biology of the twenty-first century and the expectations of both basic and applied life sciences. Prediction is not yet at the requisite level of comprehensive robustness, and therein lies one of the critically important features of the discussions in this book.

Computational methods for predicting protein structure have been actively pursued for some time. Their acceptance and importance grew rapidly after the establishment of a blind competition for predicting protein structure, namely, CASP. CASP involves theoreticians predicting then-unknown protein structures, followed by verification and analysis of the predictions after subsequent experimental determination. The validation of the general approach enhanced funding, brought participants to the field, and pointed to both the limitations of current methods and the value of extensive research into advanced computational tools. Overall, the rapidly growing importance of structural data for biology fueled the emergence of a new branch of computational biology and of structural biology, an interface between the methods of bioinformatics and molecular biophysics, namely, structural bioinformatics. Similar to genomic sequence analysis, bioinformatic studies of protein structures could lead to both deep and broad insights about aspects such as the folding, evolution, and function of proteins, the nature of protein-ligand and protein-protein interactions, and the mechanisms by which proteins act. The success of such studies could have immense impacts not just on science but on society as a whole by providing insight into the molecular etiology of diseases, developing novel, effective therapeutic agents and treatment regimens, and engineering biological molecules for novel or enhanced biochemical functions.

As one of the most active research fields in bioinformatics, structural bioinformatics addresses a wide spectrum of scientific issues, including the computational prediction of protein secondary and tertiary structures, protein docking with small molecules and with macromolecules (i.e., DNA, RNA, and proteins), simulation of dynamic behaviors of proteins, protein structure characterization and classification, and study of structure-function relationships. While proteins were viewed as essentially static three-dimensional structures up until the 1980s, the establishment of computational methods, and subsequent advances in experimental probes that could provide data at suitable time scales, led to a revolution in how biologists think about proteins. Indeed, over the past few decades, computational studies using molecular dynamics simulations of protein structure have played essential roles in understanding the detailed functional mechanisms of proteins important in a wide variety of biological processes. Within the applied life sciences, protein docking has been extensively applied in the drug discovery pipeline in the pharmaceutical and biotech industry.

Protein structure prediction and modeling tools are becoming an integral part of the standard toolkit in biological and biomedical research. Similar to sequence analysis tools, such as BLAST for sequence comparison, the new methods for structure prediction are now among the first approaches used when starting a biological investigation, conducted prior to actual experimental design. That computational analysis would become the first step for experimentalists represents a major paradigm shift that is still occurring but is clearly essential to deal with the maturation of the field, the large quantities of data, and the complexity of biology itself as reflected in the requirement for today's powerful experimental probes used to address sophisticated questions in biology. This paradigm shift was noted first by Wally Gilbert, in a prescient article fifteen years ago ("Toward a new paradigm for molecular biology," Nature 1991, 349:99), who asserted that biologists would have to change their mode of approach to studying nature and begin each experimental project with a bioinformatics analysis of extant literature and other computational approaches. This paradigm shift is deeply interconnected with the increased emphasis on computational tools and the expectation for robust methods for structure prediction.

Similar to other fields of bioinformatics, structural bioinformatics is a rapidly growing science. New computational techniques and new research foci emerge every few months, which makes the writing of textbooks a challenging problem. While a number of books have been published covering various aspects of protein structure prediction and modeling, it is widely recognized that the field lacks a comprehensive and coherent overview of the science of "protein structure prediction and modeling," which spans a range from very basic problems (around physical and chemical properties and principles), such as the potential function and free energies that determine the folded shape of a protein, to the algorithmic techniques for solving various structure prediction problems, to the engineering aspects of implementing computer prediction software, and to applications of prediction capabilities for investigations focused on functional properties. As educators at universities, we feel that there is an urgent need for a well-written, comprehensive textbook, one that proverbially goes from soup to nuts, and that this requirement is most critical for beginners entering this field as young students or as experienced researchers coming from other disciplines.

This book is an attempt to fill this gap by providing systematic expositions of the computational methods for all major aspects of protein structure analysis, prediction, and modeling. We have designed the chapters to address comprehensively the main topics of the field. In addition, chapters have been connected seamlessly through a systematic design of the overall structure of the book. We have selected individual topics carefully so that the book would be useful to a broad readership, including students, postdoctoral fellows, research scientists moving into the field, as well as professional practitioners/bioinformatics experts who want to brush up on topics related to their own research areas. We expect that the book can be used as a textbook for upper undergraduate-level or graduate-level bioinformatics courses. Extensive prior knowledge is not required to read and comprehend the information presented. In other words, a dedicated reader with a college degree in computational, biological, or physical science should be able to follow the book without much difficulty. To facilitate learning and to articulate clearly to the reader what background is needed to obtain the maximum benefit from the book, we have included four appendices describing the prerequisites in (1) biology, (2) computer science, (3) physics and chemistry, and (4) mathematics and statistics. If a reader lacks knowledge in a particular area, he or she could benefit by starting from the references provided in the corresponding appendix.

While the chapters are organized in a logical order, each chapter in the book is a self-contained review of a specific subject. Hence, a reader does not need to read through the chapters sequentially. Each chapter is designed to cover the following material: (1) the problem definition and a historical perspective, (2) a mathematical or computational formulation of the problem, (3) the computational methods and algorithms, (4) the performance results, (5) the existing software packages, (6) the strengths, pitfalls, and challenges in current research, and (7) the most promising future directions. Since this is a rapidly developing field that encompasses an exceptionally wide range of research topics, it is difficult for any individual to write a comprehensive textbook on the entire field. We have been fortunate in assembling a team of experts to write this book. The authors are actively doing research at the forefront of the major areas of the field and bring extensive experience and insight into the central intellectual methods and ideas in each subdomain and its difficulties, accomplishments, and potential for the future.

Chapter 1 (A Historical Perspective and Overview of Protein Structure Prediction) gives a perspective on the methods for the prediction of protein structure and the progress that has been achieved. It also discusses recent advances and the role of protein structure modeling and prediction today, as well as touching briefly on important goals and directions for the future.

Chapter 2 (Empirical Force Fields) addresses the physical force fields used in the atomic modeling of proteins, including bond, bond-angle, dihedral, electrostatic, van der Waals, and solvation energy. Several widely used physical force fields are introduced, including CHARMM, AMBER, and GROMOS.

Chapter 3 (Knowledge-Based Energy Functions for Computational Studies of Proteins) discusses the theoretical framework and methods for developing knowledge-based potential functions essential for protein structure prediction, protein-protein interaction, and protein sequence design. Empirical scoring functions including single-body energy function, statistical method for pairwise interaction between amino acids, and scoring function based on optimization are addressed.

Chapter 4 (Computational Methods for Domain Partitioning of Protein Structures) covers the basic concept of protein structural domains and practical applications. A number of computational techniques for domain partition are described, along with their applications to protein structure prediction. Also described are a few widely used protein domain databases and associated analysis tools.

Chapter 5 (Protein Structure Comparison and Classification) discusses the basic problem of protein structure comparison and applications, and computational approaches for aligning two protein structures. Applications of the structure-structure alignment algorithms to protein structure search against the PDB and to protein structural motif search in the PDB are also discussed.

Chapter 6 (Computation of Protein Geometry and Its Applications: Packing and Function Prediction) treats protein structures as 3D geometrical objects, and discusses structural issues from a geometric point of view, such as (1) the union of ball models, molecular surface, and solvent-accessible surface, (2) geometric constructs such as Voronoi diagram, Delaunay triangulation, alpha shape, surface geometry (including cavities and pockets) and their computation, (3) local surface similarity measure in terms of shape and sequence, and (4) function prediction based on protein surface patterns. Also described are the application issues of these computational techniques.

Chapter 7 (Local Structure Prediction of Proteins) covers protein secondary structure prediction, supersecondary structure prediction, prediction of disordered regions, and applications to tertiary structure prediction. A number of popular prediction software packages are described.

Chapter 8 (Protein Contact Maps Prediction) describes the basic principles for residue contact predictions, and computational approaches for prediction of residue-residue contacts. Also discussed is the relevance to tertiary structure prediction. A number of popular prediction programs are introduced.

Chapter 9 (Modeling Protein Aggregate Assembly and Structure) describes the basic problem of structure misfolding and implications, experimental approach for data collection in support of computational modeling, computational approaches to prediction of misfolded structures, and related applications.

Chapter 10 (Homology-Based Modeling of Protein Structure) presents the foundation for homology modeling, computational methods for sequence-sequence alignment and constructing atomic models, structural model assessment, and manual tuning of homology models. A number of popular modeling packages are introduced.

Chapter 11 (Modeling Protein Structures Based on Density Maps at Intermediate Resolutions) discusses methods for constructing atomic models from density maps of proteins at intermediate resolution, such as those obtained from electron cryomicroscopy. Details of the application of computational tools for identifying α-helices and β-sheets, as well as geometric analysis, are described.

Chapter 12 (Protein Structure Prediction by Protein Threading) describes the threading approach for predicting protein structure. It discusses the basic concepts of protein folds, an empirical energy function, and optimal methods for fitting a protein sequence to a structural template, including the divide-and-conquer, the integer programming, and tree-decomposition approaches. This chapter also gives practical guidance, along with a list of resources, on using threading for structure prediction.

Chapter 13 (De Novo Protein Structure Prediction) describes protein folding and free energy minimization, lattice model and search algorithms, off-lattice model and search algorithms, and mini-threading. Benchmark performance of various tools in CASP is described.

Chapter 14 (Structure Prediction of Membrane Proteins) covers the methods for prediction of secondary structure and topology of membrane proteins, as well as prediction of their tertiary structure. A list of useful resources for membrane protein structure prediction is also provided.

Chapter 15 (Structure Prediction of Protein Complexes) describes computational issues for docking, including protein-protein docking (both rigid body and flexible docking), protein-DNA docking, and protein-ligand docking. It covers computational representation for biomolecular surface, various docking algorithms, clustering docking results, scoring function for ranking docking results, and state-of-the-art benchmarks.

Chapter 16 (Structure-Based Drug Design) describes computational issues for rational drug design based on protein structures, including protein therapeutics based on cytokines, antibodies, and engineered enzymes, docking in structure-based drug design as a virtual screening tool in lead discovery and optimization, and ligand-based drug design using pharmacophore modeling and quantitative structure-activity relationship. A number of software packages for structure-based design are compared.

Chapter 17 (Protein Structure Prediction as a Systems Problem) provides a novel systematic view on solving the complex problem of protein structure prediction. It introduces consensus-based approach, pipeline approach, and expert system for predicting protein structure and for inferring protein functions. This chapter also discusses issues such as benchmark data and evaluation metrics. An example of protein structure prediction at genome-wide scale is also given.

Chapter 18 (Resources and Infrastructure for Structural Bioinformatics) describes tools, databases, and other resources of protein structure analysis and prediction available on the Internet. These include the PDB and related databases and servers, structural visualization tools, protein sequence and function databases, as well as resources for RNA structure modeling and prediction. It also gives information on major journals, professional societies, and conferences of the field.

Appendix 1 (Biological and Chemical Basics Related to Protein Structures) introduces the central dogma of molecular biology, macromolecules in the cell (DNA, RNA, protein), amino acid residues, peptide chain, primary, secondary, tertiary, and quaternary structure of proteins, and protein evolution.

Appendix 2 (Computer Science for Structural Informatics) discusses computer science concepts that are essential for effective computation for protein structure prediction. These include efficient data structure, computational complexity and NP-hardness, various algorithmic techniques, parallel computing, and programming.

Appendix 3 (Physical and Chemical Basis for Structural Bioinformatics) covers basic concepts of our physical world, including unit system, coordinate systems, and energy surfaces. It also describes biochemical and biophysical concepts such as chemical reaction, peptide bonds, covalent bonds, hydrogen bonds, electrostatic interactions, van der Waals interactions, as well as hydrophobic interactions. In addition, this appendix discusses basic concepts from thermodynamics and statistical mechanics. Computational sampling techniques such as molecular dynamics and Monte Carlo method are also discussed.

Appendix 4 (Mathematics and Statistics for Studying Protein Structures) covers various basic concepts in mathematics and statistics often used in structural bioinformatics studies, such as probability distributions (uniform, Gaussian, binomial and multinomial, Dirichlet and gamma, extreme value distribution), basics of information theory including entropy, relative entropy, and mutual information, Markovian process and hidden Markov model, hypothesis testing, statistical inference (maximum likelihood, expectation maximization, and Bayesian approach), and statistical sampling (rejection sampling, Gibbs sampling, and Metropolis-Hastings algorithm).

Ying Xu
Dong Xu
Jie Liang

John Wooley

April 2006

Acknowledgments

During the editing of this book, we, the editors, have received tremendous help from many friends, colleagues, and families, to whom we would like to take this opportunity to express our deep gratitude and appreciation. First we would like to thank Dr. Eli Greenbaum of Oak Ridge National Laboratory, who encouraged us to start this book project and contacted the publisher at Springer on our behalf. We are very grateful to the following colleagues who have critically reviewed the drafts of the chapters of the book at various stages: Nick Alexandrov, Nir Ben-Tal, Natasja Brooijmans, Chris Bystroff, Pablo Chacon, Luonan Chen, Zhong Chen, Yong Duan, Roland Dunbrack, Daniel Fischer, Juntao Guo, Jaap Heringa, Xiche Hu, Ana Kitazono, Ioan Kosztin, Sandeep Kumar, Xiang Li, Guohui Lin, Zhijie Liu, Hui Lu, Alex Mackerell, Kunbin Qu, Robert C. Rizzo, Ilya Shindyalov, Ambuj Singh, Alex Tropsha, Iosif Vaisman, Ilya Vakser, Stella Veretnik, Bjorn Wallner, Jin Wang, Zhexin Xiang, Yang Dai, Xin Yuan, and Yaoqi Zhou. Their invaluable input on the scientific content, on the pedagogical style, and on the writing style helped to improve these book chapters significantly. We also want to thank Ms. Joan Yantko of the University of Georgia for her tireless help on numerous fronts in this book project, including taking care of a large number of email communications between the editors and the authors and chasing busy authors to get their revisions and other material. Last but not least, we want to thank our families for their constant support and encouragement while we worked on this book project.

Contents

Contributors

12 Protein Structure Prediction by Protein Threading
Ying Xu, Zhijie Liu, Liming Cai, and Dong Xu

13 De Novo Protein Structure Prediction
Ling-Hong Hung, Shing-Chung Ngan, and Ram Samudrala

14 Structure Prediction of Membrane Proteins
Xiche Hu

15 Structure Prediction of Protein Complexes
Brian Pierce and Zhiping Weng

16 Structure-Based Drug Design
Kunbin Qu and Natasja Brooijmans

17 Protein Structure Prediction as a Systems Problem
Dong Xu and Ying Xu

18 Resources and Infrastructure for Structural Bioinformatics
Dong Xu, Jie Liang, and Ying Xu

Appendix 1 Biological and Chemical Basics Related to Protein Structures
Hong Guo and Haobo Guo

Appendix 2 Computer Science for Structural Informatics
Guohui Lin, Dong Xu, and Ying Xu

Appendix 3 Physical and Chemical Basis for Structural Bioinformatics
Hui Lu, Ognjen Perisic, and Dong Xu

Appendix 4 Mathematics and Statistics for Studying Protein Structures
Yang Dai and Jie Liang

Index


Contributors

Natasja Brooijmans

Chemical and Screening Sciences

Wyeth Research

Pearl River, New York 10965

Christopher Bystroff

Department of Biology

Rensselaer Polytechnic Institute

Troy, New York 12180

Liming Cai

Department of Computer Science

University of Georgia

Athens, Georgia 30602-7404

Orhan Camoglu

Department of Computer Science

University of California Santa Barbara

Santa Barbara, California 93106

Yang Dai

Department of Bioengineering

University of Illinois at Chicago

Chicago, Illinois 60607-7052

Haobo Guo

Department of Biochemistry and

Cellular and Molecular Biology

University of Tennessee

Knoxville, Tennessee 37996

Hong Guo

Department of Biochemistry and

Cellular and Molecular Biology

University of Tennessee

Knoxville, Tennessee 37996

Jun-tao Guo

Department of Biochemistry and

Molecular Biology

University of Georgia

Athens, Georgia 30602-7229

Carol K. Hall

Department of Chemical and

Biomolecular Engineering

North Carolina State University

Raleigh, North Carolina 27695

Jaap Heringa

Centre for Integrative Bioinformatics

Vrije Universiteit

1081 HV Amsterdam, The Netherlands

Xiche Hu

Department of Chemistry

University of Toledo

Toledo, Ohio 43606

Ling-Hong Hung

Department of Microbiology

University of Washington

Seattle, Washington 98195-7242

Xiang Li

Department of Bioengineering

University of Illinois at Chicago

Chicago, Illinois 60607-7052

Jie Liang

Department of Bioengineering

University of Illinois at Chicago

Chicago, Illinois 60607-7052

Guohui Lin

Department of Computing Science

University of Alberta

Edmonton, Alberta T6G 2E8, Canada

Zhijie Liu

Department of Biochemistry and

Molecular Biology

University of Georgia

Athens, Georgia 30602-7229


Hui Lu

Department of Bioengineering

University of Illinois at Chicago

Chicago, Illinois 60607-7052

Jianpeng Ma

Department of Biochemistry and

Molecular Biology

Baylor College of Medicine

Houston, Texas 77030

and

Department of Bioengineering

Rice University

Houston, Texas 77005

Alexander D. MacKerell, Jr.

Department of Pharmaceutical

Chemistry

School of Pharmacy

University of Maryland

Baltimore, Maryland 21201

Shing-Chung Ngan

Department of Microbiology

University of Washington

Seattle, Washington 98195-7242

Ognjen Perisic

Department of Bioengineering

University of Illinois at Chicago

Chicago, Illinois 60607-7052


Brian Pierce

Department of Biomedical

Engineering

Boston University

Boston, Massachusetts 02215

Kunbin Qu

Department of Chemistry

Rigel Pharmaceuticals, Inc.

San Francisco, California 94080

Ram Samudrala

Department of Microbiology

University of Washington

Seattle, Washington 98195-7242

Ilya Shindyalov

San Diego Supercomputer Center

University of California San Diego

San Diego, California 92093-0505

Victor A. Simossis

Centre for Integrative Bioinformatics

Vrije Universiteit

1081 HV Amsterdam, The Netherlands

Ambuj K. Singh

Department of Computer Science

University of California Santa Barbara

Santa Barbara, California 93106

Stella Veretnik

San Diego Supercomputer Center

University of California San Diego

San Diego, California 92093-0505

Zhiping Weng

Department of Biomedical

Engineering

Boston University

Boston, Massachusetts 02215

Ronald B. Wetzel

Department of Structural Biology

Pittsburgh Institute for

Neurodegenerative Diseases

University of Pittsburgh School of

Medicine

Pittsburgh, Pennsylvania 15260

John C. Wooley

Associate Vice Chancellor for

Research

University of California San Diego

San Diego, California 92093-0043

Zhexin Xiang

Center for Molecular Modeling

Center for Information Technology

National Institutes of Health

Bethesda, Maryland 20892-5624


Dong Xu

Computer Science Department

University of Missouri-Columbia

Columbia, Missouri 65211-2060

Ying Xu

Institute of Bioinformatics and

Department of Biochemistry

and Molecular Biology

University of Georgia

Athens, Georgia 30602-7229


Yuzhen Ye

Bioinformatics and Systems Biology

Department

The Burnham Institute for Medical

Research

La Jolla, California 92037

Xin Yuan

Department of Computer Science

Florida State University

Tallahassee, Florida 32306


12 Protein Structure Prediction by Protein Threading

Ying Xu, Zhijie Liu, Liming Cai, and Dong Xu

12.1 Introduction

The seminal work of Bowie, Lüthy, and Eisenberg (Bowie et al., 1991) on "the inverse protein folding problem" laid the foundation of protein structure prediction by protein threading. By using simple measures for fitness of different amino acid types to local structural environments defined in terms of solvent accessibility and protein secondary structure, the authors derived a simple and yet profoundly novel approach to assessing if a protein sequence fits well with a given protein structural fold. Their follow-up work (Elofsson et al., 1996; Fischer and Eisenberg, 1996; Fischer et al., 1996a,b) and the work by Jones, Taylor, and Thornton (Jones et al., 1992) on protein fold recognition led to the development of a new brand of powerful tools for protein structure prediction, which we now term "protein threading." These computational tools have played a key role in extending the utility of all the experimentally solved structures by X-ray crystallography and nuclear magnetic resonance (NMR), providing structural models and functional predictions for many of the proteins encoded in the hundreds of genomes that have been sequenced up to now.

What has made protein threading particularly attractive as a protein structure prediction tool is the observation that the number of unique structural folds in nature is a few orders of magnitude smaller than the number of proteins in nature (Finkelstein and Ptitsyn, 1987; Bairoch et al., 2005). Although this is still not a fully resolved issue, both theoretical and statistical studies (Murzin et al., 1995; Brenner et al., 1996; Li et al., 1996, 1998, 2002; Wang, 1996; Orengo et al., 1997; Holm and Sander, 1996a; Zhang and DeLisi, 1998) suggest that the number of unique structural folds in nature ranges from a few hundred to a few thousand. Clearly this is a significantly smaller number than the number of proteins in nature: as we understand now, the number of different living organisms on earth could range from millions to hundreds of millions (May, 1988). Since each organism often has at least thousands of different proteins encoded in its genome, the total number of different proteins in nature is at least in the tens of billions or possibly significantly higher, even without considering protein variants such as alternatively spliced proteins. This disparity suggests an effective paradigm for possibly solving all protein structures through combining experimental and computational approaches, that is, to solve structures of proteins with unique structural folds using the expensive and time-consuming experimental techniques and computationally model the rest of the proteins using the experimental structures as templates. This is the key strategy currently being employed by the worldwide Structural Genomics efforts (Gaasterland, 1998; Skolnick et al., 2000; Baker and Sali, 2001).

The basic idea of protein threading is to place (align or thread) the amino acids of a query protein sequence, following their sequential order and allowing gaps, into structural positions of a template structure in an optimal way measured by fitness scores outlined above. This procedure will be repeated against a collection of previously solved protein structures for a given query protein. These sequence-structure alignments, i.e., the query sequence against different template structures, will be assessed using statistical or energetic measures for the overall likelihood of the query protein adopting each of the structural folds. The "best" sequence-structure alignment provides a prediction of the backbone atoms of the query protein, based on their placements in the template structure. Currently, protein threading is being widely used in molecular biology and biochemistry labs, often for initial studies of target proteins, as it may quickly provide structural and functional information, which could be used to guide further experimental design and investigation.
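The placement step described above can be sketched as a small dynamic program. The sketch below is only an illustration under simplifying assumptions, not the algorithm of any particular threading program: it scores a placement with a per-residue fitness table alone, charges a fixed penalty for every gap, and ignores interactions between residues. The function name `thread`, the toy score table, the environment labels `b` and `e`, and the gap penalty of 1.0 are all hypothetical.

```python
# Hypothetical toy fitness table: (amino acid, environment) -> energy.
# Lower (more negative) energy means a better fit.
TOY_SCORES = {
    ("A", "b"): -1.0, ("A", "e"): 1.0,
    ("K", "b"): 2.0, ("K", "e"): -1.0,
}


def thread(query, template_envs, scores, gap_penalty=1.0):
    """Place query residues, in order, onto template positions.

    Returns (total_energy, alignment), where alignment is a list of
    (query_index, template_index) pairs and None marks a gap.
    """
    n, m = len(query), len(template_envs)
    # best[i][j]: lowest energy for query[:i] against template_envs[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0], move[i][0] = i * gap_penalty, "skip_residue"
    for j in range(1, m + 1):
        best[0][j], move[0][j] = j * gap_penalty, "skip_position"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            fit = scores[(query[i - 1], template_envs[j - 1])]
            options = [
                (best[i - 1][j - 1] + fit, "place"),
                (best[i - 1][j] + gap_penalty, "skip_residue"),
                (best[i][j - 1] + gap_penalty, "skip_position"),
            ]
            best[i][j], move[i][j] = min(options)
    # Trace back from the end to recover the placement.
    alignment, i, j = [], n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "place":
            alignment.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "skip_residue":
            alignment.append((i - 1, None))
            i -= 1
        else:
            alignment.append((None, j - 1))
            j -= 1
    alignment.reverse()
    return best[n][m], alignment


if __name__ == "__main__":
    print(thread("AK", ["b", "e"], TOY_SCORES))
```

Running the same alignment against every template in a collection and comparing the resulting energies would correspond to the fold-recognition step; once interactions between residues are included in the score, this simple recurrence no longer applies directly.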

As a prediction technique, protein threading has a number of highly challenging computational and modeling problems. These include (a) how to effectively and accurately measure the fitness of a sequence placed in a template structure, (b) how to accurately and efficiently find the best alignment between a query sequence and a template structure based on a given set of fitness measures, (c) how to assess which sequence-structure alignment among the ones against different template structures represents a correct fold recognition and an accurate (backbone) structure prediction, and (d) how to identify which parts of a predicted structure are accurate and which parts are not. As researchers find more effective solutions to these and other challenging problems, we expect that protein threading will play an increasingly significant role in structural and functional studies of proteins.

12.2 Protein Domains, Structural Folds, and Structure Space

As of now, over a million protein sequences have been determined (Bairoch et al., 2005), among which ~30,000 have had their tertiary structures experimentally solved (Dutta and Berman, 2005). Given that there could be at least tens of billions of different proteins in nature as discussed above, one interesting question, particularly relevant to the idea of protein threading, is how many unique protein structures or structural folds these proteins might have adopted.

To answer this question, we need to first look at the basic structural units of proteins, called protein domains (Wetlaufer, 1973; Richardson, 1981). Protein domains are discussed extensively in Chapter 4 of this book. Here we describe briefly the concept of a domain from the perspective of threading. A structural domain is a distinct and compact structural unit that could fold independently of other domains.


While many proteins are single-domain proteins, there are proteins with two, three, or even more structural domains (Ekman et al., 2005). Our study shows that in the FSSP (Fold classification based on Structure-Structure alignment of Proteins) nonredundant set (Holm and Sander, 1996b) of the PDB, 67% of the proteins have single domains, 21% have two domains, 7% have three domains, and the remaining 5% have four or more domains (Xu et al., 2000a). This distribution may not necessarily reflect the actual domain distribution for all the proteins in nature, as the set of known protein structures in PDB might have overrepresented small proteins due to the relative ease in solving these structures. For eukaryotic organisms, it was estimated that at least two-thirds of their proteins are multidomain proteins (Gerstein and Hegyi, 1998; Gerstein, 1997, 1998; Doolittle, 1995; Apic et al., 2001a,b). This number is somewhat smaller for bacterial and archaeal organisms but still represents a significant percentage of all their proteins. Previously, domain partition of multidomain proteins was typically done manually. To keep up with the rate at which protein structures are being solved, there is a clear need for more automated domain-partitioning methods to process the newly solved structures. Currently, computer programs are being used for partitioning a protein structure into individual domains. Popular programs for this purpose include DALI (Dietmann and Holm, 2001), DomainParser (Xu et al., 2000a), and PDP (Alexandrov and Shindyalov, 2003).

A protein domain could be part of different protein structures, through combining with other domains. Figure 12.1 shows an example of a domain in different proteins. Both proteins, RNA 3'-terminal phosphate cyclase and glutathione S-transferase, have the thioredoxin fold domain, which has two layers, one with two α-helices and one with four antiparallel β-strands, although some details differ in the two proteins. The parts other than the thioredoxin fold domain in the two proteins have no structural relationship. Since domains are the basic structural units of proteins, current studies on the number of unique structural folds in nature have been carried out on protein domains rather than whole proteins (hereafter, the term "proteins" will refer to single-domain proteins for simplicity of discussion).

To estimate the number of unique folds of proteins, one popular approach is through examining all protein families and the relationships between protein families and unique structural folds. Using the definition of the SCOP classification scheme (Murzin et al., 1995; Brenner et al., 1996), a protein family represents a group of orthologous proteins (Makarova et al., 1999; Gerlt and Babbitt, 2000; Tatusov et al., 2000; Gelfand et al., 2000). The number of protein families in nature could possibly be estimated through finding orthologous gene groups covering all the genes in the genomes that have been sequenced. One such estimate suggests that there are 23,100 such protein families (Orengo et al., 1994). This number has been used in later studies on estimating the number of unique structural folds in nature. Other studies estimate this number to be in the tens of thousands (Koonin et al., 2002). Whether it is 23,100 or some other number ranging from 10,000 to 50,000, the idea is that all proteins in nature fall into one of these families. The estimation on the number of unique structural folds is obtained through estimating the number of families that each structural fold covers and studying the coverage distribution by all the known structural folds.

Fig. 12.1 An example of the same domain (thioredoxin fold domain shown in red, thick ribbons) appearing in different protein structures. (a) RNA 3'-terminal phosphate cyclase (PDB code 1qmi, chain A), with the thioredoxin fold domain in residues 185 to 279. (b) Glutathione S-transferase (PDB code 1k0d, chain A), with the thioredoxin fold domain in residues 109 to 200.

One of the estimates by Zhang and DeLisi (1998) suggests that the number N of unique structural folds is probably around 700. This estimation is based on the observation that the number of structural folds covering x protein families follows a power-law distribution, with x being a variable. In essence it says that a few structural folds each cover many families (e.g., the TIM barrel fold covers 31 protein families) while many structural folds each cover only a small number of families; more generally, the number of structural folds decreases as their coverage of protein families increases. Specifically, Zhang and DeLisi proposed a formula which matches well with the known protein families and structural folds. Let M and N represent the number of families and the number of unique folds in nature, respectively. The probability that a fold covers exactly x families is given by

$$P_x = \left(1 - \frac{N}{M}\right)^{x-1} \frac{N}{M}.$$

Let $M_s$ and $N_s$ be the numbers of protein families and unique structural folds currently having solved structures, respectively. Through a simple algebraic transformation, Zhang and DeLisi showed that

which is used to estimate the number of unique structural folds. Using the known numbers of $M_s = 736$ and $N_s = 361$ at the time of the estimation, they estimated that N is roughly 700, which many researchers will argue to be too small (see following for more discussion).
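To make the estimate concrete, here is a minimal Python sketch of one way to connect $M_s$, $N_s$, M, and N. It is not taken from Zhang and DeLisi's paper and may differ from the relation they used. It assumes that a fold covers exactly x families with the geometric probability $(1 - N/M)^{x-1} N/M$, that each family has a solved structure independently with probability $M_s/M$, and that M = 23,100 as in the family estimate quoted earlier. Under those assumptions the expected number of folds with at least one solved family is $N_s = N M_s / (N + M_s - N M_s / M)$, which can be solved for N. The function names are hypothetical.

```python
def expected_solved_folds(n_folds, n_families, n_solved_families):
    """Expected number of folds with at least one solved family.

    Assumes a geometric fold-size distribution with parameter
    p = n_folds / n_families and independent sampling of families
    with probability s = n_solved_families / n_families.
    """
    p = n_folds / n_families
    s = n_solved_families / n_families
    # Sum over x of p(1-p)^(x-1) * (1 - (1-s)^x) equals s / (p + s - p*s).
    return n_folds * s / (p + s - p * s)


def estimate_total_folds(n_families, n_solved_families, n_solved_folds):
    """Solve the relation above for the total number of folds N."""
    m, ms, ns = n_families, n_solved_families, n_solved_folds
    return ns * ms / (ms - ns + ns * ms / m)


if __name__ == "__main__":
    # M = 23,100 is an assumed input; 736 and 361 are the values in the text.
    print(estimate_total_folds(23100, 736, 361))
```

With these inputs the sketch gives a value a little under 700, in line with the figure quoted in the text. Changing the assumed M moves the result only modestly, because the $N_s M_s / M$ term is small next to $M_s - N_s$.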

Similar estimations have been made by other researchers, estimating the size of N ranging from a few hundred to a few thousand (Orengo et al., 1994; Wang, 1996), based on somewhat different assumptions. Coulson and Moult (2002) recently developed a new model for estimating N, based on the work of Zhang and DeLisi. Using the more recent data on the numbers of genes, gene families, and structural folds, they argued that there are two "special" cases that have not been treated well by previous estimation models. Based on their argument, they consider that there are three classes of structural folds, which are termed unifolds, mesofolds, and superfolds. Unifolds represent structural folds that each cover only one family of proteins, superfolds represent structural folds that each cover many families, and mesofolds represent structural folds in between. For example, the TIM barrel covers 31 families, while many unifolds exist in SCOP. Based on their observation, they argued that previous models such as that of Zhang and DeLisi did not fit well with the data of unifolds and superfolds. So a new piecewise model was then developed, which treats unifolds, mesofolds, and superfolds separately.

Using this new model, Coulson and Moult (2002) estimated that less than 20% of the protein families belong to unifolds, while 20% of the families belong to a few dozen superfolds and the rest of the protein families belong to mesofolds. Considering that the estimated number of protein families ranges from 10,000 to 50,000 (or 23,100 as one of the popular estimates suggests), we can infer that the number of unifolds ranges from 2,000 to 10,000. The number of mesofolds could be estimated using the Zhang and DeLisi model, based on the sizes of $M_s$ and $N_s$, after excluding the unifolds and superfolds. Hence, Coulson and Moult concluded that the most probable size for the number of mesofolds is about 400. The number of superfolds is believed to be very small, possibly in the range of low dozens. Overall this model suggests that over 80% of the protein families fold into a little over 400 structural folds, the majority of which are already known, while the rest of the protein families each belong to a unique unifold.
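The arithmetic behind the unifold range can be checked directly. The short sketch below is only a worked example: it applies the 20% unifold share, treated as an upper bound, to each quoted estimate of the number of families, and relies on each unifold covering exactly one family. The function name is hypothetical.

```python
def max_unifold_families(n_families, unifold_percent=20):
    """Upper bound on the number of families (and hence unifolds),
    computed with integer arithmetic to avoid rounding noise."""
    return n_families * unifold_percent // 100


if __name__ == "__main__":
    for n_families in (10_000, 23_100, 50_000):
        print(n_families, max_unifold_families(n_families))
```

The two ends of the family range give the 2,000 and 10,000 quoted in the text, and the 23,100 estimate falls between them.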

The implication of this estimation is that about 80% of the protein families are amenable to structural modeling using protein threading techniques, assuming that at least one protein in each of the meso- and superfolds has its structure solved. If experimental facilities for structure solution strategically select their solution targets to maximally cover all the meso- and superfolds, we could expect that at least 80% of the protein families will be structurally modelable in the near future. This is exactly the strategy that the National Institutes of Health (NIH) is using in its Protein Structure Initiative (http://www.nigms.nih.gov/psi/). For the remaining 20% of protein families, it might take some time to have at least one solved structure in each of the unifolds. Hence, the threading technique will be less applicable for this class of proteins, at least in the near future.

There are a number of popular schemes and associated databases for classification of proteins into structural folds, including SCOP (Murzin et al., 1995), CATH (Orengo et al., 1997), and FSSP (Holm and Sander, 1996b). These classification schemes classify all solved protein structures into different structural folds and subclasses of structural folds. The classification of protein structures is essentially achieved through grouping protein structures into clusters of similar structures, which can be done computationally through structure-structure alignments (Holm and Sander, 1996a).

SCOP (Murzin et al., 1995; Brenner et al., 1996; Andreeva et al., 2004) groups all protein structures essentially into a three-level classification tree. At the top level, SCOP (SCOP 1.65) currently consists of about 800 structural folds, each of which is further divided into superfamilies and then into families. While a family represents a group of orthologous proteins, a superfamily represents a group of homologous proteins, possibly made of multiple families. Currently SCOP consists of about 1300 superfamilies and about 2400 families. Among the 800 structural folds, 489 have only one family each, which might represent unifolds; and 9 cover a large number of families each, which are considered as superfolds by Coulson and Moult. One thing worth noting is that among the 800 SCOP folds, only 36 represent membrane proteins. This is a reflection of the fact that only 2% of all the solved protein structures are membrane proteins (http://blanco.biomol.uci.edu/MembraneYroteins...xtal.html). This suggests that threading is generally not applicable to structure prediction of membrane proteins, at the present time.

SCOP's hierarchical classification of structural folds provides a convenient tool for applications of threading methods, as query proteins falling into a SCOP protein family are generally expected to have accurate structure predictions, while proteins with structural homologues in a SCOP superfamily will have a good chance to have the correct structural folds identified and some portions of their backbone structures predicted accurately. In general, it still represents a challenge to correctly identify the structural fold by a threading method if a query protein only has a structural analogue (i.e., similar structure but not homologous) in SCOP.


12.3 Fitting a Protein Sequence onto a Protein Structure


The realization that protein structures are clustered into structural folds in the structure space and the number of such clusters is possibly quite small has led to a new way of predicting protein structures in a more efficient and effective manner. The general belief is that different proteins fold into similar 3D shapes because at some level, they share similar interaction patterns among their residues and between the residues and the environments. It has been shown that these interaction patterns could possibly be captured using simple statistics-based energy models as exemplified by the earlier work of Eisenberg and colleagues (Bowie et al., 1991; Fischer et al., 1996a,b; Fischer and Eisenberg, 1996), the work of Sippl and colleagues (Sippl, 1990), and others (Jones et al., 1992; Rost et al., 1997). These simple statistics-based energy functions have been used, for many cases, to distinguish the correct structural folds from the incorrect ones and to distinguish the correct placements of the residues in a query protein into the structural positions of a correct structural template from the incorrect ones. Placing the (backbone atoms of) residues of a query protein into the correct structural positions in a correct structural fold gives a prediction of the backbone structure of the query protein. To accomplish this, one would need two capabilities: (a) an energy function whose global minimum will correspond to the correct placement of residues into the correct structural template, and (b) a computational algorithm that can find the global minimum of the given energy function. We explain the basic idea of developing such energy functions in this section and leave the algorithmic issues to the next one.

In their earlier work (Bowie et al., 1991; Fischer et al., 1996a,b; Fischer and Eisenberg, 1996), Eisenberg and colleagues demonstrated that simple residue-based, instead of atom-based, energy functions could provide substantial discriminating power in separating good from poor placements of individual residue types into different structural environments, justifying the usage of residue-based energy functions. In their work, structural environments are simply defined in terms of two parameters, solvent accessibility sol and secondary structure ss. Specifically, the quantity sol of solvent accessibility is discretized into a number of intervals, say 30-40% exposed to the solvent. A secondary structure could be a helix, a beta-strand, or a loop, or it could be defined in terms of more refined categories of secondary structures, say including different types of turns. Then a structural environment for each residue in a template structure could be defined using (sol, ss), say (0-10% exposed, alpha-helix). Statistics could be collected from a collection of solved protein structures about how frequently a particular type of amino acid appears in a particular structural environment as we just defined. This can be done by going through all protein structures under consideration to count the number of occurrences of each amino acid type in each encountered structural environment. If we consider, say, three levels of solvent accessibility, {exposed, intermediately exposed, buried}, and three types of secondary structures, we will have nine types of structural environment. Under this assumption, the result of counting the numbers of occurrences above will be a 20 by 9 table, with each of the 20 rows representing an amino acid type and each of the 9 columns representing a structural environment.

Based on the collected statistics, we can build a simple preference model to measure how preferred a particular amino acid type is to a particular structural environment. This can be done using the following measure:

$$-\ln\left(O_{i,j} / E_{i,j}\right),$$

where $O_{i,j}$ represents the observed frequency of amino acid type i in structural environment j and $E_{i,j}$ represents the expected frequency of amino acid type i in structural environment j. In the work of Eisenberg and colleagues, $E_{i,j}$ is estimated using the frequency of amino acid type i in all proteins under consideration. Hence, if an amino acid type i has a higher frequency in a particular structural environment j than its frequency overall, it will be assigned a negative score $-\ln(O_{i,j}/E_{i,j})$; otherwise it will get a positive score or zero (when $O_{i,j} = E_{i,j}$). The higher $O_{i,j}$ is compared to $E_{i,j}$, the more negative the corresponding energy is. A popular name for this type of energy function is singleton energy. By performing statistical analysis on a database, one can get the scoring function using the above formulation for the 20 amino acid types appearing in the nine structural environments. Table 12.1 shows such a scoring function, as described in Xu et al. (1998).
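The counting procedure and the preference measure above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to produce any published table. It assumes that the observed frequency is the fraction of residues in an environment that are of a given type, that the expected frequency is that type's overall fraction, and that the input is a flat list of (amino acid, environment) observations from a nonredundant set of structures. The function name and the `pseudocount` parameter (added so that empty cells do not produce a logarithm of zero) are hypothetical.

```python
import math
from collections import Counter


def singleton_energies(observations, pseudocount=1.0):
    """Compute -ln(O/E) for every (amino acid, environment) pair.

    observations: iterable of (amino_acid, environment) tuples.
    pseudocount: hypothetical smoothing term; it must be positive
    if any (amino acid, environment) cell has no observations.
    """
    pair_counts = Counter(observations)
    aa_counts, env_counts = Counter(), Counter()
    for (aa, env), count in pair_counts.items():
        aa_counts[aa] += count
        env_counts[env] += count
    total = sum(aa_counts.values())

    energies = {}
    for aa in aa_counts:
        expected = aa_counts[aa] / total  # overall frequency of this type
        for env in env_counts:
            # Smoothed fraction of residues in env that are of type aa.
            observed = (pair_counts[(aa, env)] + pseudocount) / (
                env_counts[env] + pseudocount * len(aa_counts)
            )
            energies[(aa, env)] = -math.log(observed / expected)
    return energies


if __name__ == "__main__":
    toy = [("A", "buried")] * 3 + [("A", "exposed")] \
        + [("K", "buried")] + [("K", "exposed")] * 3
    for key, value in sorted(singleton_energies(toy, pseudocount=0.0).items()):
        print(key, round(value, 3))
```

In the toy data, alanine is overrepresented in the buried environment and so receives a negative energy there, while lysine receives a positive one.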

Table 12.1 The scoring function for the 20 amino acid types in three secondary structure types with three solvent accessibility levels. The three solvent accessibility types are buried, intermediate, and exposed, from the left column to the right for each secondary structure type.

Residue            Helix                   Sheet                   Loop
          buried interm. exposed  buried interm. exposed  buried interm. exposed
A         -0.741  -0.007  -0.181  -0.430   0.642   1.367   0.223   0.235   0.070
R          1.010  -0.687  -0.344   0.995  -0.479   0.340   1.406  -0.199  -0.190
N          0.840   0.221   0.042   0.483   0.344   0.743   0.656  -0.247  -0.795
D          1.113   0.243  -0.376   0.907   0.250   0.853   0.850  -0.228  -0.783
C         -0.389   0.539   2.179  -1.208   0.492   2.651  -0.489   0.119   1.249
Q          0.587  -0.428  -0.573   0.618   0.058   0.221   1.220   0.067  -0.449
E          1.074  -0.442  -0.842   1.168  -0.029   0.282   1.773   0.218  -0.527
G          0.317   1.234   1.010  -0.161   0.714   1.167  -0.116   0.040  -0.849
H          0.335  -0.217   0.199  -0.127  -0.265   0.848   0.519  -0.387  -0.117
I         -0.686  -0.001   1.198   1.316  -0.156   1.295   0.182   0.408   0.996
L         -0.902  -0.178   0.987   0.930   0.256   1.672   0.009   0.229   0.977
K          2.021  -0.212   0.743   1.818  -0.209  -0.008   2.611   0.107   0.696
M         -0.858   0.378   0.872  -0.825   0.230   1.215   0.208   0.242   0.692
F         -0.585   0.089   1.146  -1.040  -0.168   1.085  -0.060   0.079   0.911
P          1.126   0.698   0.723   0.845   0.704   1.287  -0.041  -0.621  -0.646
S          0.328   0.454   0.111  -0.079   0.088   0.445   0.373  -0.073  -0.612
T          0.302   0.247   0.341  -0.211  -0.419   0.126   0.350  -0.014  -0.377
W         -0.468  -0.270   1.493  -0.816  -0.307   1.843  -0.082   0.041   0.797
Y         -0.149  -0.285   0.991  -0.690  -0.704   0.919   0.461  -0.292   0.723
V         -0.491   0.251   1.053  -1.352  -0.324   0.983   0.196   0.382   0.737


When building such statistics-based energy functions, one needs to be careful in selecting the data set for statistics collection. For example, some proteins in SCOP (or in PDB) have more homologous structures than the others, which could possibly lead to biased statistics. To remove this type of statistical bias, one needs to remove homologous structures from the data set used for statistics collection. There are a number of databases for this purpose, such as the nonredundant sequence representatives in FSSP, PDB-select (Hobohm et al., 1992), and PISCES (Wang and Dunbrack, 2003), where no two proteins have higher than a certain level of sequence similarity.

Another statistics-based energy function widely used in threading programs is often called the pairwise interaction energy. It measures the preference for having two particular types of amino acids spatially close to each other. One particular form of such an energy function was developed by Sippl (1990). The basic idea of this energy function is to compare the observed frequency of a pair of amino acids within a certain distance in solved protein structures with the expected frequency of this pair of amino acid types in a protein structure. The idea comes from statistical mechanics, where the probability $P_{ij}$ of having a pairwise interaction between residues $i$ and $j$ is related to its energy $g^{*}_{ij}$ (Gibbs free energy) through the Boltzmann relation, defined as

$$P_{ij} = \exp\left(-\frac{g^{*}_{ij}}{kT}\right) \Big/ Z,$$

where $k$ is the Boltzmann constant, $T$ is the temperature, and $Z$ is a partition function. When using a residue-averaged state as a reference state, with energy $\bar{g}$ and probability $\bar{P}$, a knowledge-based potential can be derived using

$$g_{ij} = g^{*}_{ij} - \bar{g} = -kT \ln\left(\frac{P_{ij}}{\bar{P}}\right).$$

Specifically, if $N_O(i,j)$ and $N_E(i,j)$ represent the observed and expected numbers of pairs of amino acid types $i$ and $j$ within a certain distance, respectively, we can use the following to measure the preference for having these two types of amino acids close to each other:

$$-\ln\left(N_O(i,j)/N_E(i,j)\right).$$
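As a minimal illustration of the log-odds score above, the following Python sketch turns observed and expected pair counts into scores. The function name `pair_log_odds`, the dictionary layout, the optional pseudocount, and the toy counts are assumptions made for this sketch and are not taken from the chapter; how the expected counts are obtained is left to the caller.

```python
import math

def pair_log_odds(n_obs, n_exp, pseudocount=0.0):
    """Return -ln(N_O(i,j) / N_E(i,j)) for each residue-type pair.

    n_obs, n_exp: dicts mapping a pair such as ("A", "K") to a count.
    pseudocount: hypothetical smoothing term (not from the text) added to
    both counts so that pairs never observed do not produce log(0).
    """
    scores = {}
    for pair, expected in n_exp.items():
        observed = n_obs.get(pair, 0.0)
        scores[pair] = -math.log((observed + pseudocount) / (expected + pseudocount))
    return scores

# Toy counts, invented for illustration only.
observed = {("C", "C"): 200.0, ("A", "K"): 50.0}
expected = {("C", "C"): 100.0, ("A", "K"): 100.0}
scores = pair_log_odds(observed, expected)
```

With these toy numbers, a pair seen twice as often as expected scores -ln 2 (favorable), while a pair seen half as often as expected scores +ln 2 (unfavorable).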

While $N_O(i,j)$ can be collected by going through the protein structures in the sample set, an accurate estimation of $N_E(i,j)$ represents a challenge. There have been a number of proposed models for estimating this quantity. Among these models, the simplest one is the "independent reference state" model (Xu et al., 1998), in which $N_E(i,j)$ is estimated as follows:


Table 12.2 Pairwise scoring functions between 20 amino acid types

     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
A  -14
R   27   -2
N   10   -9  -43
D   22  -62  -42    2
C   33    7   11   28 -192
Q    3   -6  -20    7   19  -11
E   12  -56  -14   14   12    1    7
G    1   -8  -10  -27    9   -7   -3  -29
H    6  -26    6  -45  -19   27   37   -7  -45
I  -11   11   35   32   15   24   29   18   29  -33
L  -18   26   36   37   24    2   25   24   20  -16  -28
K   12   31  -20  -56   25  -18  -67   10    5   19   18   12
M   -7   30   31   21    5    3   14    1   -1   -1  -11   30  -49
F   -6    6   20   28    3    7   23   11   16  -10  -20   -2  -27  -21
P   17   -3  -21   -3   11   -8  -10   -7   -6   37   22   -5    3   -2  -21
S   17   -8  -22  -30    1  -16  -21  -19  -13   21   27   -6   19   11  -16  -18
T    6    6  -23  -20   37  -15  -21   -7  -24   11   22   -2   16   28  -10  -22  -21
W    5  -15   -2   10    5   -1   16   -7  -21   -2    8    3   -1    3  -43   13    9   -2
Y    5  -13    5   27    6   -9   27    6    3  -16   -9  -31  -17   -1   -8   10   16  -10   -1
V  -11   17   30   43   20   18   23   20   20  -23  -22   27   -5   -4    5   27    7   10  -11  -32

Table 12.2 shows a scoring function for the preference between the 20 types of amino acids using the above formulation (Xu et al., 1998).

There are more sophisticated models for defining the reference state, one of which is the uniform distribution model (Sippl, 1990; DeWitte and Shakhnovich, 1996; Lu and Skolnick, 2001; Samudrala and Moult, 1998), as discussed later. These more complex models take more factors into consideration in building the reference state, which is intended to make the energy models more accurate.

In addition to using physics- or statistics-based energy functions, researchers have incorporated evolutionary information into energy model building. One of the earlier major improvements in energy function modeling is the incorporation of sequence profile information (Panchenko et al., 2000; Zhou and Zhou, 2005) derived from homologous proteins into the energy models outlined above. It was noticed that when using a sequence profile of a protein family (or superfamily) rather than a single (query) protein sequence, threading accuracy could be significantly improved (Panchenko et al., 2000; Zhou and Zhou, 2005). The basic idea of this generalized approach is that rather than asking the question "will protein sequence A fit well with structural fold B?" we ask the more general question "will the whole family of protein A fit well with structural fold B?" If done properly, this approach could iron out some of the spurious predictions caused by the appearance of specific individual sequences. A threading problem now becomes a fitting problem between a sequence profile and a structural fold. A sequence profile is defined in terms of a multiple sequence alignment of the members of the query protein's family (or superfamily), with each element being the frequency distribution of the 20 amino acid types in this aligned position rather than a specific amino acid. To generalize the aforementioned energy functions to take into account the profile information, we can simply use the relative frequency of each amino acid in the position-dependent distribution as a weight factor when calculating the energy values for each amino acid


or amino acid pairs, and then sum over all possible amino acids or amino acid pairs. Specifically, let $P_i$ be the relative frequency of amino acid type $i$ in a particular aligned position, with $\sum_i P_i = 1.0$. Then the Eisenberg type of energy can be calculated as

$$-\sum_i P_i \ln\left(O_{i,j}/E_{i,j}\right).$$

Similarly, pairwise interaction energy could be generalized as follows:

$$-\sum_{i,j} P_i P_j \ln\left(N_O(i,j)/N_E(i,j)\right).$$
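The profile weighting described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the dictionary-based profile columns, and the environment label `"env1"` are hypothetical, and the score tables are assumed to already hold the per-residue and per-pair log-odds values. The numbers are taken from the A and C rows of Tables 12.1 and 12.2.

```python
def profile_singleton_energy(column, singleton_score, env):
    """Sum over residue types i of P_i * score(i, env).

    column: dict mapping residue type -> relative frequency P_i (sums to 1).
    singleton_score: dict mapping (residue type, environment) -> -ln(O/E).
    """
    return sum(p * singleton_score[(aa, env)] for aa, p in column.items())

def profile_pair_energy(column_a, column_b, pair_score):
    """Sum over residue types i, j of P_i * P_j * score(i, j).

    pair_score: dict keyed by an alphabetically sorted pair of residue types,
    holding -ln(N_O/N_E) for that pair.
    """
    return sum(
        pa * pb * pair_score[tuple(sorted((a, b)))]
        for a, pa in column_a.items()
        for b, pb in column_b.items()
    )

# Two aligned positions of a hypothetical profile.
col_1 = {"A": 0.5, "C": 0.5}
col_2 = {"C": 1.0}

# First-column values of Table 12.1 for A and C; "env1" is a placeholder label.
singleton = {("A", "env1"): -0.741, ("C", "env1"): -0.389}
# A-C and C-C entries of Table 12.2.
pairs = {("A", "C"): 33.0, ("C", "C"): -192.0}

e_single = profile_singleton_energy(col_1, singleton, "env1")
e_pair = profile_pair_energy(col_1, col_2, pairs)
```

Here `e_pair` is 0.5 × 33 + 0.5 × (−192) = −79.5, the frequency-weighted average of the two table entries; a single-sequence query is the special case in which each column has one residue type with frequency 1.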

Other types of energy functions have also been used in existing threading programs, including fitness scores between specific amino acids and the secondary structures in the template structure, and threading alignment gap penalties. Typically these energy scores are combined using a simple weighted sum, with scaling factors that are often determined empirically from training data.

It has been observed that a distance-dependent pairwise interaction energy can provide more accurate threading results than the distance-independent models outlined above. A distance-dependent energy can be estimated as follows:

$$u(i,j,r) = -\ln\frac{N_O(i,j,r)}{N_E(i,j,r)},$$

where $r$ is the distance between residues $i$ and $j$ (possibly measured between their C-beta atoms), $N_O(i,j,r)$ is the observed number of pairs of residues $(i,j)$ within a distance bin from $r - \Delta r/2$ to $r + \Delta r/2$ in a database of folded structures for some bin width $\Delta r$, and $N_E(i,j,r)$ is the expected number of pairs $(i,j)$ within the same distance bin. The challenging issue in accurately estimating the interaction energy $u(i,j,r)$ is how to estimate $N_E(i,j,r)$. Under the assumption that we are dealing with an ideal infinite liquid-state system within a volume $V$ and that residues are distributed uniformly (Sippl, 1990; DeWitte and Shakhnovich, 1996; Lu and Skolnick, 2001; Samudrala and Moult, 1998), $N_E(i,j,r)$ can be estimated using

$$N_E(i,j,r) = N_i \cdot N_j \cdot \left(4\pi r^2 \Delta r\right)/V,$$

where $N_i$ and $N_j$ are the numbers of amino acids of types $i$ and $j$ in the protein database, respectively. Researchers have realized that this model needs to be corrected when dealing with finite systems like a protein structure, to make the model more accurate when used in threading programs. Two particular corrections are made in the DFIRE (distance-scaled, finite ideal-gas reference state) energy model (Zhou and Zhou, 2002), a popular energy function for threading. In the first correction, DFIRE uses $r^{\alpha}$ instead of $r^2$, considering that the number of interaction pairs in a finite system could not actually reach the level of $r^2$ as in an infinite system, where $\alpha < 2$ is


determined through minimizing the distribution fluctuation of interaction distance on a set of training data. In the second correction, DFIRE assumes that only short-range interactions need to be considered. That is, the interaction energy becomes zero when the distance between the interacting pair is beyond a cutoff distance $r_{\text{cut}}$. After these corrections, the interaction energy can be estimated using the following formula:

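$$u(i,j,r) = \begin{cases} -\eta \ln \dfrac{N_O(i,j,r)}{\left(\dfrac{r}{r_{\text{cut}}}\right)^{\alpha} \dfrac{\Delta r}{\Delta r_{\text{cut}}}\, N_O(i,j,r_{\text{cut}})}, & r \le r_{\text{cut}}, \\[2ex] 0, & r > r_{\text{cut}}, \end{cases}$$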

where the constant $\eta$ is related to the system temperature and can be determined empirically.
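The two reference-state ideas described above, the uniform estimate of the expected pair count and the DFIRE-style energy with a distance-scaled reference and a cutoff, can be sketched in Python as below. The function names, the argument order, the default `eta=1.0`, and the choice to raise an error on empty bins are assumptions made for this sketch; the exponent `alpha` is left as an input because the text describes it as fitted on training data.

```python
import math

def uniform_expected(n_i, n_j, r, dr, volume):
    """Uniform reference state: N_i * N_j * (4*pi*r^2*dr) / V."""
    return n_i * n_j * 4.0 * math.pi * r ** 2 * dr / volume

def dfire_energy(n_obs_r, n_obs_cut, r, dr, r_cut, dr_cut, alpha, eta=1.0):
    """DFIRE-style pair energy for one residue-type pair and distance bin.

    n_obs_r:   observed pair count in the bin centred at r
    n_obs_cut: observed pair count in the bin at the cutoff distance r_cut
    dr, dr_cut: widths of those two bins
    alpha:     distance-scaling exponent (below 2 for finite systems)
    eta:       temperature-related constant; 1.0 is a placeholder value
    """
    if r > r_cut:
        return 0.0  # only short-range interactions contribute
    if n_obs_r <= 0 or n_obs_cut <= 0:
        raise ValueError("counts must be positive to take the logarithm")
    reference = (r / r_cut) ** alpha * (dr / dr_cut) * n_obs_cut
    return -eta * math.log(n_obs_r / reference)

# Toy numbers, invented for illustration only.
beyond_cutoff = dfire_energy(10, 10, r=20.0, dr=1.0, r_cut=10.0, dr_cut=1.0, alpha=1.5)
enriched = dfire_energy(200, 400, r=5.0, dr=1.0, r_cut=10.0, dr_cut=1.0, alpha=2.0)
```

In the second toy call the scaled reference count is (5/10)² × 400 = 100, so observing 200 pairs gives an energy of −ln 2, a favorable contact at that distance.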

These simple energy models have played key roles in making threading programs as popular and as useful as they are today. While they have helped to solve many structure prediction problems, their limitations have also become quite clear from the recent CASP prediction results: the improvement in prediction accuracy has been only incremental in the past few CASPs (Kinch et al., 2003). One of the key reasons for the incremental improvement is the crudeness of the threading energy functions. Currently, the algorithmic techniques for protein threading have advanced to a stage where they should be able to handle more sophisticated energy models in the threading framework, which could lead to more accurate predicted structures. We can expect that more detailed and more physics-based energy functions will emerge in the near future, as the field is clearly in need of more accurate energy models for protein threading.

12.4 Calculating Optimal Sequence-Structure Alignments

The general form of a threading energy function can be written as follows:

$$\alpha E_s + \beta E_p + E_{\text{gap}},$$

where $E_s$ measures the overall fitness of putting individual residue types into specific structural environments, $E_p$ measures the total interaction energy among pairs within the cutoff distance, and $E_{\text{gap}}$ represents the total penalty for the gaps in a sequence-structure alignment. The scaling factors $\alpha$ and $\beta$ are typically determined empirically through optimizing the performance of a threading program on a representative set of protein pairs. With the optimized $\alpha$ and $\beta$ values, the goal of threading is to find an alignment (or placement) between a query protein sequence and a template structure that optimizes the energy function.
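One simple way to picture the empirical choice of the two scaling factors is a grid search, sketched below in Python. This is an assumption about how such a fit could be done, not the procedure used by any particular threading program: the names `w_s` and `w_p` for the scaling factors, the training-set layout, the grid values, and the "correct alignment must score lowest" criterion are all hypothetical.

```python
from itertools import product

def total_energy(terms, w_s, w_p):
    """Weighted sum w_s * E_s + w_p * E_p + E_gap for one alignment."""
    e_s, e_p, e_gap = terms
    return w_s * e_s + w_p * e_p + e_gap

def fit_weights(training_set, grid):
    """Pick the (w_s, w_p) pair that ranks the most correct alignments first.

    training_set: list of (correct_terms, [decoy_terms, ...]), where each
    terms tuple is (E_s, E_p, E_gap) for one candidate alignment.
    Returns (number of cases ranked correctly, w_s, w_p); ties keep the
    first grid point encountered.
    """
    best = None
    for w_s, w_p in product(grid, repeat=2):
        hits = sum(
            all(total_energy(correct, w_s, w_p) < total_energy(decoy, w_s, w_p)
                for decoy in decoys)
            for correct, decoys in training_set
        )
        if best is None or hits > best[0]:
            best = (hits, w_s, w_p)
    return best

# One toy training case, invented for illustration only.
training = [((-10.0, -1.0, 2.0), [(-4.0, -6.0, 1.0)])]
result = fit_weights(training, grid=[0.5, 1.0, 2.0])
```

With this toy case the correct alignment only wins when the singleton term is weighted strongly enough relative to the pairwise term, and the first grid point that achieves this is `w_s = 1.0`, `w_p = 0.5`.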

In a sense, protein threading is like sequence-sequence alignment, as it finds an alignment between a sequence of amino acids and a sequence of structural positions