30
Bioinformatics for Glycobiology and Glycomics: an Introduction Edited by Claus-Wilhelm von der Lieth Formerly at the Molecular Structure Analysis Core Facility Deutsches Krebsforschungszentrum (German Cancer Research Center) Heidelberg, Germany Thomas L ¨ utteke Faculty of Veterinary Medicine Institute of Biochemistry and Endocrinology Justus-Liebig University Gießen Gießen, Germany and Martin Frank Molecular Structure Analysis Core Facility Deutsches Krebsforschungszentrum (German Cancer Research Center) Heidelberg, Germany A John Wiley & Sons, Ltd, Publication

Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

Bioinformatics for Glycobiologyand Glycomics: an Introduction

Edited by

Claus-Wilhelm von der LiethFormerly at the Molecular Structure Analysis Core Facility

Deutsches Krebsforschungszentrum (German Cancer Research Center)Heidelberg, Germany

Thomas LuttekeFaculty of Veterinary Medicine

Institute of Biochemistry and EndocrinologyJustus-Liebig University Gießen

Gießen, Germany

and

Martin FrankMolecular Structure Analysis Core Facility

Deutsches Krebsforschungszentrum (German Cancer Research Center)Heidelberg, Germany

A John Wiley & Sons, Ltd, Publication

Page 2: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

Page 3: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

Bioinformatics for Glycobiologyand Glycomics

Page 4: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

Page 5: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

Bioinformatics for Glycobiologyand Glycomics: an Introduction

Edited by

Claus-Wilhelm von der LiethFormerly at the Molecular Structure Analysis Core Facility

Deutsches Krebsforschungszentrum (German Cancer Research Center)Heidelberg, Germany

Thomas LuttekeFaculty of Veterinary Medicine

Institute of Biochemistry and EndocrinologyJustus-Liebig University Gießen

Gießen, Germany

and

Martin FrankMolecular Structure Analysis Core Facility

Deutsches Krebsforschungszentrum (German Cancer Research Center)Heidelberg, Germany

A John Wiley & Sons, Ltd, Publication

Page 6: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

This edition first published 2009, C© 2009 by John Wiley & Sons Ltd.

Wiley-Blackwell is an imprint of John Wiley & Sons, formed by the merger of Wiley’s global Scientific,Technical and Medical business with Blackwell Publishing.

Registered officeJohn Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Other Editorial Offices:9600 Garsington Road, Oxford, OX4 2DQ, UK111 River Street, Hoboken, NJ 07030-5774, USA

For details of our global editorial offices, for customer services and for information about how to apply forpermission to reuse the copyright material in this book please see our website atwww.wiley.com/wiley-blackwell.

The right of the author to be identified as the author of this work has been asserted in accordance with theCopyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, inany form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permittedby the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not beavailable in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand namesand product names used in this book are trade names, service marks, trademarks or registered trademarks oftheir respective owners. The publisher is not associated with any product or vendor mentioned in this book. Thispublication is designed to provide accurate and authoritative information in regard to the subject matter covered.It is sold on the understanding that the publisher is not engaged in rendering professional services. Ifprofessional advice or other expert assistance is required, the services of a competent professional should besought.

Library of Congress Cataloging-in-Publication Data

Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth,Thomas Lutteke, and Martin Frank.

p. ; cm.Includes bibliographical references and index.ISBN 978-0-470-01667-11. Glycomics. 2. Glycoconjugates–Research–Data processing. 3. Bioinformatics.

I. Lieth, Claus-Wilhelm von der. II. Lutteke, Thomas. III. Frank, Martin, 1963–[DNLM: 1. Glycomics–methods. 2. Carbohydrates–chemistry. 3. Computational Biology–methods.

4. Glycoproteins–chemistry. QU 75 B6155 2009]QP702.G577B56 2009572′.56–dc22

2009023580

A catalogue record for this book is available from the British Library.

Set in 10/12pt Times by Aptara Inc., New Delhi, India.Printed in Singapore by Markono Print Media Pte Ltd.

First Impression 2009

Page 7: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

Contents

List of Contributors ix

Preface xvClaus-Wilhelm von der Lieth

Section 1: Introduction1. Glycobiology, Glycomics and (Bio)Informatics 3

Claus-Wilhelm von der Lieth

Section 2: Carbohydrate Structures2. Introduction to Carbohydrate Structure and Diversity 23

Stephan Herget, Rene Ranzinger, Robin Thomson,Martin Frank and Claus-Wilhelm von der Lieth

3. Digital Representations of Oligo- and Polysaccharides 49Stephan Herget and Claus-Wilhelm von der Lieth

4. Evolutionary Considerations in Studying the Sialome: SialicAcids and the Host–Pathogen Interface 69Amanda L. Lewis and Ajit Varki

Section 3: Carbohydrate-active Enzymes and Glycosylation5. Carbohydrate-active Enzymes Database: Principles and

Classification of Glycosyltransferases 91Pedro M. Coutinho, Corinne Rancurel, Mark Stam,Thomas Bernard, Francisco M. Couto,Etienne G. J. Danchin and Bernard Henrissat

6. Other Databases Providing Glycoenzyme Data 119Thomas Lutteke and Claus-Wilhelm von der Lieth

7. Bioinformatics Analysis of Glycan Structures from aGenomic Perspective 125Kiyoko F. Aoki-Kinoshita and Minoru Kanehisa

8. Glycosylation of Proteins 143Claus-Wilhelm von der Lieth and Thomas Lutteke

Page 8: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

vi Contents

9. Prediction of Glycosylation Sites in Proteins 163Karin Julenius, Morten B. Johansen, Yu Zhang,Søren Brunak and Ramneek Gupta

Section 4: Experimental Methods – Bioinformatic Requirements10. Experimental Methods for the Analysis of Glycans and Their

Bioinformatics Requirements 195Claus-Wilhelm von der Lieth

11. Analysis of N- and O-Glycans of Glycoproteins byHPLC Technology 203Anthony H. Merry and Sviatlana A. Astrautsova

12. Glycomic Mass Spectrometric Analysis and DataInterpretation Tools 223Niclas G. Karlsson and Nicolle H. Packer

13. Software Tools for Semi-automatic Interpretation of MassSpectra of Glycans 257Kai Maass and Alessio Ceroni

14. Informatics Concepts to Decode Structure-FunctionRelationships of Glycosaminoglycans 269Rahul Raman, S. Raguram and Ram Sasisekharan

15. NMR Databases and Tools for Automatic Interpretation ofSpectra of Carbohydrates 295Claus-Wilhelm von der Lieth

16. Automatic Spectrum Interpretation Based on IncrementRules: CASPER 311Roland Stenutz

17. Interpretation of 13C NMR Spectra by Artificial Neural NetworkTechniques (NeuroCarb) 321Andreas Stoeckli, Matthias Studer, Brian Cutting andBeat Ernst

Section 5: 3D Structures of Complex Carbohydrates18. Conformational Analysis of Carbohydrates – A

Historical Overview 337Martin Frank

19. Predicting Carbohydrate 3D Structures UsingTheoretical Methods 359Martin Frank

20. Synergy of Computational and Experimental Methods inCarbohydrate 3D Structure Determination and Validation 389Thomas Lutteke and Martin Frank

Page 9: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

Contents vii

Section 6: Protein–Carbohydrate Interaction21. Structural Features of Lectins and Their Binding Sites 415

Remy Loris

22. Statistical Analysis of Protein–Carbohydrate ComplexesContained in the PDB 433Thomas Lutteke and Claus-Wilhelm von der Lieth

Section 7: AppendicesAppendix 1: List of Available Websites 449Appendix 2: Glossary 453

Index 461

Page 10: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

Page 11: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

List of Contributors

Kiyoko F. Aoki-KinoshitaDepartment of Bioinformatics, Faculty of Engineering, Soka University, Tokyo 192-8577,Japan

Sviatlana A. AstrautsovaDepartment of Microbiology, Medical University of Grodno, Grodno, Belarus

Thomas BernardArchitecture et Fonction des Macromolecules Biologiques, UMR6098, CNRS, Universitesd’Aix-Marseille I & II, 13402 Marseille cedex 20, France

Søren BrunakCenter for Biological Sequence Analysis, BioCentrum-DTU, Technical University ofDenmark, 2800 Lyngby, Denmark

Alessio CeroniDivision of Molecular Biosciences, Imperial College London, London SW7 2AZ, UK

Pedro M. CoutinhoArchitecture et Fonction des Macromolecules Biologiques, UMR6098, CNRS, Universitesd’Aix-Marseille I & II, 13402 Marseille cedex 20, France

Francisco M. CoutoDepartamento de Informatica, Faculdade de Ciencias, Universidade de Lisboa, 1749-016Lisbon, Portugal

Brian CuttingInstitute of Molecular Pharmacy, Pharmacenter of the University of Basel, 4056 Basel,Switzerland

Etienne G. J. DanchinArchitecture et Fonction des Macromolecules Biologiques, UMR6098, CNRS, Universitesd’Aix-Marseille I & II, 13402 Marseille cedex 20, France

Beat ErnstInstitute of Molecular Pharmacy, Pharmacenter of the University of Basel, 4056 Basel,Switzerland

Martin FrankDeutsches Krebsforschungszentrum (German Cancer Research Center), Molecular Struc-ture Analysis Core Facility – W160, 69120 Heidelberg, Germany

Page 12: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

x List of Contributors

Ramneek GuptaCenter for Biological Sequence Analysis, BioCentrum-DTU, Technical University ofDenmark, 2800 Lyngby, Denmark

Bernard HenrissatArchitecture et Fonction des Macromolecules Biologiques, UMR6098, CNRS, Universitesd’Aix-Marseille I & II, 13402 Marseille cedex 20, France

Stephan HergetDeutsches Krebsforschungszentrum (German Cancer Research Center), Molecular Struc-ture Analysis Core Facility – W160, 69120 Heidelberg, Germany

Morten B. JohansenCenter for Biological Sequence Analysis, BioCentrum-DTU, Technical University ofDenmark, 2800 Lyngby, Denmark

Karin JuleniusDivision of Matrix Biology, Department of Medical Biochemistry and Biophysics,Karolinska Institutet, 17177 Stockholm, and Stockholm Bioinformatics Center, SCFAB,Stockholm University, 10691 Stockholm, Sweden

Minoru KanehisaBioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011,Japan

Niclas G. KarlssonCentre for BioAnalytical Sciences, Chemistry Department, NUI Galway, Galway,Ireland

Amanda L. LewisGlycobiology Research and Training Center, Departments of Medicine, Biological Sciencesand Cellular and Molecular Medicine, University of California at San Diego, La Jolla, CA92093, USA

Remy LorisStructural Biology Brussels, Vrije Universiteit Brussel and Department of Molecular andCellular Interactions, VIB, Pleinlaan 2, B-1050 Brussels, Belgium

Thomas LuttekeFaculty of Veterinary Medicine, Institute of Biochemistry and Endocrinology, Justus-LiebigUniversity Gießen, 35392 Gießen, Germany

Kai MaassInstitute of Biochemistry, University of Gießen, 35392 Gießen, Germany

Anthony H. MerryGlycosciences Consultancy, Charlbury OX7 3HB, UK

Page 13: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

List of Contributors xi

Nicolle H. PackerBiomolecular Frontiers Research Centre, Department of Chemistry and Biomolecular Sci-ences, Macquarie University, North Ryde, Sydney, NSW 2109, Australia

S. RaguramBiological Engineering Division, Harvard-MIT Division of Health Sciences and Tech-nology, Center for Biomedical Engineering, Center for Environmental Health Sciences,Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Rahul RamanBiological Engineering Division, Harvard-MIT Division of Health Sciences and Tech-nology, Center for Biomedical Engineering, Center for Environmental Health Sciences,Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Corinne RancurelArchitecture et Fonction des Macromolecules Biologiques, UMR6098, CNRS, Universitesd’Aix-Marseille I & II, 13402 Marseille cedex 20, France

Rene RanzingerDeutsches Krebsforschungszentrum (German Cancer Research Center), Molecular Struc-ture Analysis Core Facility – W160, 69120 Heidelberg, Germany

Ram SasisekharanBiological Engineering Division, Harvard-MIT Division of Health Sciences and Tech-nology, Center for Biomedical Engineering, Center for Environmental Health Sciences,Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Mark StamArchitecture et Fonction des Macromolecules Biologiques, UMR6098, CNRS, Universitesd’Aix-Marseille I & II, 13402 Marseille cedex 20, France

Roland StenutzDepartment of Organic Chemistry, Stockholm University, 106 91 Stockholm, Sweden

Andreas StoeckliInstitute of Molecular Pharmacy, Pharmacenter of the University of Basel, 4056 Basel,Switzerland

Matthias StuderInstitute of Molecular Pharmacy, Pharmacenter of the University of Basel, 4056 Basel,Switzerland

Robin ThomsonInstitute for Glycomics, Griffith University - Gold Coast Campus, Queensland 4222,Australia

Ajit VarkiGlycobiology Research and Training Center, Departments of Medicine, Biological Sciencesand Cellular and Molecular Medicine, University of California at San Diego, La Jolla, CA92093, USA

Page 14: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

xii List of Contributors

Claus-Wilhelm von der LiethFormerly at Deutsches Krebsforschungszentrum (German Cancer Research Center), Molec-ular Structure Analysis Core Facility – W160, 69120 Heidelberg, Germany

Yu ZhangCenter for Biological Sequence Analysis, BioCentrum-DTU, Technical University ofDenmark, 2800 Lyngby, Denmark

Page 15: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

Dr Claus-Wilhelm “Willi” von der Lieth1949–2007On November 16, 2007, Dr Claus–Wilhelm (Willi) von derLieth – the ‘father’ of this book – passed away unexpectedly atthe age of 58. He is greatly missed as a friend and colleague.

Willi was born in 1949 in Bremervorde, Germany. He stud-ied chemistry at the Technical University of Hannover where hereceived his Diploma in 1977. He continued his studies in theo-retical chemistry at the University of Heidelberg and receivedhis doctoral degree in 1980. In the same year he joined theGerman Cancer Research Center (DKFZ Heidelberg) wherehe became the head of the Molecular Modeling group within the Central SpectroscopyDepartment in 1987. At the DKFZ, he developed a computer-based spectroscopic informa-tion system and provided a wide variety of modeling services for many years. In the late1980s Willi entered the carbohydrate field through the application of modeling methods tosupport conformational analysis of complex oligosaccharides by NMR. Being a visionary,Willi realized that the Internet offered an opportunity to provide, without barriers, scientificinformation and tools to a large community. This led in the 1990s to the first web-basedmolecular builder for carbohydrate 3D structures (SWEET), which is still used by scientiststoday. In the late 1990s Willi initiated the SWEET-DB project which aimed to use modernweb techniques to make existing carbohydrate-related data collections available over theInternet, and to create an interface that allowed glycoscientists to find important data forcompounds of interest in a compact and well-structured representation. The data sourcesof SWEET-DB were structures and literature references from the – at that time discon-tinued – Complex Carbohydrate Structure Database (CarbBank), NMR data taken fromSugaBase, and 3D co-ordinates generated with SWEET-II. An automated link to the NCBIPubMed service was established which allowed scientists to find literature for a particularcarbohydrate structure very easily. Over the years more services and tools were developedwhich ultimately led to the GLYCOSCIENCES.de web portal, which is currently one ofthe largest scientific resources for carbohydrate structure related data.

Willi was internationally recognized as a pioneer, and a global leader, in the field ofglycoinformatics. In recent years Willi was very active in international efforts to integrateand crosslink existing carbohydrate databases based on the philosophy of open access.He was Coordinator of the EUROCarbDB project (an EU funded design study to createthe foundations of databases and bioinformatics tools for glycobiology and glycomics),a co-director of HGPI/HUPO (the Human Disease Glycomics/Proteome Initiative), and amember of the US Consortium for Functional Glycomics.

Willi served as Treasurer of the German Chapter of the Molecular Graphics Societyand as Referee for several journals, was a member of the Editorial Board of CarbohydrateResearch, and contributed over 100 journal articles and book chapters in the fields ofmolecular modeling, computer-based information systems, and glycoscience.

Willi’s sudden death was a great professional and personal loss for all those who had theprivilege to work with him. Willi was an exceptional personality; his boundless enthusiasm,creativity, and active interest in so many research areas was contagious. His uncomplicatedinteractions with colleagues and students revealed his openness, modesty and generosity.

This book is dedicated to his memory.

Page 16: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

Page 17: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

Preface

Why Is It Timely to Publish a Book on (Bio)informatics forGlycobiology and Glycomics?

The four essential molecular building blocks of cells are nucleic acids, proteins, lipids,and carbohydrates, often also referred to as glycans. Nucleotide and protein sequencesare the heart of nearly all bioinformatics applications and research, whereas glycan andlipid structures have been widely neglected. Glycans are the most abundant and diverseof Nature’s biopolymers. Complex carbohydrates are often covalently attached to proteinsand lipids and thus constitute a significant amount of the mass and structural variation inbiological systems. The field of glycobiology is focused upon understanding the structure,chemistry, biosynthesis, and biological function of glycans and their derivatives.

It has long been known that carbohydrates encode biological information. For example,it was shown already in 1952 that variation of blood group determinants is a consequenceof glycosylation, that is, the addition of complex carbohydrates to proteins and lipids.Thus, the chemistry, biochemistry, and biology of carbohydrates were prominent areas ofresearch during the beginning and the middle of the last century. Nevertheless, determiningthe biological consequences of glycosylation has been extremely difficult. In an editorialarticle in the March 2001 edition of Science devoted to glycobiology, it was described asa “Cinderella field” of research. This means that it is “an area (of research) that involvesmuch work but, alas, does not get to show off at the ball with her cousins, the genomes andproteins.”

With the awareness that the human genome encodes for a significantly smaller number ofgenes than estimated from genomes of lower organisms such as yeast [1], it became obviousthat each gene can be used in different ways depending on how it is regulated. Consequently,the study of post-translational protein modifications, which can alter the functions ofproteins, has entered an era of renaissance and come increasingly into the scientific focus.Glycobiology research has attracted increasing attention because glycosylation is the mostcomplex and most frequently occurring post-translational modification. Similar to thedevelopments in genomics and proteomics, high-throughput glycomics projects to decipherthe role of carbohydrates in health and disease are emerging [2–9]. With the increasingamount of experimental data, the need to develop appropriate glycan related databases andbioinformatics tools is obvious. However, until recently, informatics have been only poorlyrepresented in glycobiology [10–12].

Unfortunately, most of the tools and applications developed in bioinformatics for thedescription and analysis of DNA (RNA) and protein sequences cannot be directly applied tocarbohydrates. This is mainly due to the fact that oligosaccharides can exhibit various waysto link together their building blocks, the monosaccharides. In nature, monosaccharides are

Page 18: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

xvi Preface

found which are connected to up to four others. This has the consequence that branchedstructures can be formed. Such structures can no longer be described as linear sequences, butrather need to be described as topologies of specifically connected building blocks. In thisrespect, the encoding of complex carbohydrates based on building blocks is more similarto the digital description of organic molecules, which are described in chemoinformaticsthrough the topology of atoms.

As with proteomics, mass spectrometry (MS) has become a key technology for theidentification of glycans. The experimental procedures in MS-based glycomics analysis aresimilar to those applied in proteomics. However, until recently, no efficient software toolswere available to interpret the spectra. Therefore, the annotation of spectra and assignmentof glycan structures to the mass peaks had to be done manually by an expert.

The lack of efficient automatic assignment procedures is still the major bottleneck foran automatic high-throughput analysis of glycans in glycomics projects. Consequently,several computational attempts to overcome this unfavorable situation have been publishedin recent years. The developed algorithms were mainly implemented by experimentalistsfocused on solving the specific needs of their experimental setup and scientific questions.However, it is obvious that glycomics calls for more general solutions, highly sophisticatedalgorithmic approaches, and standardization.

Since the beginning of this century, a small but rather active community of researchersemerged with the aim of working out the foundations of the informatics for glycobiology andglycomics. The development and use of informatics tools and databases for glycobiologyand glycomics research have thus increased considerably in recent years. However, this fieldmust still be considered as being in its infancy compared with genomics and proteomics.

It is the aim of this book to give an introduction to this emerging field of sciencefor the experimentalist working in glycobiology and glycomics, and also for the computerscientists looking for new challenges in the development of highly sophisticated algorithmicapproaches.

Glycomics: an Exotic and Somewhat Forgotten Area of Bioinformatics

The European Bioinformatics Institute (EBI) (www.ebi.ac.uk/) [13, 14] is one of the largestcenters world-wide for research and services in bioinformatics, managing a broad range offreely available databases of biological sequences, information, and knowledge. However,in 2007, the EBI did not provide access to any collection containing glycan structures. TheUS National Center for Biotechnology Information (NCBI) [15] (www.ncbi.nlm.nih.gov/)provides the PubChem service (pubchem.ncbi.nlm.nih.gov/), a fairly new service con-taining information on the biological activities of small molecules, which also includescarbohydrate structures. However, the main focus of PubChem is to provide access to gly-cans that can be used as chemical probes or ligands to study the functions of genes, cells, andbiochemical pathways. The Japanese Kyoto Encyclopedia of Genes and Genomes (KEGG)(www.genome.ad.jp/) [16] is a suite of databases and associated software that aims to inte-grate the current knowledge on molecular interaction networks in biological processes withthe information about the universe of genes and proteins, in addition to information aboutthe universe of chemical compounds and reactions. KEGG contains a GLYCAN databasewith about 11 000 structures [17]. Most of the data were taken from CarbBank [18, 19], thelargest attempt to build up a comprehensive collection of all known carbohydrate structures

Page 19: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

Preface xvii

during the 1980s and 1990s (see also the paragraph on the history of glyco-related databasesin Chapter 1). KEGG is currently the most developed project which links glycan structureswith available proteomics and glycomics data though biosynthetic pathways.

The Bioinformatics Links Directory (www.bioinformatics.ca/links directory/) [20, 21],an online community resource that contains a directory of freely available tools, databases,and resources for bioinformatics and molecular biology research – which is compiled bycollecting applications which have been published in the annual Nucleic Acids ResearchWeb issues – lists only six tools dealing with the bioinformatics of carbohydrate structures.This is in contrast to the 376 useful resources for DNA sequence analyses reported, and the651 links to useful resources for protein sequence and structure analyses, which includealso tools for phylogenetic analyses, prediction of protein structures and functions, andanalyses of protein–protein interactions. On the other hand, collections of links compiledwith special emphasis on glyco-related web applications (see, e.g., www.eurocarbdb.org)already show more than 60 dedicated websites. This discrepancy reflects the current pointsof view from both sides: while the bioinformatics community widely ignores the exis-tence of macromolecules other than DNAs, RNAs, and proteins, scientists developing soft-ware applications for glycobiology do not regard themselves as part of the bioinformaticscommunity.

The Role of (Bio)informatics in Glycobiology Research

The involvement of (bio)informatics in glycobiology research can be divided into:

1. The application of classical bioinformatics tools to analyze the DNA and protein se-quences, which have a relation to carbohydrates. Such sequences may be the proteinsto which glycans are attached, the enzymes which build or modify oligosaccharides,or the lectins which recognize a certain sugar epitope.

2. Applications and databases where an explicit description of the glycan structure isrequired. All analytical tools to determine glycan structures and structure–functionrelations depend heavily on appropriate encoding of glycan structures.

3. Atomic descriptions of carbohydrate-protein recognition processes in which the spa-tial structures of complex carbohydrates, their conformational preferences, and theirenergetics are analyzed.

In this book, we try to provide a comprehensive overview of all three areas of active research.The chapters are written by active researchers who have made essential contributions to thedevelopment of the field of glycobioinformatics in recent years.

Use of Classical Bioinformatics Databases and Tools

Chapter 1, Glycobiology, Glycomics, and (Bio)Informatics, briefly describes the biologicalrole of carbohydrates, their biosynthesis, and the enzymes which are responsible for thestepwise synthesis of the branched oligosaccharides – the glycosyltransferases. Specialemphasis is placed on the role of bioinformatics in accelerating the identification of humancarbohydrate active enzymes with the help of bioinformatics databases and appropriatealignment tools.

Page 20: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

xviii Preface

Chapter 5, Carbohydrate-active Enzymes Database: Principles and Classification ofGlycosyltransferases, gives a comprehensive overview of the enzymes responsible forthe biosynthesis of the glycosidic bonds in living organisms. The authors develop andmaintain the CAZy database, the world’s most complete resource describing the familiesof structurally related catalytic and carbohydrate-binding modules (or functional domains)of enzymes that degrade, modify, or create glycosidic bonds. The subsequent chapter givesa short overview of the focus of other existing data resources which provide access toglyco-related enzymes.

Chapter 7, Bioinformatics Analysis of Glycan Structures from a Genomic Perspective,describes how informatics can help to elucidate the biological function of glycans, whichare intertwined with the rest of the biological system such as interacting proteins andchemical compounds. In this chapter, an approach is presented based on the data of theKyoto Encyclopedia of Genes and Genomes (KEGG) including a comprehensive glycandata resource called KEGG GLYCAN. It encompasses all aspects of the biological sys-tem, incorporating genomic information integrated with pathways and reactions and alsochemical compounds.

Chapter 8, Glycosylation of Proteins, gives a short introduction to this phenomenon,and Section 8.2 therein, GlySeq: an Analysis of Experimentally Determined OccupiedGlycosylation Sites, describes services which provide access to the respective glyco-relatedexperimental data contained in the Protein Data Bank (PDB) [22] and SWISS-Prot. TheGlySeq service – although focusing on analyzing the data from the viewpoint of carbohy-drates – provides access to both moieties: the carbohydrates and the proteins. Glycosylationis known to affect protein folding, localization, and trafficking, protein solubility, antigenic-ity, biological activity and half-life. Chapter 9, Prediction of Glycosylation Sites in Proteins,summarizes the ideas of pattern recognition for protein glycosylation site prediction frompeptide sequence alone. It provides a general introduction to data-driven prediction methodsto solving this problem, including a discussion on artificial neural networks.

Informatics Applications Where a Special Encoding of GlycanStructures is Required

Due to the structural complexity of complex carbohydrates which can form highly branchedstructures, most of the tools and applications developed in classical bioinformatics for thedescription and analysis of DNA (RNA) and protein sequences cannot be directly appliedto carbohydrates.

Chapter 2, Introduction to Carbohydrate Structure and Diversity, provides a short de-scription of the major types of carbohydrate structural motifs found in nature. The structuraldiversity of carbohydrates that is currently available in databases has been analyzed and isalso presented. As excellent reference books are available summarizing the cellular locationand biological function of the various types of glycan, the Editors decided not to repeat thisinformation here and recommend that readers consult the existing compendia for furtherreading.

Special encoding schemes are required which are able to describe all structural features ofcomplex carbohydrates found in nature. Unfortunately, no standard description existed untilrecently that was capable of coping with all structural features of carbohydrates as needed,for example, for the emerging glycomics projects. Chapter 3, Digital Representations of

Page 21: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

Preface xix

Oligo- and Polysaccharides, gives a comprehensive overview of all structural featureswhich have to be encoded to cope with all carbohydrate-specific structural elements. Itdiscusses various often used digital formats, which have so far been suggested and are usedfor specific applications or to encode carbohydrate structures in databases. The chaptergives an outline of future directions of developments.

The introductory part of the book is rounded off by a chapter on evolutionary con-siderations in studying the structural diversity of the most important class of terminalmonosaccharides – the sialic acids. The contribution suggests that the structural diversityof sialic acids reflects the often conflicting pressures of evading pathogens, while simul-taneously maintaining endogenous functions. Since most pathogens replicate much fasterthan their hosts, they can rapidly evolve different ways to target or mimic structures thatare critical for host processes, which may be especially relevant to glycans.

An in-depth understanding of the biological functions of complex carbohydrates requiresa detailed knowledge of all structural features of their primary sequence and also theconformational space that they can access. Analysis of carbohydrates has proved to bedifficult in the past. Fortunately, modern analytical methods have the ability to elucidatemost structural details at the concentration levels required for glycomics projects. However,at present, informatics tools give only limited support to enable an automatic, reliableinterpretation of the vast amount of data recorded by the analytical methods. This deficiencycurrently represents a severe bottleneck for the practical implementation of high-throughputglycomics projects. Therefore, it is not surprising that the development of algorithms andtools to interpret analytical data constitutes the most active field of software design in theglycomics field.

The chapters included in Section IV, Experimental Methods – Bioinformatics Require-ments, summarize the status of analytical procedures used in glycomics and the algorithms,software tools, and services which are available to support the interpretation of data.Similarly to proteomics, MS, HPLC and NMR are the main experimental techniques inglycomics research. However, the concrete experimental procedures applied by variousresearchers vary considerably, so that it is necessary to outline the analytical proceduressince otherwise the informatics requirements would be difficult to explain.

Spatial Structures of Carbohydrates and Atomic Descriptions ofCarbohydrate–Protein Recognition Processes

The elucidation of conformational preferences of complex carbohydrate structures has along history, which dates back to the early 1980s. Chapter 18, Conformational Analysisof Carbohydrates – A Historical Overview, reviews these developments, and is followedby Chapter 19, Predicting Carbohydrate 3D Structures Using Theoretical Methods, wherethe various commonly used theoretical approaches and simulation methods are introducedand their applications to predict 3D structures of carbohydrates are discussed. Section 19.4,Generation of 3D Structures of Glycoproteins, describes a service available through theGLYCOSCIENCES.de portal [23], which generates 3D structures of glycoproteins. Chapter20, Synergy of Computational and Experimental Methods in Carbohydrate 3D StructureDetermination and Validation, describes methods to find and analyze experimental 3Dstructures of carbohydrates in the Protein Data Bank.

Page 22: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

xx Preface

Lectins are carbohydrate-binding proteins or glycoproteins which often recognize aspecific sugar epitope and are thus important for a broad variety of specific recognitionprocesses and signaling events. Crystal structures of members of the different animal andplant lectin families have revealed a wide variety of lectin folds and carbohydrate bindingsite architectures. Despite this large variability, a number of interesting cases of bothconvergent and divergent evolution among plant, animal, and bacterial lectins are noted.These similarities are reviewed in Chapter 21, Structural Features of Lectins and TheirBinding Sites.

Chapter 22, Statistical Analysis of Protein–Carbohydrate Complexes Contained in thePDB, provides a detailed overview of the interactions, which specific carbohydrates orclasses of glycans exhibit with proteins in the available experimentally determined protein–carbohydrate complexes.

Current Status of Informatics for Glycosciences

Recent years have seen a variety of new databases and software tools emerging in the fieldof glycomics. However, the current situation in glycobioinformatics is characterized bythe existence of multiple disconnected and incompatible islands of experimental data, dataresources, and specific applications, managed by various consortia, institutions, or localgroups. For example, no comprehensive carbohydrate data collections similar to thosecurrently available for genomic and proteomic data have been compiled so far. There iscurrently no location where information about all carbohydrates reported in peer-reviewedscientific papers is systematically stored. Procedures (similar to those for protein sequences)have not yet been established for scientists to report the observation of specific glycanstructures in specific environments and to store these observations in a generally accepteddatabase.

These are the reasons why we have chosen not to describe in detail the currentlyavailable glyco-related databases. It can be anticipated that the development of databaseswill be subject to rather rapid changes within the next few years, so that the descriptionsprovided could quickly become obsolete. As compensation, we provide a comprehensivelist of web links to glyco-related databases, services, consortia, and communities. As manyof the listed URLs are maintained by larger consortia and longer lasting bioinformaticsprojects, it is the hope that these will provide a more sustainable solution for this book.

Acknowledgments

The authors thank Dr Robin Thomson, Institute for Glycomics, Griffith University, forcarefully reading the manuscript and many useful suggestions to improve its readability.

Heidelberg, June 2007 Claus-Wilhelm von der LiethMartin Frank

Thomas Lutteke

References

1. International Human Genome Sequencing Consortium: Finishing the euchromatic sequence ofthe human genome. Nature 2004, 431:931–945.

Page 23: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

Preface xxi

2. Haslam SM, North SJ, Dell A: Mass spectrometric analysis of N- and O-glycosylation of tissuesand cells. Curr. Opin. Struct. Biol. 2006, 16:584–591.

3. Mechref Y, Novotny MV: Miniaturized separation techniques in glycomic investigations.J. Chromatogr. B: Anal. Technol. Biomed. Life Sci. 2006, 841:65–78.

4. Alvarez RA, Blixt O: Identification of ligand specificities for glycan-binding proteins usingglycan arrays. Methods Enzymol. 2006, 415:292–310.

5. Blixt O, Head S, Mondala T, Scanlan C, Huflejt ME, Alvarez R, Bryan MC, Fazio F, CalareseD, Stevens J, et al.: Printed covalent glycan array for ligand profiling of diverse glycan bindingproteins. Proc. Natl. Acad. Sci. USA 2004, 101:17033–17038.

6. Feizi T, Chai W: Oligosaccharide microarrays to decipher the glyco code. Nat. Rev. Mol. CellBiol. 2004, 5:582–588.

7. Stevens J, Blixt O, Paulson JC, Wilson IA: Glycan microarray technologies: tools to survey hostspecificity of influenza viruses. Nat. Rev. Microbiol. 2006, 4:857–864.

8. Prescher JA, Bertozzi CR: Chemical technologies for probing glycans. Cell 2006, 126:1–4.9. Pratt MR, Bertozzi CR: Synthetic glycopeptides and glycoproteins as tools for biology. Chem.

Soc. Rev. 2005, 34:58–68.10. von der Lieth CW, Bohne-Lang A, Lohmann K, Frank M: Bioinformatics for glycomics: status,

methods, requirements and perspectives. Brief Bioinform. 2004, 5:164–178.11. von der Lieth CW, Lutteke T, Frank M: The role of informatics in glycobiology research with spe-

cial emphasis on automatic interpretation of MS spectra. Biochim. Biophys. Acta 2006, 1760:568–577.

12. von der Lieth CW: An endorsement to create open databases for analytical data of complexcarbohydrates. J. Carbohydr. Chem. 2004, 23:277–297.

13. Brooksbank C, Camon E, Harris MA, Magrane M, Martin MJ, Mulder N, O’Donovan C, Parkin-son H, Tuli MA, Apweiler R, et al.: The European Bioinformatics Institute’s data resources..Nucleic Acids Res. 2003, 31:43–50.

14. Brooksbank C, Cameron G, Thornton J: The European Bioinformatics Institute’s data resources:towards systems biology. Nucleic Acids Res. 2005, 33:D46–D53.

15. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DicuccioM, Edgar R, Federhen S, et al.: Database resources of the National Center for BiotechnologyInformation. Nucleic Acids Res. 2007, 35:D3–D4.

16. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for decipheringthe genome. Nucleic Acids Res. 2004, 32:D277–280.

17. Hashimoto K, Goto S, Kawano S, Aoki-Kinoshita KF, Ueda N, Hamajima M, Kawasaki T,Kanehisa M: KEGG as a glycome informatics resource. Glycobiology 2006, 16:63R–70R.

18. Doubet S, Bock K, Smith D, Darvill A, Albersheim P: The Complex Carbohydrate StructureDatabase. Trends Biochem. Sci. 1989, 14:475–477.

19. Doubet S, Albersheim P: CarbBank. Glycobiology 1992, 2:505.20. Fox JA, Butland SL, McMillan S, Campbell G, Ouellette BF: The Bioinformatics Links Directory:

a compilation of molecular biology web servers. Nucleic Acids Res. 2005, 33:W3–W24.21. Fox JA, McMillan S, Ouellette BF: A compilation of molecular biology web servers: 2006 update

on the Bioinformatics Links Directory. Nucleic Acids Res. 2006, 34:W3–W5.22. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne

PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28:235–242.23. Lutteke T, Bohne-Lang A, Loss A, Gotz T, Frank M, von der Lieth CW: GLYCOCIENCES.de: an

internet portal to support glycomics and glycobiology research. Glycobiology 2006, 16:71R–81R.

Page 24: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTE/OTE/SPH P2: OTEFM JWBK380/von der Lieth September 21, 2009 20:14 Printer Name: Yet to Come

Page 25: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTA/XYZ P2: ABCc01 JWBK380/von der Lieth September 21, 2009 15:31 Printer Name: Yet to Come

Section 1:Introduction

Page 26: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTA/XYZ P2: ABCc01 JWBK380/von der Lieth September 21, 2009 15:31 Printer Name: Yet to Come

Page 27: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTA/XYZ P2: ABCc01 JWBK380/von der Lieth September 21, 2009 15:31 Printer Name: Yet to Come

1Glycobiology, Glycomics and(Bio)Informatics

Claus-Wilhelm von der LiethFormerly at the Central Spectroscopic Unit, Deutsches Krebsforschungszentrum(German Cancer Research Center), 69120 Heidelberg, Germany

1.1 The Role of Carbohydrates in Life Sciences Research

Despite their nearly complete neglect in databases and ‘traditional’ bioinformatics projects,carbohydrates are the most abundant and structurally diverse biopolymers formed in nature.Historically, the chemistry, biochemistry, and biology of carbohydrates were very prominentareas of research over a long period of time during the beginning and the middle of the lastcentury. However, during the initial phase of the development of molecular biology, focusingon DNA, RNA, and proteins, studies of carbohydrates lagged far behind. Among the mainreasons for this were the inherent structural complexity of carbohydrates, the difficulty ineasily determining their structure, the fact that their biosynthesis cannot be directly predictedfrom the DNA template, and that no methods are available to amplify complex carbohydratesequences. The more recent development of a variety of new and highly sensitive analyticaltools for exploring the structures of oligosaccharides and for producing larger amountsof pure complex carbohydrates has opened up a new frontier in molecular biology. Theterm glycobiology, which was introduced in the late 1980s [1], reflects the coming togetherof the traditional disciplines of carbohydrate chemistry and biochemistry, with modernunderstanding of the cellular and molecular biology of complex carbohydrates, whichare often named glycans in this context. The more recently introduced term “glycomics”[2] describes an integrated systems approach to study structure–function relationships ofcomplex carbohydrates – the glycome – produced by an organism such as human or mouse.The glycome can be described as the glycan complement of the cell or tissue as expressedby a genome at a certain time and location. It includes all types of glycoconjugates:glycoproteins, proteoglycans, glycolipids, peptidoglycans, lipopolysaccharides, and so on.The aim of glycomics projects is to create a cell-by-cell catalog of glycosyltransferase(GT) expression and detected glycan structures using high-throughput techniques such asDNA glycogene chips, glycan microarray screening and mass spectrometric (MS) glycanprofiling, combined with efficient bioinformatics tools.

Until recently, the role of complex carbohydrates to function as carriers and/or mediatorsof biological information was a widely neglected and unexplored area in science. However,

Bioinformatics for Glycobiology and Glycomics: an Introduction Edited by Claus-Wilhelm von der Lieth, Thomas Lutteke and Martin FrankC© 2009 John Wiley & Sons, Ltd

3

Page 28: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTA/XYZ P2: ABCc01 JWBK380/von der Lieth September 21, 2009 15:31 Printer Name: Yet to Come

4 Introduction

with the awareness that the human genome encodes for a significantly smaller number ofgenes than was estimated from genomes of lower organisms such as yeast [3], it becameobvious that each gene can be used in a variety of different ways depending on how itis regulated. Consequently, the study of post-translational protein modifications, whichcan alter the functions of proteins, came increasingly into scientific focus. Since then,with glycosylation being the most complex and most frequently occurring co- and post-translational modification, glycobiology research has attracted increasing attention.

About 70% of all sequences deposited in the SWISS-PROT [4] protein sequence data-bank include the potential N-glycosylation consensus sequence Asn–X–Ser/Thr (whereX can be any amino acid except proline) and thus may be glycoproteins. However, it iswell known that not all potential sites are actually glycosylated. Based on an analysis ofwell-annotated and characterized glycoproteins in SWISS-PROT, it was concluded thatmore than half of all proteins are glycosylated [5, 6]. However, this number should be re-garded as a very crude estimation since this study was hampered by the paucity of reliable,experimentally determined, and carefully assigned glycosylation sites.

The glycans are exposed on the surface of biomolecules and cells. They form flexible,branched structures that can extend 30 A or further into the solvent. With a molecular weightof up to 3 kDa each, the oligosaccharide groups of mammalian glycoproteins frequentlymake up a sizable proportion of the mass of a glycoprotein and can cover a large fractionof its surface. The carbohydrate moiety of “proteins” may amount to a few percent ofthe molecular weight, but can be as much as 90% in some cases. O-Linked mucin-typeglycoproteins are usually large (more than 200 kDa) with attached O-glycan chains at a highdensity. As many as one in three amino acids may be glycosylated and 50–80% of the totalmass is due to carbohydrates [7]. An analysis of the available three-dimensional structuresof glycoproteins contained in the PDB [8] revealed that the glycan and the protein parts ofglycoproteins behave like semi-independent moieties. This behavior has several importantbiological consequences:

� N-Glycans can be modified without appreciable effects on the protein. Every N-linkedglycan is subject to extensive modifications. This allows cells to fine-tune the biophys-ical and biological properties of glycoproteins and to generate the microheterogeneity[9] that is so characteristic of glycoproteins.

� The semi-independent nature of glycans also allows cell types and cells in differentstages of differentiation and transformation to imprint on their glycoprotein pooltheir own specific biochemical characteristics, and thus give their exposed surface a“corporate identity.”

� This “corporate identity” [10] exposed on their surface makes cells recognizable toother cells in a multicellular environment. It allows self-recognition and provides acentral theme in development, differentiation, physiology, and disease.

1.2 Glycogenes, Glycoenzymes and Glycan Biosynthesis

The biosynthesis of carbohydrates attached to proteins or to lipids – called glycoconjugates –is fundamentally different to the expression of proteins. Whereas the enzymes requiredfor the translation of the genetic information into a polypeptide chain in the ribosomeare always the same for all proteins and amino acids, the subsequent glycosylation is a

Page 29: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTA/XYZ P2: ABCc01 JWBK380/von der Lieth September 21, 2009 15:31 Printer Name: Yet to Come

1 Glycobiology, Glycomics and (Bio)Informatics 5

non-template-driven process where dozens of different enzymes are involved in the synthe-sis of the sugar chains attached to proteins or lipids. Depending on which of these enzymesare expressed in the cell that synthesizes a glycoprotein, various different glycan chainscan be attached to the protein or lipid. Glycoproteins generally exist as populations ofglycosylated variants – called glycoforms – of a single polypeptide [11, 12]. Although thesame glycosylation machinery is available to all proteins in a given cell, most glycopro-teins emerge with a characteristic glycosylation pattern and heterogeneous populations ofglycans at each glycosylation site.

Glucose and fructose are the major carbon and energy sources for organisms as diverseas yeast and human beings (see, e.g., [7]: Monosaccharide Metabolism chapter). Organismscan derive the other monosaccharides needed for glycoconjugate synthesis from these majorsuppliers. It is important to appreciate that not all of the biosynthetic pathways are equallyactive in all types of cells.

The biosynthesis of oligosaccharides is primarily determined by sequentially acting en-zymes, the glycosyltransferases (GTs), which assemble monosaccharides into linear andbranched sugar chains. For this purpose, the monosaccharides must be either imported intothe cell or derived from other sugars within the cell. However, a common factor is that allglycoconjugate syntheses require activated sugar nucleotide donors. It has long been knownthat a nucleotide triphosphate such as uridine triphosphate (UTP) reacts with a glycosyl-1-P to form a high-energy donor sugar nucleotide that can participate in glycoconjugatesynthesis. Once the sugar nucleotides have been synthesized in the cytosol (or, in the caseof CMP-Neu5Ac, in the cell nucleus), they are topologically translocated, since most gly-cosylation occurs in the endoplasmic reticulum (ER) and Golgi apparatus. As the negativecharge of the sugar nucleotides prevents them from simply diffusing across membranes intothese compartments, eukaryotic cells have devised no-energy-requiring sugar nucleotidetransporters that deliver sugar nucleotides into the lumen of these organelles [7].

1.2.1 Biosynthetic Pathways

In eukaryotes, more than 10 biosynthetic pathways that link glycans to proteins and lipids[13, 14] are known. The KEGG PATHWAY resource [15, 16] – a collection of pathwaymaps representing current biochemical knowledge of the molecular interaction and reactionnetworks – has encoded 18 pathways for the biosynthesis of complex carbohydrates andtheir metabolism (see Figure 1.1), and 20 pathways for metabolism where carbohydrates areinvolved. More than 200 enzymes are involved in the biosynthesis of carbohydrate structuresfound on proteins and lipids. More than 30 different enzymes may participate directly inthe synthesis of a single glycan. One of the best-characterized pathways is the biosynthesisof complex oligosaccharides that are subsequently attached to a protein through the side-chain nitrogen atom of the amino acid aspagarine (Asn) to give glycoproteins [10, 17, 18](described in Section 8.1 in Chapter 8). Glycosylation of proteins occurs in all eukaryotesand in many archaea but only exceptionally in bacteria.

O-Linked glycosylation, where carbohydrates are attached to serine (Ser) and threonine(Thr), takes place post-translationally in the Golgi apparatus. The monosaccharides areadded one by one in a stepwise series of reactions (Figure 1.2). This is in contrast to theN-linked glycosylation pathway where a preformed oligosaccharide is transferred en blocto Asn. A second important difference is that there are no known consensus sequence

Page 30: Bioinformatics for Glycobiology and Glycomics: an Introduction · Bioinformatics for glycobiology and glycomics : an introduction / edited by Claus-Wilhelm von der Lieth, Thomas L¨utteke,

P1: OTA/XYZ P2: ABCc01 JWBK380/von der Lieth September 21, 2009 15:31 Printer Name: Yet to Come

6 Introduction

GLYCAN BIOSYNTHESISAND METABOLISM

Glycosylphosphatidyl-inositol(GPI)-anchor

biosysthesis

O-Glycansbiosynthesis

Chondroitin / Heparansulfate biosynthesis

Glycosaminoglycandegradation

Keratansulfatebiosynthesis

Lipopolysaccharidebiosynthesis

Peptidoglycan biosynthesis

Blood group glycolipidbiosynthesis-neo-lactoseries

Ganglioside biosynthesis

01170 6/16/04

Globosidemetabolism

Blood groupglycolipidbiosynthesis-lactoseries

Glyco-sphingolipidmetabolism

N-Glycandegradation

N-Glycansbiosynthesis

Figure 1.1 Illustration of the pathways for the biosynthesis of complex carbohydrates and theirmetabolism encoded in KEGG PATHWAY [15, 16] available at: www.genome.jp/kegg/pathway/map/map01170.html.