188
STATISTICAL TECHNIQUES FOR BIOLOGICAL MOTIF DISCOVERY A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy by Niranjan Nagarajan January 2007

STATISTICAL TECHNIQUES FOR BIOLOGICAL MOTIF DISCOVERYniranjan/papers/NagarajanThesis07.pdf · STATISTICAL TECHNIQUES FOR BIOLOGICAL MOTIF DISCOVERY Niranjan Nagarajan, Ph.D. Cornell

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

STATISTICAL TECHNIQUES FOR BIOLOGICAL

MOTIF DISCOVERY

A Dissertation

Presented to the Faculty of the Graduate School

of Cornell University

in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

by

Niranjan Nagarajan

January 2007

c© 2007 Niranjan Nagarajan

ALL RIGHTS RESERVED

STATISTICAL TECHNIQUES FOR BIOLOGICAL MOTIF DISCOVERY

Niranjan Nagarajan, Ph.D.

Cornell University 2007

In recent years, the various genome sequencing projects and computational and

experimental efforts to find genes have provided us with a wealth of sequence

information in protein and DNA databases. A large portion of this sequence data

is however yet to be characterized. Experimental efforts and manual curation have

tried to keep up with the flood of data, but it has become increasingly clear that

reliable computational methods are required to fill in the gap. In addition to its

value in furthering research in basic biology, improved computational tools for

annotating Proteomes and Genomes serve as an important first step in realizing

the biomedical promise of whole-cell modelling and systems biology.

In this dissertation we discuss statistical and algorithmic techniques for two

important areas in the field of biological sequence analysis. We begin by discussing

our work on improving a class of motif finding tools that are widely used to discover

regulatory signals in DNA. This work is based on new ideas in computational

statistics that provide us with efficient and accurate tools for the analysis of motif

significance. These tools make it feasible to incorporate a statistical score in motif

finding algorithms and we show experimentally that this new approach can give

rise to significantly more sensitive motif finders.

In the rest of this dissertation we discuss a new machine learning based ap-

proach for predicting conserved functional and structural units (or domains) in

proteins. Finding domains in proteins is an important step for the classification

and study of proteins and their role in interaction networks. Our proposed frame-

work learns an expert definition of protein domains (to accurately model this con-

cept) while avoiding the heuristic rules prevelant in earlier methods. Results from

experiments on a large set of protein sequences validate the improved accuracy

and coverage of our approach.

BIOGRAPHICAL SKETCH

Niranjan Nagarajan was born on November 1st 1978 in Jakarta, Indonesia. His

early school years were spent in South Town School, New Delhi, followed by three

memorable years in Kathmandu, Nepal. Niranjan did his 10th class CBSE exami-

nations in Vidya Mandir (Adayar) in Chennai and his International Baccalaureate

examinations in the International School of Paris. He then attended Ohio Wes-

leyan University and graduated summa cum laude in May 2000 with a Bachelor

of Arts in Mathematics and Computer Science. In August of 2000, Niranjan en-

rolled in the Ph.D. program in the Department of Computer Science at Cornell

University. He received a Ph.D. in Computer Science in January of 2007.

iii

This work is dedicated to Appa and Amma.

iv

ACKNOWLEDGEMENTS

My life and research at Cornell and its conclusion in the form of this dissertation

are indebted to several people. First and foremost, this research would not have

been possible without my advisor Dr. Uri Keich. I thank him for introducing me

to this area of research, showing me the ropes and being patient when I fell of it.

It is through him that over time I have learnt to be more critical about my own

ideas and be suspicious when surprising results pop up. In my research, I hope to

continue emulating his ability to be clear, concise and to the point and have his

distaste for “science fiction”.

I would also like to express my gratitude to Dr. Golan Yona for mentoring me

in the early years of my Ph.D. and directing my research on protein domains. In

addition, Dr. Jon Kleinberg and Dr. Ron Elber were gracious enough to be on my

committee and provided valuable suggestions for my research and this dissertation.

Dr. Eva Tardos and Dr. Joe Halpern played a crucial role in helping me get through

graduate school and I cannot thank them enough.

Cornell University and the Department of Computer Science formed the perfect

setting for my doctoral work. I am grateful to all the professors here who imparted

their knowledge to me in and out of class. My only regret is that I didn’t spend

more of time taking courses and interacting with the faculty here. I would not

have been in Cornell if not for Dr. Alan Zaring and Dr. Jeffrey Nunemacher at

Ohio Wesleyan University. Thank you for being such wonderful teachers. I am

still amazed at how fortunate I have been.

My collaborators, Patrick and Neil, deserve my thanks for generously shar-

ing their ideas and code with me. My current and past officemates, in particu-

lar Biswanath Panda, Abhinandan Das and Venugopalan Ramasubramanian were

v

great sounding boards and it was fun to discuss research and trivia with them.

Cornell would not have been the wonderful experience that it was without the nu-

merous friends that I have been fortunate to have here. Bhargavi, Panda, Yasho,

Chandu and Manish, thank you for your delightful company on numerous occa-

sions and for feeding me so often! Pankaj and Meenakshi, Chandra, Vidya and

Karthick, see you on the badminton courts soon. Also, my respects to the spring

lane gang (Leonid, Allie, Eric, Bjoern, Greg and Elliot) and my housemates Dan

and Ivan.

I was fortunate enough to have family in Ithaca. Thank you Simone and Pedro

(and pi and yasho) for adopting me and advising, comforting and nourishing me.

My parents made me what I am. I can never thank you enough for all that you

have done. I can only hope that I bring you some pride and joy.

Finally, I should acknowledge my partner in crime (to whom any comments or

objections to this dissertation should be addressed) Ishani Mukherjee. She shares

equal responsibility for my life at Cornell and possibily all the credit.

vi

TABLE OF CONTENTS

Biographical Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

1 Introduction 11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Dissertation organization and contributions . . . . . . . . . . . . . . 3

Bibliography 6

2 Robust methods for multinomial goodness-of-fit test 82.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.2 Motivation from bioinformatics . . . . . . . . . . . . . . . . . . . . 112.3 Baglivo et al.’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . 122.4 Error control using shifted-FFT . . . . . . . . . . . . . . . . . . . . 14

2.4.1 Choosing θ . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.5 Improving the runtime . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.5.1 Analysis of the convolution error . . . . . . . . . . . . . . . 242.5.2 An illustration of the bagFFT algorithm . . . . . . . . . . . 28

2.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312.6.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312.6.2 Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.7 Recovering the entire pmf and its application . . . . . . . . . . . . . 372.8 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . 43

Bibliography 44

3 Computing the significance of an ungapped local alignment 463.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.2.1 The Shifted-FFT (sFFT) algorithm . . . . . . . . . . . . . . 523.2.2 The Cyclic Shifted-FFT (csFFT) algorithm . . . . . . . . . 573.2.3 Boosting θ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633.3.1 Runtime characterization . . . . . . . . . . . . . . . . . . . . 633.3.2 Error analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 643.3.3 Stitching LD and csFFT . . . . . . . . . . . . . . . . . . . . 65

3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

Bibliography 68

vii

4 Refining motif finders with E-value calculations 694.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 694.2 Efficiently computing E-values . . . . . . . . . . . . . . . . . . . . . 714.3 Optimizing for E-values - Conspv . . . . . . . . . . . . . . . . . . . 744.4 E-value based improvements of the Gibbs sampler . . . . . . . . . . 774.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 824.6 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

Bibliography 89

5 Sequence-based domain prediction 915.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

5.1.1 Related studies . . . . . . . . . . . . . . . . . . . . . . . . . 925.1.1.1 Methods based on similarity search . . . . . . . . . 935.1.1.2 Methods based on expert knowledge . . . . . . . . 955.1.1.3 Methods that use predicted 3D information . . . . 955.1.1.4 Methods based on multiple alignments . . . . . . . 965.1.1.5 Other methods . . . . . . . . . . . . . . . . . . . . 96

5.1.2 The current status . . . . . . . . . . . . . . . . . . . . . . . 975.1.2.1 Methodology . . . . . . . . . . . . . . . . . . . . . 975.1.2.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . 97

5.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 995.2.1 The data sets . . . . . . . . . . . . . . . . . . . . . . . . . . 99

5.2.1.1 The query data set . . . . . . . . . . . . . . . . . . 995.2.1.2 Alignments . . . . . . . . . . . . . . . . . . . . . . 1015.2.1.3 Domain definitions . . . . . . . . . . . . . . . . . . 102

5.2.2 The domain-information of an alignment column . . . . . . . 1035.2.2.1 Conservation measures . . . . . . . . . . . . . . . . 1045.2.2.2 Consistency and correlation measures . . . . . . . . 1065.2.2.3 Measures of structural flexibility . . . . . . . . . . 1095.2.2.4 Residue type based measures . . . . . . . . . . . . 1125.2.2.5 Predicted secondary structure information . . . . . 1135.2.2.6 Intron-exon data . . . . . . . . . . . . . . . . . . . 114

5.2.3 Score refinement and normalization . . . . . . . . . . . . . . 1155.2.4 Maximizing the information content of scores . . . . . . . . 1155.2.5 The learning model . . . . . . . . . . . . . . . . . . . . . . . 1205.2.6 Hypothesis evaluation . . . . . . . . . . . . . . . . . . . . . 125

5.2.6.1 The domain-generator model . . . . . . . . . . . . 1285.2.6.2 The simple model . . . . . . . . . . . . . . . . . . . 1365.2.6.3 The independence index . . . . . . . . . . . . . . . 136

5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1375.3.1 Inclusion of structural information in prediction . . . . . . . 1445.3.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1465.3.3 Suggested novel partitions . . . . . . . . . . . . . . . . . . . 149

viii

5.3.4 Analysis of errors . . . . . . . . . . . . . . . . . . . . . . . . 1525.3.5 Consistency of domain predictions . . . . . . . . . . . . . . . 1565.3.6 The distribution of domain lengths . . . . . . . . . . . . . . 160

5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1615.5 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

Bibliography 165

6 Future Work 1696.1 Extensions to the bagFFT algorithm . . . . . . . . . . . . . . . . . 1696.2 Alignment significance in alternate models . . . . . . . . . . . . . . 1696.3 Improvements to Conspv and Gibbspv . . . . . . . . . . . . . . . . 1706.4 Improved protein domain delineation . . . . . . . . . . . . . . . . . 170

Bibliography 173

ix

LIST OF TABLES

2.1 Range of parameters for testing bagFFT . . . . . . . . . . . . . . 312.2 Runtime in seconds for various parameter values . . . . . . . . . . 372.3 Range of parameters for testing bag-sFFT . . . . . . . . . . . . . . 42

3.1 Range of test parameters . . . . . . . . . . . . . . . . . . . . . . . 653.2 Runtime comparison between csFFT and LD . . . . . . . . . . . . 66

4.1 The advantage of using memo-sFFT . . . . . . . . . . . . . . . . . 754.2 Tests on sequences of varied length . . . . . . . . . . . . . . . . . . 764.3 Comparison of CONSENSUS based motif finders . . . . . . . . . . 794.4 Comparison of Gibbs samplers . . . . . . . . . . . . . . . . . . . . 804.5 Comparison of Gibbspv with MEME and GLAM . . . . . . . . . . 834.6 The profiles used in our experiments . . . . . . . . . . . . . . . . . 864.7 The parameter sets used in our experiments . . . . . . . . . . . . . 874.8 Experiment details . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

5.1 Jensen-Shannon (JS) divergence for top ten scores . . . . . . . . . 1165.2 Most correlated score pairs. . . . . . . . . . . . . . . . . . . . . . . 1195.3 Most anti-correlated score pairs. . . . . . . . . . . . . . . . . . . . 1205.4 Ranges for parameters in network training . . . . . . . . . . . . . . 1225.5 A sample from the set of selected networks . . . . . . . . . . . . . 1265.6 Performance evaluation results for the two post-processing methods 1425.7 Performance evaluation results for sequence based methods . . . . 1435.8 Global consistency results . . . . . . . . . . . . . . . . . . . . . . . 1455.9 Performance evaluation results when structural information is used 1465.10 Global consistency results when structural information is used . . . 1465.11 Performance evaluation results using domain definitions in CATH . 157

x

LIST OF FIGURES

2.1 Inaccuracy of the χ2 approximation. . . . . . . . . . . . . . . . . . 92.2 The destructive effects of numerical roundoff errors in FFT . . . . 152.3 How can an exponential shift help? . . . . . . . . . . . . . . . . . . 172.4 Numerical errors in estimating pθ with θ = 1 . . . . . . . . . . . . 202.5 The bagFFT algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 292.6 Graphical illustration of the bagFFT algorithm . . . . . . . . . . . 302.7 Accuracy of bagFFT as a function of N, K and Q . . . . . . . . . . 342.8 Practicality of (2.20) for estimating the error in pθ . . . . . . . . . 352.9 Runtime comparison of bagFFT and Hirji’s algorithm . . . . . . . 362.10 Runtime comparison of bagFFT and Hirji (without pruning) . . . . 392.11 The bag-sFFT algorithm . . . . . . . . . . . . . . . . . . . . . . . 40

3.1 A comparison of MEME E-values to CONSENSUS E-values . . . . 493.2 Graph of log10(LD(s)/NC(s)) . . . . . . . . . . . . . . . . . . . . 503.3 Runtime comparison for versions of Hirji’s algorithm and bagFFT . 553.4 Runtime comparison of shifted-Hirji and bagFFT for A = 20 . . . . 563.5 The sFFT algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 573.6 The shifted pmf is 0 for much of the valid values of s . . . . . . . . 583.7 The csFFT algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 623.8 Average values of L′ versus L and N . . . . . . . . . . . . . . . . . 64

4.1 The memo-sFFT algorithm . . . . . . . . . . . . . . . . . . . . . . 734.2 Performance of CONSENSUS based motif finders . . . . . . . . . . 784.3 Performance of Gibbs samplers . . . . . . . . . . . . . . . . . . . . 81

5.1 Overview of our domain prediction system . . . . . . . . . . . . . . 1005.2 Domain and boundary positions . . . . . . . . . . . . . . . . . . . 1035.3 Consistency measures . . . . . . . . . . . . . . . . . . . . . . . . . 1055.4 Correlation measures . . . . . . . . . . . . . . . . . . . . . . . . . . 1075.5 Predicted contact profile . . . . . . . . . . . . . . . . . . . . . . . . 1115.6 Distributions of scores . . . . . . . . . . . . . . . . . . . . . . . . . 1185.7 Performance of networks as a function of the features used . . . . . 1235.8 Performance of networks as a function of various parameters . . . . 1245.9 Selecting candidate transition points . . . . . . . . . . . . . . . . . 1275.10 Distributions of domain lengths . . . . . . . . . . . . . . . . . . . . 1305.11 Distributions of number of domains . . . . . . . . . . . . . . . . . . 1325.12 Coverage vs. Selectivity for final set of networks . . . . . . . . . . . 1405.13 Coverage vs. Selectivity tradeoff while varying the threshold . . . . 1415.14 Domain definitions for 1qpb . . . . . . . . . . . . . . . . . . . . . . 1485.15 Domain definitions for 1gh8 . . . . . . . . . . . . . . . . . . . . . . 1505.16 Domain definitions for 1acc . . . . . . . . . . . . . . . . . . . . . . 1515.17 Domain definitions for 1ffv . . . . . . . . . . . . . . . . . . . . . . 1535.18 Domain definitions for 1qkc . . . . . . . . . . . . . . . . . . . . . . 155

xi

5.19 Domain definitions for 1i6v . . . . . . . . . . . . . . . . . . . . . . 1565.20 Domain definitions for 1ekx . . . . . . . . . . . . . . . . . . . . . . 159

xii

CHAPTER 1

INTRODUCTION

1.1 Motivation

Computational Biology and the increasing availability of an array of high through-

put data sources are transforming research in the field of Biology, with corre-

sponding benefits in the Biomedical Sciences. From a discipline that was largely

focussed on small-scale experiments and detailed understanding of specific pro-

cesses and pathways there has been an increasing move to understand and model

whole cells and organisms [Glocker et al., 2006, Hood et al., 2004, Kitano, 2002,

Weston and Hood, 2004]. Computational tools for sequence analysis have played

a vital and ubiquitous role in furthering this process. From characterizing protein

features, functional sites and interaction partners to deciphering the meaning of

a range of functional DNA elements, these tools are essential to a more complete

understanding of the cellular machinery.

The need for better sequence analysis tools has acquired greater urgency with

the availability of a wealth of sequence data from various genome sequencing

projects [Lander et al., 2001, Waterston et al., 2002, CSAC, 2005]. In addition,

the availibility of multiple genomes has allowed for studies across genomes and the

integration of evolutionary models into genome analysis tools [Siepel et al., 2005,

Siddharthan et al., 2005]. Recent studies have shown that while gene-finding is an

important goal in understanding genomic DNA a substantial fraction of functional

DNA lies outside of genes [Levy et al., 2001]. The identification and characteriza-

tion of these non-coding elements is an active area of research where computational

and statistical tools play a significant role [Bailey et al., 2006, Lenhard et al., 2003].

1

2

A popular class of such tools use a “motif finding” formulation to identify func-

tionally important sequences [Tompa et al., 2005]. The input in this situation is

a set of sequences that belong to the same functional family. The goal then is to

identify subsequences that are significantly over-represented and well-conserved.

Motif finding tools have numerous applications such as the search for transcrip-

tion initiation sites, RNA cleavage sites and alternative splicing signals as well as

the study of protein motifs [Lawrence et al., 1993]. Motif finders are however most

commonly designed to identify the binding sites near genes where a class of pro-

teins called transcription factors (TFs) bind and regulate gene expression. Finding

these sites is a slow and expensive process experimentally and motif finders are

popular as a fast and cheap surrogate. Due to its wide applicability there has

been a strong interest in improving motif finding tools. An integral part of these

efforts has been the design of measures for evaluating the significance of discovered

motifs in order to discriminate them from random artifacts of the data. In this

dissertation, we study methods for statistical evaluation of motifs and present new

algorithmic techniques to accurately and efficiently evaluate their significance (see

Chapters 2 and 3). While traditional motif finders use the statistical evaluation

only as a post-processing step, we show that its optimization as a motif-score can

give rise to significantly improved motif finders (see Chapter 4).

While motif finding tools have been used in the study of protein families a

more fundamental sequence analysis step in studying proteins is to identify pro-

tein domains. Protein domains are loosely defined as being subsequences that

are evolutionarily conserved, can fold independently and have a definite func-

tion. Domains are typically considered the building blocks of protein design

and function and their identification plays an important role in the classifica-

3

tion and study of proteins. In recent years, there has been increasing interest

in the use of domain architecture to explain high-throughput protein interac-

tion data and make new computational predictions [Gomez and Rzhetsky, 2002,

Betel et al., 2004, Wojcik and Schchter, 2001, Deng et al., 2002, Pitre et al., 2006].

In this dissertation we present a new approach for domain delineation and provide

experimental evidence to show that it can improve significantly on existing meth-

ods (see Chapter 5).

1.2 Dissertation organization and contributions

While the post-genomics era has created many new opportunities for understand-

ing and modelling whole cells and organisms, improved tools for characterizing

sequences and identifying sequence features serve as an important link to attain

this goal. In this dissertation we focus on two important sequence-motif identifi-

cation problems in computational biology and present tools that further the state

of the art in this area. We begin by studying the motif finding problem and in

Chapter 2 we present an algorithm (bag-sFFT) for efficiently computing the sig-

nificance (p-value) of motifs. This algorithm is two-staged, where the first stage is

based on an algorithm (bagFFT) for computing the significance of goodness-of-fits

tests for multinomial data, which is an important problem in itself. We show that

bagFFT is asymptotically the fastest known exact algorithm for this problem and

performs well in experiments as well. In Chapter 3, we extend the Fast Fourier

Transform based techniques introduced in Chapter 2 to improve the second stage

of bag-sFFT. We also show an improvement to an existing algorithm that is more

efficient for DNA motifs in practice than bagFFT. The resulting algorithm (csFFT)

presents a fast and reliable solution for computing the significance of DNA motifs.

4

This is an important tool in practice because as is shown in this chapter, existing

approximations used in popular motif finders such as MEME and CONSENSUS

can produce very inaccurate results.

In Chapter 4 we explore new applications for the techniques described in Chap-

ter 3 by proposing a paradigm shift in how existing motif finders work. Motif find-

ers such as CONSENSUS and MEME that are classified as profile-model based,

typically optimize the entropy score to efficiently search for motifs. The p-value or

more specifically a related quantity, the E-value, is then used to assign significance

to the optimal reported motifs. This raises the question whether optimizing for E-

values instead of entropy could improve the finders’ ability to detect weak motifs.

We first present an efficient algorithm to accurately compute multiple E-values

which changes the nature of the above question from a hypothetical to a practical

one. Using CONSENSUS- and Gibbs-based finders that incorporate this method

we demonstrate on synthetic data that the answer to our question is positive. In

particular, E-value based optimizations show significant improvement over existing

tools for finding motifs of unknown width.

We switch to the domain prediction problem in Chapter 5 and we describe a

novel method for detecting the domain structure of a protein solely from sequence

information. In contrast to existing methods, our method avoids heuristic rules

and instead uses machine learning techniques to learn an expert definition of pro-

tein domains. Our experimental results, using the domain definitions in SCOP

and CATH, show that this approach improves significantly over the best methods

available, even some of the semi-manual ones, while being fully automatic. We

believe that sequence-based predictions from methods such as ours can also be

used to complement and verify domain partitions based on structural data.

5

Finally, in Chapter 6 we discuss some open questions related to this dissertation

and suggest areas for future work. The main tools and algorithms described in

this thesis are available at http://www.cs.cornell.edu/˜niranjan. Note that for the

convenience of the reader we provide bibliographies at the end of each chapter.

6

BIBLIOGRAPHY

[Bailey et al., 2006] Bailey,P.J., Klos,J.M., Andersson,E., Karln,M., Kllstrm,M.,Ponjavic,J., Muhr,J., Lenhard,B., Sandelin,A. and Ericson,J. (2006) A globalgenomic transcriptional code associated with CNS-expressed genes. Exp CellRes, 312 (16), 3108–3119.

[Betel et al., 2004] Betel,D., Isserlin,R. and Hogue,C.W.V. (2004) Analysis of do-main correlations in yeast protein complexes. Bioinformatics, 20 Suppl 1,I55–I62.

[Deng et al., 2002] Deng,M., Mehta,S., Sun,F. and Chen,T. (2002) Inferringdomain-domain interactions from protein-protein interactions. Genome Res,12 (10), 1540–1548.

[Glocker et al., 2006] Glocker,M.O., Guthke,R., Kekow,J. and Thiesen,H.J. (2006)Rheumatoid arthritis, a complex multifactorial disease: on the way toward in-dividualized medicine. Med Res Rev, 26 (1), 63–87.

[Gomez and Rzhetsky, 2002] Gomez,S.M. and Rzhetsky,A. (2002) Towards theprediction of complete protein–protein interaction networks. In Pacific Sym-posium in Biocomputing pp. 413–424.

[Hood et al., 2004] Hood,L., Heath,J.R., Phelps,M.E. and Lin,B. (2004) Systemsbiology and new technologies enable predictive and preventative medicine. Sci-ence, 306 (5696), 640–643.

[Kitano, 2002] Kitano,H. (2002) Computational systems biology. Nature, 420(6912), 206–210.

[Lander et al., 2001] Lander,E.S. et al. (2001) Initial sequencing and analysis ofthe human genome. Nature, 409 (6822), 860–921.

[Lawrence et al., 1993] Lawrence,C.E., Altschul,S.F., Boguski,M.S., Liu,J.S.,Neuwald,A.F. and Wootton,J.C. (1993) Detecting subtle sequence signals: aGibbs sampling strategy for multiple alignment. Science, 262 (5131), 208–214.

[Lenhard et al., 2003] Lenhard,B., Sandelin,A., Mendoza,L., Engstrm,P., Jare-borg,N. and Wasserman,W.W. (2003) Identification of conserved regulatory el-ements by comparative genome analysis. J Biol, 2 (2), 13.

[Levy et al., 2001] Levy,S., Hannenhalli,S. and Workman,C. (2001) Enrichment ofregulatory signals in conserved non-coding genomic sequence. Bioinformatics,17 (10), 871–877.

7

[Pitre et al., 2006] Pitre,S., Dehne,F., Chan,A., Cheetham,J., Duong,A., Emili,A.,Gebbia,M., Greenblatt,J., Jessulat,M., Krogan,N., Luo,X. and Golshani,A.(2006) PIPE: a protein-protein interaction prediction engine based on the re-occurring short polypeptide sequences between known interacting protein pairs.BMC Bioinformatics, 7, 365.

[CSAC, 2005] Chimpanzee Sequencing and Analysis Consortium (2005) Initial se-quence of the chimpanzee genome and comparison with the human genome.Nature, 437 (7055), 69–87.

[Siddharthan et al., 2005] Siddharthan,R., Siggia,E.D. and van Nimwegen,E.(2005) PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny.PLoS Comput Biol, 1 (7), e67.

[Siepel et al., 2005] Siepel,A., Bejerano,G., Pedersen,J.S., Hinrichs,A.S., Hou,M.,Rosenbloom,K., Clawson,H., Spieth,J., Hillier,L.W., Richards,S., Wein-stock,G.M., Wilson,R.K., Gibbs,R.A., Kent,W.J., Miller,W. and Haussler,D.(2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeastgenomes. Genome Res, 15 (8), 1034–1050.

[Tompa et al., 2005] Tompa,M. et al. (2005) Assessing computational tools for thediscovery of transcription factor binding sites. Nat Biotechnol, 23 (1), 137–144.

[Waterston et al., 2002] Waterston,R.H. et al. (2002) Initial sequencing and com-parative analysis of the mouse genome. Nature, 420 (6915), 520–562.

[Weston and Hood, 2004] Weston,A.D. and Hood,L. (2004) Systems biology, pro-teomics, and the future of health care: toward predictive, preventative, andpersonalized medicine. J Proteome Res, 3 (2), 179–196.

[Wojcik and Schchter, 2001] Wojcik,J. and Schchter,V. (2001) Protein-protein in-teraction map inference using interacting domain profile pairs. Bioinformatics,17 Suppl 1, S296–S305.

CHAPTER 2

ROBUST METHODS FOR MULTINOMIAL GOODNESS-OF-FIT

TEST

2.1 Introduction

In a review paper Cressie and Read write [Cressie and Read, 1989]: “The im-

portance of developing useful and appropriate statistical methods for analyzing

discrete multivariate data is apparent from the enormous amount of attention this

subject has commanded in the literature over the last thirty years. Central to

these discussions has been Pearson’s X2 statistic and the loglikelihood ratio statis-

tic G2”. The methods for computing the p-value of the G2 statistic can be broadly

divided into two categories: asymptotic approximations and exact methods. In this

chapter, we introduce a new exact method (bagFFT) for estimating the p-value of

the G2 statistic although our method might be applicable to Pearson’s X2 as well.

We then show how it can be combined with an existing algorithm [Keich, 2005] to

get an improved algorithm (bag-sFFT) for evaluating the significance of sequence

motifs. We begin by presenting the problem from a statistical perspective and

present the motivation from bioinformatics in Section 2.2.

The classical approach to estimating the p-value of G2 relies on the asymptotic

result PH0(G2 ≥ s) −−−→

N→∞χ2K−1(s) , where H0 is the null multinomial distribu-

tion specified by π = (π1, . . . , πK) and N is the multinomial sample size (e.g.

[Cressie and Read, 1989]). While the χ2 approximation is a valid asymptotic re-

sult, in applications where N is fixed as s approaches the tail of the distribution

the approximation can be quite poor. For example, as can be seen in Figure 2.1 for

K = 20, πi = i/210 and N = 100, the χ2 approximation can be more than a factor

8

9

0 100 200 300 400 500 600−300

−250

−200

−150

−100

−50

0

50

s

log

p−va

lue

LLR vs χ2 (N=100, K=20, πk=k/210)

LLRχ2

Figure 2.1: Inaccuracy of the χ2 approximation.

of 1010 off in the tail of the distribution. The χ2 approximation can be improved

by adding second order terms [Cressie and Read, 1989]. However, the resulting

values [Siotani and Fujikoshi, 1984][Cressie and Read, 1984] are only accurate to

O(N−3/2) which is often significantly bigger than the p-values that have to be es-

timated. In particular, this is typically the case for applications in bioinformatics,

some of which are mentioned in Section 2.2 below.

Baglivo et al. addressed this problem by suggesting a lattice based exact method

[Baglivo et al., 1992]. The idea is to estimate the p-value directly from the under-

lying multinomial distribution. More specifically, as explained in Section 2.3 below,

they compute the characteristic function of a latticed version of G2 in O(QKN2)

time where Q is the size of the lattice that controls the accuracy of the estimated

p-value. Later Hirji proposed an algorithm [Hirji, 1997] based on Mehta and Pa-

10

tel’s network algorithm [Mehta and Patel, 1983]. While Hirji’s, essentially branch

and bound, algorithm can be implemented without resorting to a lattice (see also

[Bejerano et al., 2004]), only in the latticed case is it guaranteed to have polyno-

mial complexity. In that case Hirji’s algorithm shares the same worst-case runtime

as that of Baglivo et al.’s: O(QKN 2). As far as the space overhead, Baglivo et

al.’s algorithm is better with a space overhead of O(Q+N) as opposed to O(QN)

for Hirji’s. However, Baglivo et al.’s algorithm is prone to large numerical errors

(see Section 2.3) which make it unusable for computing the small p-values that

are of most interest in this discussion, while Hirji’s algorithm can be shown to be

numerically stable. In this chapter, we present a new algorithm that yields the

exact (up to lattice errors) p-value of G2 in O(QKN logN) time and O(Q + N)

space.

After a brief overview of applications in bioinformatics we present Baglivo et

al.’s algorithm in Section 2.3 and (in Section 2.4) modify it using the shifted-FFT

technique [Keich, 2005] to control the numerical errors in the algorithm. This re-

sults in a O(QKN 2) algorithm that can accurately compute small p-values. We

also present a mathematical analysis of the total roundoff error in computing the

p-value. We then use shifted-FFT based convolutions to reduce the runtime to

O(QKN logN) and obtain the bagFFT algorithm in Section 2.5 (with error anal-

ysis). Both variants share Baglivo et al.’s space requirement of O(Q + N). In

Section 2.6 we present experimental results that demonstrate the reliability and

improved efficiency of bagFFT in comparison to Hirji’s algorithm. Finally, in Sec-

tion 2.7 we discuss ways to combine it with the work in [Keich, 2005] to compute

the significance of sequence motifs.

11

2.2 Motivation from bioinformatics

In the analysis of multiple-sequence alignments one often evaluates the significance

of an alignment column using a goodness-of-fit test between the column’s residue

distribution and a given background distribution. Commonly one computes the

information content, or generalized loglikelihood ratio of the column defined as

I = G2/2 =∑K

j=1 nj lognj/N

πj, where K is the size of the alphabet, nj is the number

of occurrences of the jth letter in the column, πj is its background frequency and

N is the depth of the column. The p-value of I serves as a uniform measure of

the column’s significance that can be compared between columns of varying sizes

and background distributions. For example, in [Rahmann, 2003] p-values are used

to design a conservation index for alignment columns. These indices can then

be used to compare and visualize (as sequence logos) the conservation profile for

alignments of different sizes. In [Sadreyev and Grishin, 2004], a similarly defined

p-value is suggested as a means to detect misaligned regions in sequence alignments

(among other applications). Extending this technique to distributions of residue-

pairs, [Bejerano et al., 2004] discusses its use for detecting correlated columns that

serve as signatures for binding sites and RNA base pairs.

Motif finding programs such as MEME [Bailey and Elkan, 1994] and CONSEN-

SUS [Hertz and Stormo, 1999] seek statistically significant (ungapped) alignments

in the input sequences. These alignments are presumably the instances of the

putative motif. The alignments are scored with IA =∑L

j=1 Ij, where Ij is the

information content of the jth of the alignment’s L columns [Stormo, 2000]. In

order to compare two alignments of varying L and N (number of sequences in the

alignment) one assumes the columns are i.i.d. and replaces IA with its p-value. One

way to compute this p-value is by convoluting the pmf of the individual Ij whose

12

computation is the subject of this chapter. This application is studied further in

Section 2.7.

Typically, in the applications mentioned here, there are several competing

columns (or sets of columns) that need to be evaluated for their significance. The

twofold consequences are: firstly, to compensate for a huge number of multiple

hypotheses these algorithms need to reliably compute extremely small p-values

corresponding to the significant and putatively more interesting columns. Sec-

ondly, the runtime efficiency of the algorithm is very important. Indeed, these

explain the interest the bioinformatics community has shown in exact methods

for computing the p-value of I, or equivalently, of G2 [Hertz and Stormo, 1999,

Bejerano et al., 2004, Rahmann, 2003].

2.3 Baglivo et al.’s algorithm

We begin with a formal introduction of the problem. Given a null multinomial

distribution π = (π1, . . . , πK) and a random sample n = (n1, . . . , nK) of size N =

∑nk let s = I(n) =

∑k nk log nk

Nπkand note that I = G2/2. The p-value of s

is PH0(I ≥ s). Since for a given N and an arbitrary π the range of I can have

an order of NK−1 distinct points, strictly exact methods are typically impractical

even for moderately sized K. Thus, we are forced to map the range of I to a lattice

and compute exact probabilities on the lattice. Explicitly, let πmin = min{πk} and

let Imax = N log π−1min be the maximal entropy value. Given the size of the lattice,

Q, let δ = δ(Q) = Imax/(Q − 1) be the mesh size. Our surrogate for I(n) is the

integer valued

IQ(n) =∑

k

round[δ−1nk log(nk/(Nπk))

],

13

so that δIQ ≈ I 1. Let pQ be the pmf of IQ then, clearly, for any s,

L(s) =∑

j≥ds/δ+K/2epQ(j) ≤ P (I ≥ s) ≤

j≥bs/δ−K/2cpQ(j) = U(s), (2.1)

which allows us to estimate the p-value and control the lattice error via adjustments

to Q.

Baglivo et al. compute pQ by inverting its characteristic function. More pre-

cisely, they compute the DFT (Discrete Fourier Transform [Press et al., 1992]) of

pQ, Φ, where:

Φ(l) := (DpQ)(l) =

Q−1∑

j=0

pQ(j)eiω0jl for l = 0, 1, . . . , Q− 1,

where ω0 = 2π/Q and recover pQ by applying D−1, the inverse-DFT:

pQ(j) = (D−1Φ)(j) =1

Q

Q−1∑

l=0

Φ(l)e−iω0lj.

In order for this procedure to be useful, one must be able to efficiently compute

Φ, keeping in mind that pQ is unknown. Baglivo et al. accomplish this based on

the observation that a multinomial distribution can be represented as the distribu-

tion of independent Poisson random variables conditioned on their sum being N .

Explicitly, let λk = Nπk, i.e., the mean number of occurrences of the k-th letter or

category, let sk(nk) = round[δ−1nk log(nk/λk)], i.e., the contribution to IQ from the

k-th letter appearing nk times, let pk denote the Poisson λ = λk pmf, and let X+ be

a Poisson λ = N random variable. Finally, let ψk,l(n) =∑

y

∏kj=1 pj(yj)e

ilω0sj(yj),

where the sum extends over all y ∈ ZK+ for which

∑kj=1 yj = n. It is not difficult

to check that ψk,l satisfy the following recursive formula:

ψk,l(n) =n∑

x=0

pk(x)eilω0sk(x)ψk−1,l(n− x), (2.2)

1Note that due to rounding effects IQ might be negative but we shall ignorethis as the arithmetic we perform is modulo Q. The concerned reader can redefineδ = Imax/(Q− 1− dK/2e).

14

and since as explained in [Baglivo et al., 1992],

Φ(l) =1

P (X+ = N)

x∈ZK+

:Pxj=N

K∏

j=1

pj(xj)eiω0lsj(xj) =

ψK,l(N)

P (X+ = N),

Φ(l) can be recovered from (2.2) in O(KN 2) steps for each l separately2 and hence

O(QKN2) overall. Finally, using an FFT3 [Press et al., 1992] implementation of

DFT Baglivo et al. get an estimate of pQ in an additional O(Q logQ) steps (which

should typically be absorbed in the first term4).

The algorithm as it is has a serious limitation in that the numerical errors

introduced by the DFTs can readily dominate the calculations. An example of this

phenomena can be observed with the parameter values, Q = 8192, N = 100, K =

20 and πk = 1/20, where this algorithm yields a negative p-value (= −2.18 · 10−14)

for P (I ≥ 60).

2.4 Error control using shifted-FFT

The numerical instability of Baglivo et al.’s algorithm is illustrated by the following

simple example. Let p(x) = e−x for x ∈ {0, 1, . . . , 255} and q = D−1(Dp), where D

and D−1 are the machine implemented FFT and inverse FFT operators. As can be

seen in Figure 2.2, while theoretically equal, in practice the two differ significantly.

The analogous situation in Baglivo et al.’s algorithm is that p = pQ(j) and we

compute q = D−1(Dbagp) where Dbag is the recursive DFT computation in the

algorithm. As the example suggests we cannot compute the smaller entries of pQ

2To see this, note that we need to compute ψk,l(n) for k ∈ [1..K] and n ∈ [0..N ]and each computation takes O(N) time.

3Fast Fourier Transform, a fast algorithm for DFT with a runtime of O(n logn)for a vector of size n.

4As observed in [Rahmann, 2003], in order to preserve the bound on the distancebetween pQ and our real subject of interest, pI (the pmf of I), Q has to grow linearlywith N .

15

0 50 100 150 200 250 300−120

−100

−80

−60

−40

−20

0

x

log 10

f(x)

Numerical errors in FFT

f(x) = p(x)f(x) = q(x)

Figure 2.2: The destructive effects of numerical roundoff errors in FFT

This figure illustrates the potentially overwhelming effects of numerical errors inapplications of FFT. p(x) = e−x for x ∈ {0, 1, . . . , 255} is compared with what

should (in the absence of numerical errors) be the same quantity: q = D−1(Dp),

where D and D−1 are the machine implemented FFT and inverse FFT operators,respectively. This dramatic difference all but vanishes when we apply the correctexponential shift prior to applying D.

using Baglivo et al.’s algorithm. This limitation arises from the fact that we work

with fixed-precision arithmetic on computers and therefore can only approximate

the real arithmetic that we wish to do. For example, in the double precision

arithmetic that we usually work with ˜1 + 10−16 = 1 and therefore performing a

DFT on pQ discards the information about the entries of pQ that are smaller than

10−16 ·max{pQ}.

One possible remedy for the numerical errors is to move to higher precision

16

arithmetic. However, this only postpones the problem to smaller p-values and

also significantly slows down the runtime of the algorithm (due to a typical lack

of hardware support for higher precision arithmetic). A better solution (in the

spirit of [Keich, 2005]) is suggested by the following extension to the example

above: let pθ(x) = p(x)eθx and qθ = D−1(Dpθ

). For θ = 1, we experimentally get

maxx | log qθ(x)e−θx

p(x)| < 1.78 ·10−15, showing that using this mode of computation we

can “recover” p (from qθ(x)e−θx) almost up to machine precision (ε0 ≈ 2.2 · 10−16).

This solution is based on the intuition that by applying the correct exponential

shift we “flatten” p so that the smaller values are not overwhelmed by the largest

ones during the computation of the Fourier transforms.

Needless to say this exponential shift will not always work. However, the fol-

lowing bounds due to Hoeffding [Hoeffding, 1965] suggest that for fixed N and K,

“to first order”, the p-values and hence pQ behave like e−s:

c0N−(K−1)/2 exp(−s) ≤ P (I ≥ s) ≤

(N +K − 1

K − 1

)exp(−s), (2.3)

where c0 is a positive absolute constant which can be taken to be 1/2. This suggests

that we would benefit from applying an exponential shift to pQ. Let

pθ(j) =pQ(j)eθδj

M(θ),

where M(θ) = EeθδIQ , the MGF (moment generating function) of δIQ, serves to

normalize pθ and avoid numerical under/overflows. Figure 2.3 shows an example

of the flattening effect such a shift has on pQ. As can be seen in the figure, the

range of values in pθ is much smaller and therefore the largest values of pθ are no

longer expected to overwhelm the smaller values (in the context of FFTs).

The discussion so far implicitly assumed that we know pQ which of course we

do not. However, we can essentially compute Φθ = Dpθ by incorporating the shift

17

0 50 100 150 200 250 300 350 400 450−180

−160

−140

−120

−100

−80

−60

−40

−20

0Original pmf (N=100, K=10, πk=k/55, Q=16384)

log 10

pQ

s0 50 100 150 200 250 300 350 400 450

−8

−6

−4

−2

0

2

4

s

log 10

Shifted pmf for θ = 1 (N=100, K=10, πk=k/55, Q=16384)

Figure 2.3: How can an exponential shift help?

The graph on the left is that of log10 pQ(s/δ) where N = 100, K = 10, πk = k/55and Q = 16384. The graph on the right is of the log of the shifted pmf,log10 pθ(s/δ) where θ = 1. Note the dramatic flattening effect of the exponentialshift (keeping in mind the fact that the scales of the y-axes are different).

into the recursive computation in (2.2). We do so by replacing the Poisson pmfs

pk with a shifted version

pk,θ(x) = pk(x)eθδsk(x), (2.4)

and obtain the following recursion for ψk,l,θ(n) = ψk,l(n)eθδsk(n), the shifted version

of ψk,l(n):

ψk,l,θ(n) =n∑

x=0

pk,θ(x)eilω0sk(x)ψk−1,l,θ(n− x). (2.5)

where ψ1,l,θ(n) = p1,θ(n). This allows us to compute ψK,l,θ(N), an estimate of

ψK,l,θ(N)5 in the same O(KN 2) steps for each fixed l. We then compute an estimate

pQ of pQ based on

pQ(j) =(D−1ψK,•,θ(N)

)(j)

e−θδj

P (X+ = N). (2.6)

An additional feature of this approach (that is absent in Baglivo et al.’s algorithm)

5Due to unavoidable roundoff errors we cannot expect to recover ψK,l,θ(N) pre-cisely

18

is that we can directly estimate log pQ(j), in cases where computing pQ(j) would

create an underflow. This could be important in applications where very small

p-values are common, e.g. in a typical motif finding situation. Finally, the p-value

is estimated using (2.1) (or the logarithmic version of the summation).

Remark 2.1. In practice, to avoid under/overflows we normalize pk,θ(x) in (2.4) so

that it sums up to 1. These constants are then compensated for when computing

pQ in (2.6). We ignore these factors throughout this study.

Remark 2.2. For computing a single p-value, we can avoid inverting Φθ by noting

that for n ∈ [0..Q− 1],

j≥npQ(j) =

j≥npθ(j)e

−θδjM(θ) =∑

j≥ne−θδjM(θ)

Q−1∑

l=0

Φθ(l)e−iω0lj

Q

=M(θ)

Q

Q−1∑

l=0

Φθ(l)ez(l)n − ez(l)Q

1− ez(l)

where z(l) = −(θδ+iω0l). This version of the algorithm is however only marginally

more efficient while having a relative error that is more than 10 times worse, in

some cases, than that for the presented algorithm (and so we do not pursue it

further here).

2.4.1 Choosing θ

An obvious choice for θ that is suggested by inequality (2.3) is to set it to 1

and indeed it typically yields the widest range of js for which pQ(j) provides a

“decent” approximation of pQ(j). However, for computing the p-value of a given

s there would typically be a better choice of θ. As we can see from Figure 2.4,

a shift of θ = 1 could lead to the loss of values in the tail of the pmf during the

DFT computation. If we wish to compute a p-value in this region then setting

19

θ = 1 would perform poorly. Intuitively, we wish to choose a θ to ensure that

the entries of pθ around bs/δc are not overwhelmed during the DFT computation.

The solution we adopt is borrowed from the theory of large-deviation: choose θ so

as to “center” pθ about s, or more precisely, set the mean of pθ to s. This can be

accomplished by setting θ to [Dembo and Zeitouni, 1998]:

θs = argminθ [−θs + logM(θ)] (2.7)

The minimization procedure in (2.7) can be carried out numerically6 by using, for

example, Brent’s method [Press et al., 1992]. The runtime cost for this is essen-

tially a constant factor of the cost of evaluating M(θ). The latter can be reliably

estimated in O(KN 2) steps by replacing eilω0sk(x) with eθδsk(x) in (2.2). The runtime

of the shifted-FFT based algorithm is therefore still O(QKN 2).

The following claim allows us to gauge the magnitude of the numerical errors

introduced by our algorithm.

Claim 2.1.

|pQ(j)−pQ(j)| ≤ C(KN logN+logQ)ε0e−θδj+logM(θ) +CN logN pQ(j)ε0 +O(ε2

0),

(2.8)

where C is a small universal constant and ε0 is the machine precision.

Remark 2.3. The O(ε20) term refers to all higher order terms in an ε0 power series

expansion of the accumulated roundoff error. The bound in (2.8) is only useful

when it is � pQ(j). In that case the propagation of roundoff errors is essentially

linear and therefore the O(ε20) term is negligible compared to the O(ε0) term (e.g.

[Tasche and Zeuner, 2001]).

6A crude approximation of θs would typically suffice for our purpose.

20

0 200 400 600 800 1000 1200 1400−25

−20

−15

−10

−5

0

5

10

15

20

25

s

log 10

f(s/

δ)

Perils of using θ = 1 (N=200, K=40, πk=k/820, Q=16384)

f=pθ

f=D−1(Dpθ)

Figure 2.4: Numerical errors in estimating pθ with θ = 1

Remark 2.4. The Claim only holds in the absence of intermediate over/under-flows.

In practice remark 2.1 guarantees this condition but in any case such events are

detectable.

Proof of Claim 2.1. In order to prove this claim we use the following lemma that

can be readily derived from the results in [Keich, 2005] (see lemmas 1-3, (20) &

(21)). For α ∈ C we denote by α its machine estimator and define eα = α − α.

For α, β ∈ C, we define

eα+β =˜α + β − (α + β),

and similarly for eαβ.

21

Lemma 2.1. If |eα| < cα|α|ε0 and |eβ| < cβ|β|ε0, then

|eα+β| ≤ (max{cα, cβ}+ 1)(|α|+ |β|)ε0

|eαβ| ≤ (cα + cβ + 5)(|αβ|)ε0.

Let,

pk,l,θ(x) = pk,θ(x)eilω0sk(x). (2.9)

Then from the fact that |eiφ| = 1, we have,

|pk,l,θ(n)− pk,l,θ(n)| ≤ CN logN |pk,l,θ(n)|ε0 = CN logNpk,0,θ(n)ε0.

Combining this bound with the previous lemma one can use (2.5) to prove by

induction on k that

|ψk,l,θ(n)− ψk,l,θ(n)| ≤ (CkN logN)ψk,0,θ(n)ε0.

In particular, with ρ(l) = ψK,l,θ(N)

|ρ(l)− ρ(l)| ≤ (CKN logN)M(θ)P (X+ = N)ε0. (2.10)

Let D be the m-dimensional DFT operator. It is easy to show that for v ∈ Cm

‖Dv‖∞ ≤ ‖v‖1 , ‖D−1v‖∞ ≤1

m‖v‖1 ≤ ‖v‖∞. (2.11)

Let D denote the FFT machine implementation of the DFT. Then, there exists a

constant CF < 5 such that [Tasche and Zeuner, 2001]:

‖(D−1 −D−1)v‖2 ≤1√mCF log2 (m)ε0‖v‖2 +O(ε2

0)

‖(D −D)v‖2 ≤√mCF log2 (m)ε0‖v‖2 +O(ε2

0).

(2.12)

Then from ‖v‖∞ ≤ ‖v‖2 ≤√m‖v‖∞, we have,

‖(D−1 − D−1)v‖∞ ≤ CF log2 (m)ε0‖v‖∞ +O(ε20). (2.13)

22

Using the triangle inequality, (2.10), (2.11), and (2.13) we get

‖D−1ρ− D−1ρ‖∞ ≤ ‖D−1(ρ− ρ)‖∞ + ‖(D−1 − D−1)ρ‖∞

≤ ‖ρ− ρ‖∞ + CF log2Qε0‖ρ‖∞ +O(ε20)

≤ C(KN logN + log2Q)M(θ)P (X+ = N)ε0 +O(ε20).

Claim 2.1 now follows from multiplying by e−θδj/P (X+ = N) (cf. (2.6)).

Summing over j in (2.8) yields an upper bound on the error in computing the

p-value. Note that if s = δj, the upper bound in Claim 2.1 is essentially minimized

for θ = θs (as the relative error term of CN logN pQ(j)ε0 is typically negligible),

thus giving us another justification for our choice of θ. In Section 2.6 we show that

this choice of θ works well in practice and that the theoretical error bounds there

can be applied fruitfully.

2.5 Improving the runtime

The algorithm presented in Section 2.4 is free of the large numerical errors that

plague Baglivo et al.’s algorithm while preserving its time and space complexity.

Observing that (2.5) can be expressed as a convolution between the vectors pk,l,θ

and ψk−1,l,θ allows us to improve the runtime of our algorithm as follows. A naively

implemented convolution requires O(N 2) steps and hence that factor in the overall

runtime complexity. Alternatively, we can carry out an FFT-based convolution,

based on the identity (D(u ∗ v)) (j) = (Du)(j)(Dv)(j)7 [Press et al., 1992], where

u∗v is the convolution of the vectors u and v. This would only require O(N logN)

steps8, cutting down the overall complexity to O(QKN logN + Q logQ + KN 2).

7A special case of the identity for the characteristic function of a sum of twoindependent random variables (X and Y , say): φX+Y = φXφY .

8As the FFT of a vector of size N can be computed in O(N logN) time.

23

Typically the last two terms are small compared to the runtime cost of the main

loop thus giving us a O(QKN logN) algorithm.

Simply implementing (2.5) using an FFT-based convolution, however, reintro-

duces the severe numerical errors that were corrected for in Section 2.4. The fol-

lowing example illustrates the situation: for θ = 1 one can verify that |pk,l,θ(x)| ≈

e−Nπk+x/√

2πx. Computing Dpk,l,θ therefore faces essentially the same problem

as the one demonstrated in our example of FFT applied to e−x. Once again the

solution we propose is to apply an appropriate exponential shift: for a vector u let

uα(x) = u(x)e−αx and let u� v denote the pointwise product of u and v, then one

can readily show that

(u ∗ v)α ≡ D−1 [Duα �Dvα] .

Based on the last identity we replace the shifted convolution of (2.5) with its

doubly shifted Fourier version:

ψk,l,θ,θ2(n) = D−1 [Dpk,l,θ,θ2 �Dψk−1,l,θ,θ2] (n) n = 0, 1, . . . , N − 1, (2.14)

where

pk,l,θ,θ2(x) = pk,l,θ(x)e−θ2x ψk,l,θ,θ2(x) = ψk,l,θ(x)e

−θ2x.

One final detail is that pk,l,θ,θ2 and ψk−1,l,θ,θ2 are padded with zeros (otherwise, you

get cyclic convolution [Press et al., 1992]) so that they are now vectors of length

N2 = 2N − 1 and D = DN2.

Analogous to (2.6) we recover pQ from

pQ(j) =(D−1ψK,•,θ,θ2(N)

)(j)

e−θδj+θ2N

P (X+ = N), (2.15)

and here D−1 = D−1Q .

24

2.5.1 Analysis of the convolution error

The main result of this section is the one stated in Corollary 2.1 which we show

using the following technical lemmas and claims.

Lemma 2.2. Suppose that for x, y, x, y ∈ RN

‖x− x‖2 ≤ mxε0 ‖y − y‖2 ≤ myε0.

Choose N2 ≥ 2N − 1 and with D = DN2, the corresponding DFT operator, let

τ = Dx ν = Dy τ = Dx ν = Dy,

where the vectors are padded with zeros. Then,

‖D−1 ˜τ � ν −D−1τ � ν‖2 ≤ ε0

[(2CF log2N2 + 5)‖x‖1‖y‖2+

CF log2N2‖y‖1‖x‖2 + ‖y‖1mx + ‖x‖1my

]+O(ε2

0),

where (u� v)(k) = u(k)v(k), � is the machine computation of �.

Remark 2.5. The remarks following Claim 2.1 are valid here as well.

Proof of Lemma 2.2. Let D be the m-dimensional DFT. The discrete Parseval

identity (e.g. [Press et al., 1992]) states that for v ∈ Cm,

‖D−1v‖2 =1√m‖v‖2 , ‖Dv‖2 =

√m‖v‖2. (2.16)

The following bound on the norm of a convolution is used repeatedly below. Let

u, v ∈ Cm, then it follows from (2.11) and (2.16) (with � being the pointwise

product operator) that

1√N2

‖Du�Dv‖2 ≤1√N2

‖Du‖2‖Dv‖∞ ≤1√N2

‖Du‖2‖v‖1 = ‖u‖2‖v‖1. (2.17)

25

We are now ready to prove the lemma.

‖D−1 ˜τ � ν − x ∗ y‖2 ≤ ‖D−1(τ � ν − ˜τ � ν)‖2︸ ︷︷ ︸α

+ ‖(D−1 −D−1) ˜τ � ν‖2︸ ︷︷ ︸β

. (2.18)

From (2.11)-(2.17) and lemma 2.1 we have

α =1√N2

‖τ � ν − ˜τ � ν‖2

≤ 1√N2

‖(τ − τ )� ν‖2︸ ︷︷ ︸

α1

+1√N2

‖τ � (ν − ν)‖2︸ ︷︷ ︸

α2

+1√N2

‖τ � ν − ˜τ � ν‖2︸ ︷︷ ︸

α3

,

where

α1 ≤1√N2

‖τ − τ‖2‖y‖1

≤[

1√N2

‖D(x− x)‖2 +1√N2

‖(D − D)x‖2]‖y‖1

≤ ε0

[mx + CF log2N2‖x‖2

]‖y‖1 +O(ε2

0).

α2 ≤1√N2

‖ν − ν‖2‖τ‖∞

≤[ε0 (my + CF log2N2‖y‖2) +O(ε2

0)] [‖(D −D)x‖∞ + ‖Dx‖∞

]

≤ ε0 [my + CF log2N2‖y‖2] ‖x‖1 +O(ε20).

α3 ≤ 5ε01√N2

‖τ � ν‖2

≤ 5ε01√N2

‖ν‖2‖τ‖∞

≤ 5ε0

[1√N2

‖(D −D)y‖2 +1√N2

‖Dy‖2]

[‖x‖1 +O(ε0)]

≤ 5ε0‖x‖1‖y‖2 +O(ε20).

Finally, by the same type of arguments

β ≤ ε0CF log2N2‖˜τ � ν‖2 ≤ ε0CF log2N2‖x‖1‖y‖2 +O(ε20).

26

The proof is completed by collecting all the terms into (2.18) and noting that

the differences between ‖y‖ and ‖y‖ (or ‖x‖ and ‖x‖) are absorbed in the O(ε20)

term.

Let

∆pk = ∆p

k(θ, θ2) = maxl‖pk,l,θ,θ2 − pk,l,θ,θ2‖2/ε0,

and inductively define ∆ψk as: ∆ψ

1 = ∆p1 and for k = 2, . . . , K

∆ψk = ‖pµ‖1

((2CF log2N2 + 5)‖ψµ‖2 + ∆ψ

k−1

)+ ‖ψµ‖1(CF log2N2‖pµ‖2 + ∆p

k),

(2.19)

where µ stands for (k, 0, θ, θ2), and CF is a constant < 5 that controls the l2 norm

of the numerical errors introduced by the FFT [Tasche and Zeuner, 2001] (see also

(2.12) below).

We now establish the following error bound on ψk,1,θ,θ2:

Claim 2.2. Let ψk,l,θ,θ2 denote the estimate of ψk,l,θ,θ2 computed by (2.14). For

k = 1, . . . , K:

maxl‖ψk,l,θ,θ2 − ψk,l,θ,θ2‖2 ≤ ∆ψ

k ε0 +O(ε20).

Remarks. • ∆pk depends on the particular implementation of computing pk,l,θ,θ2.

The only delicate point is when computing exp(ilω0sk(x)) one should com-

pute lsk(x) mod Q, otherwise ∆pk will grow linearly with Q. With this in

mind, a naive computation of the other factors would result in

∆pk ≤ CN logN‖pk,0,θ,θ2‖2,

where C is some small constant.

• Analogous to Remark 2.1, we normalize pk,l,θ,θ2 so that ‖pk,l,θ,θ2‖1 = 1 in

practice. Again, we ignore this practical step in the discussion below.

27

• The remarks following Claim 2.1 are valid here as well.

Proof of Claim 2.2. By induction on k. For k = 1 the claim follows immediately

from the definitions. Let x = pk,l,θ,θ2 and y = ψk−1,l,θ,θ2. Clearly, ‖x− x‖2 ≤ ∆pkε0

and by the inductive hypothesis ‖y−y‖2 ≤ ∆ψk−1ε0+O(ε2). The claim follows from

Lemma 2.2, ‖pk,l,θ,θ2‖i = ‖pk,0,θ,θ2‖i and ‖ψk,l,θ,θ2‖i ≤ ‖ψk,0,θ,θ2‖i, for i = 1, 2.

Using the last claim, we establish the following error bound on pQ:

Claim 2.3. Let pQ be computed according to (2.15). Also, let

∆pθ=

[∆ψKe

θ2N

M(θ)P (X+ = N)+ CF log2Q

].

Then,

|pQ(j)− pQ(j)| ≤ ε0∆pθe−θδj+logM(θ) + CN logN pQ(j)ε0 +O(ε2

0) (2.20)

where C is a small universal constant.

Remarks. • The remarks following Claim 2.1 are valid here as well.

• When computing ∆ψK from (2.19) we plug in pµ and ψµ for pµ and ψµ re-

spectively. Still, (2.20) holds since by Claim 2.2 and its following remark the

difference can be absorbed in the O(ε20) term.

Proof of Claim 2.3. For l = 0, . . . , Q− 1 let ρ(l) = ψK,l,θ,θ2(N). Then

‖D−1ρ− D−1ρ‖2 ≤ ‖D−1(ρ− ρ)‖2 + ‖(D−1 − D−1)ρ‖2

≤ 1√Q‖ρ− ρ‖2 +

1√QCF log2Qε0‖ρ‖2 +O(ε2

0)

≤ ‖ρ− ρ‖∞ + CF log2Qε0‖ρ‖∞ +O(ε20)

≤[∆ψK + CF log2QM(θ)P (X+ = N)e−θ2N

]ε0 +O(ε2

0),

(2.21)

28

where the last inequality follows from Claim 2.2 and

|ψK,l,θ,θ2(N)| ≤ ψK,0,θ,θ2(N) = M(θ)P (X+ = N)e−θ2N .

The proof now follows from

pQ(j) = (D−1ρ)(j)e−θδj+θ2N

P (X+ = N).

Corollary 2.1. For n ∈ [0..Q− 1] and a small universal constant C,

|∑

j≥npQ(j)−

j≥npQ(j)| ≤

j≥n

[∆pθ

e−θδj+logM(θ) + (Q+ CN logN)pQ(j)]ε0 +O(ε2

0)

Remarks. • The proof of the corollary follows from Claim 2.3 and Lemma 2.1.

• The relative error term,∑

j≥n(Q+CN logN)pQ(j)ε0, tends to be negligible

in practice.

• A tighter bound can be obtained here from analysis of the l2-norm of the

error (using (2.15) and Claim 2.2) and from more careful summations.

Minimizing the bound in (2.20) for j = ds/δe is in principle a two-dimensional

optimization problem. However, we found that first solving (2.7) for θ and then

choosing θ2 that minimizes ∆ψKe

θ2N works sufficiently well in practice. We present

a summary of the bagFFT algorithm in Figure 2.5. As the θ2 computation adds

only O(KN logN) to the runtime, the runtime of this algorithm is O(QKN logN).

2.5.2 An illustration of the bagFFT algorithm

In Figure 2.6 we present an illustrated example for the core of the bagFFT algo-

rithm, i.e. computing ψk,l,θ,θ2 starting from the pk’s. The parameters used in this

example are N = 100, K = 10, π = {(10−i)/55|i ∈ [0..9]}, s = 100 and Q = 16384.

29

Given N,K, π,Q and s, bagFFT:

1. Computes θ by numerically solving (2.7) (using Brent’s method).

2. Computes θ2 by minimizing ∆ψKe

θ2N computed from (2.19) (using

Brent’s method).

3. For each l = 0, 1 . . . , Q− 1, recursively computes ψK,l,θ,θ2(N) using (2.14).

4. Using FFT computes u = D−1ψK,•,θ,θ2(N).

5. Computes pQ(j) = u(j) e−θδj+θ2N

P (X+=N), or log pQ(j) = log u(j)

P (X+=N)− θδj + θ2N .

6. Returns L(s) and U(s), computed using (2.1), as the lower and upper

bounds on the p-value respectively (or the logarithmic version of the sum).

7. Computes the theoretical error bounds, EL(s) and EU(s) for L(s) and U(s)

respectively, using Corollary 2.1.

Figure 2.5: The bagFFT algorithm

30

0 50 100 150−100

−80

−60

−40

−20

0Plot of pk for k = 1

x

log(

p k(x))

0 50 100 150−150

−100

−50

0Plot of pk for k = 5

x

log(

p k(x))

0 50 100 150−400

−300

−200

−100

0Plot of pk for k = 10

x

log(

p k(x))

0 50 100 150−20

0

20

40

60

80Plot of pk, θ for k = 1

x

log(

p k, θ

(x))

0 50 100 150−20

0

20

40

60

80

100Plot of pk, θ for k = 5

x

log(

p k, θ

(x))

0 50 100 150−20

0

20

40

60

80

100Plot of pk, θ for k = 10

x

log(

p k, θ

(x))

0 50 100 150−26

−24

−22

−20

−18

Plot of pk, 0, θ, θ2 for k = 1

x

log(

p k, 0

, θ, θ

2(x))

0 50 100 150−20

−18

−16

−14

−12

−10

Plot of pk, 0, θ, θ2 for k = 5

x

log(

p k, 0

, θ, θ

2(x))

0 50 100 150−10

−8

−6

−4

−2

0

Plot of pk, 0, θ, θ2 for k = 10

xlo

g(p k,

0, θ

, θ2(x

))

0 50 100 150−40

−39

−38

−37

−36

−35

−34

n

log(

ψk,

0, θ

, θ2(n

))

Plot of ψk, 0, θ, θ2 for k = 2

FFTNaive

0 50 100 150−73

−72.5

−72

−71.5

−71

−70.5

n

log(

ψk,

0, θ

, θ2(n

))

Plot of ψk, 0, θ, θ2 for k = 5

FFTNaive

0 50 100 150−102

−100

−98

−96

−94

−92

−90

n

log(

ψk,

0, θ

, θ2(n

))

Plot of ψk, 0, θ, θ2 for k = 10

FFTNaive

Figure 2.6: Graphical illustration of the bagFFT algorithm

Computation using the pk’s shown in row 1 leads to the roundoff errors describedin Figure 2.2. So a shift with θ = 1 is applied to get the pk,θ’s shown on row 2.To aid FFT-convolutions using the pk,θ’s, they are shifted with θ2 = 1.05 to getthe pk,0,θ,θ2’s on row 3 (note the different scale from the previous row). These arenow convolved (using FFTs) to accurately recover the ψk,0,θ,θ2’s, as can be seenfrom row 4 (by comparison to the curves from naive convolution that overlapvery well). Note that corresponding FFT-convolutions with the pk,θ’s (withoutthe second shift) does not recover any of the entries of ψk,0,θ accurately (data notshown).

31

Table 2.1: Range of parameters for testing bagFFT

Parameter Values

K 4, 10, 20

N 50, 100, 200, 400

π Uniform, Sloped, Blocked

s i21∗ Imax i ∈ [1..20]

Uniform refers to the distribution where πk = 1/K, Sloped refers to the casewhere πk = k/(K ∗ (K + 1)/2), and Blocked refers to the case whereπk = 3/(4bK/4c) if k ≤ bK/4c and πk = 1/(4 ∗ (K − bK/4c)) otherwise.

2.6 Results

2.6.1 Accuracy

As a test of accuracy for bagFFT we compared its results to those from a lattice

version of Hirji’s algorithm (which can be proven to be numerically stable). The

range of parameters for the comparison is given in Table 2.1. The comparison

was done using C implementations and with double precision arithmetic. For the

set of 720 test cases defined by Table 2.1 and with Q set to 16384 we found that

bagFFT agreed with Hirji’s algorithm to more than 12 decimal places in all cases.

The same experiment was also repeated with values of s that are much closer to

Imax: an interval halving procedure on the range [( 2021∗ Imax)..Imax] was used to

get 8 values of s. The agreement was again to more than 12 decimal places. In

addition, in both these experiments the theoretical error bounds from Figure 2.5

guarantee nearly 6 decimal places of accuracy in all cases.

The set of parameters in Table 2.1 is restricted to small values of N and K and

32

one reason this is so is because these are the typical ranges that are of interest in

bioinformatics applications. However, there is also a practical reason, which is that

Hirji’s algorithm is quite slow for large values of N and K (and it also requires

a substantial amount of memory). For example, for N = 10, 000, K = 20 and

Q = 16384, we estimated that Hirji’s algorithm would take at least 40 hours while

bagFFT takes about 25 minutes (for optimized C implementations). Fortunately,

we can compute error bounds for bagFFT to confirm that the computed values

are accurate. To verify that bagFFT is useful even for large values of N and

K we conducted two sets of tests. In the first test we allowed N to vary over

{1000, 2000, 5000, 10000} where the other parameters vary as before. In this case,

the theoretical error bounds from (2.20) guarantee more than 4 decimal places of

accuracy in all cases. In the second test, we varied K over {50, 75, 100, 200} with

the other parameters varying as before. For this experiment, the guarantee is still

more than 3 decimal places for all the cases tested.

The behavior of the theoretical error bounds and the agreement of bagFFT

with Hirji’s algorithm, as a function of N , K and Q, is illustrated in Figure 2.7.

Here we define agreement with Hirji’s algorithm as

− log10(max(|LH(s)− L(s)|/LH(s), |UH(s)− U(s)|/UH(s)))

where LH and UH are the corresponding lattice bounds for the p-value reported by

Hirji’s algorithm. Correspondingly, the theoretical error guarantee is calculated as

− log10(max(EL(s)/|L(s)− EL(s)|, EU(s)/|U(s)− EU(s)|))

An important trend to note here is that the agreement with Hirji’s algorithm is

essentially constant with increasing Q. In the rest of the cases the trend is that

accuracy decreases roughly linearly as a function of logN , logK and logQ. The

33

results therefore indicate that both the error bounds and the agreement with Hirji’s

algorithm are relatively stable for increasing N , K or Q.

Besides serving to confirm the accuracy of computed p-values, the theoreti-

cal error bounds are also useful for identifying the regions of the pmf that are

accurately computed. An example of this can be seen in Figure 2.8. Here the

theoretical bounds, while being conservative by design, can still be used to recover

nearly 60% of the correct entries of pθ (where we want both theoretical and actual

relative error to be less than 10%).

2.6.2 Runtime

For runtime comparisons we implemented bagFFT and Hirji’s algorithm in C with

particular attention to optimizing the runtime of the programs. Based on our

experiments we observed that while Hirji’s algorithm is efficient for small values of

N , bagFFT is faster as N increases. In particular, for K = 20, bagFFT is faster

for N > 30. The asymptotic behavior of the algorithms can be clearly seen in

Figure 2.9 where we plot the runtime of the two algorithms with increasing N for

a fixed choice of the other parameter values (the graph is similar looking for other

choices of the parameter values as well).

In columns 1 and 2 of Table 2.2 we present the runtime of Hirji’s algorithm

and bagFFT for a set of parameter values that demonstrate the typical behavior

of the algorithms. As can be seen from lines 2 and 4, while the choice of π does

not affect the runtime of bagFFT it does affect the runtime of Hirji’s algorithm.

For Hirji’s algorithm, π = Uniform is the worst case and the runtime decreases for

other choices of π. Also, as can be seen from lines 2,3 and 5, as K increases, the

“crossover point” between the runtime curves for bagFFT and Hirji’s algorithm

34

0

2

4

6

8

10

12

14

16

3.5 4 4.5 5 5.5 6 6.5 7 7.5 8

Acc

urac

y (in

dec

imal

pla

ces)

log(N)

Accuracy of bagFFT with varying N

Agreement with HirjiTheoretical guarantee

(a) K = 10, π = Uniform,

Q = 16384 and N varies over

{50, 100, 200, 400, 1000, 2000}

0

2

4

6

8

10

12

14

16

1 1.5 2 2.5 3 3.5 4 4.5 5

Acc

urac

y (in

dec

imal

pla

ces)

log(K)

Accuracy of bagFFT with varying K

Agreement with HirjiTheoretical guarantee

(b) N = 200, π = Uniform,

Q = 16384 and K varies over

{4, 10, 20, 50, 100}

0

2

4

6

8

10

12

14

16

8 8.5 9 9.5 10 10.5 11 11.5

Acc

urac

y (in

dec

imal

pla

ces)

log(Q)

Accuracy of bagFFT with varying Q

Agreement with HirjiTheoretical guarantee

(c) K = 10, N = 200,

π = Uniform and Q varies over

{4096, 8192, 16384, 32768, 65536}

Figure 2.7: Accuracy of bagFFT as a function of N, K and Q

The values reported here are the minimum values for s in the range{ i

21∗ Imax|i ∈ [1..20]}.

35

0 50 100 150 200 250 300 350 400 450−25

−20

−15

−10

−5

0

log 10

f(s/δ

)lo

g 10f(s

/δ)

log 10

f(s/δ

)

s

log 10

f(s/δ

)

log 10

f(s/δ

)

Practicality of theoretical error bounds (N=100, K=10, πk = k/55, s = 390, Q=16384) Practicality of theoretical error bounds (N=400, K=10, πk=k/55, s=1527, Q=16384)

f = pθf = Experimental error in pθf = Theoretical error bound in pθ

Figure 2.8: Practicality of (2.20) for estimating the error in pθ

Note the plotted values for pθ are those computed using the bagFFT algorithm.The region where these values are much larger than the theoretical error bounddefines the entries of pθ which can be trusted in practice. As can be seen, thisapproach can be used to recover a large proportion of the reliable entries of pθ.

36

0

0.5

1

1.5

2

2.5

20 40 60 80 100 120 140 160

Runt

ime

(in se

cond

s)

N

Runtime comparison with varying N

bagFFTHirji

Figure 2.9: Runtime comparison of bagFFT and Hirji’s algorithm

The parameter values used in this comparison are K = 20, Q = 1024 andπj = j/(K ∗ (K + 1)/2). The runtimes reported are averaged over 10 evenlyspaced s values in the range [0..Imax]. Note that the discontinuities in the curvefor bagFFT are due to the fact that our implementation of FFT works witharrays whose sizes are powers of 2.

37

Table 2.2: Runtime in seconds for various parameter values

Parameters Hirji bagFFT Hirji (no pruning)

N = 50, K = 4, π = Uniform 0.006 0.022 0.01

N = 400, K = 4, π = Uniform 0.4 0.4 1.3

N = 1600, K = 4, π = Uniform 13.1 4.7 44.5

N = 400, K = 4, π = Sloped 0.3 0.4 1.7

N = 50, K = 20, π = Uniform 0.3 0.13 0.7

N = 400, K = 20, π = Uniform 7.4 2.7 77.9

N = 1600, K = 20, π = Uniform 4.5 · 103 110.2 > 1.9 · 104

Note that Hirji (no pruning) refers to the version of the algorithm described inSection 2.7. Here, Q is set to 1024 and the runtimes reported are averaged over svalues in the range { i

11∗ Imax|i ∈ [1..10]} (except for the last line where

Q = 16384 and s = 3000).

becomes smaller. In other words, bagFFT becomes more efficient sooner, with

respect to N , as K increases. Finally, lines 6 and 7 demonstrate the substantial

difference in runtime between Hirji’s algorithm and bagFFT as N and K become

large.

2.7 Recovering the entire pmf and its application

So far our goal was to compute a single p-value, however, we often need to evaluate

many different values of I. In such cases it would be better to compute the entire

pmf, pQ, in advance. Hirji’s algorithm can be modified to compute pQ in the same

O(QKN2) time it can take to compute a single p-value. The difference, however,

is that in the case of a single p-value O(QKN 2) is a worst case analysis and in

many cases the computation is significantly faster. These savings which apply only

38

for computing a single p-value are due to the pruning that any network algorithm

[Mehta and Patel, 1983] such as Hirji’s employs.

While bagFFT was designed for computing a single p-value, in practice it can

be easily adapted to reliably estimate pQ in its entirety. In some cases it already

does that: for example, for s = 100, N = 100, K = 4, πk = 1/4 and Q = 16384 we

get a reliable estimate for all the entries in pQ (with relative error < 10−9). In all

cases that we tried we could reliably recover the entire range of values of pQ using

as little as 2-3 different s values, or equivalently, θs: recall that each estimate has

an error bound, based on (2.20), which allows us to choose the estimate which has

better error guarantees. This approach is typically still significantly cheaper than

running Hirji’s algorithm, especially since without pruning the latter is significantly

slower than bagFFT (even for much smaller N) as demonstrated in Figure 2.10

and Table 2.2.

As mentioned in Section 2.2, an important application for recovering pQ in its

entirety is the computation of the p-value of a sum of entropy scores, IA =∑

j I(j),

from L independent columns of an alignment. The sFFT algorithm [Keich, 2005]

applies an exponential shift to pQ so that it can use FFT to compute the L-fold

convolution p∗LQ . In the original implementation of sFFT, pQ was computed using

naive enumeration. Here we present a modification to sFFT that uses bagFFT to

compute pQ.

As suggested above, typically, a few applications of bagFFT can be used to

recover all the entries of pQ accurately. However, this approach may expend too

much effort in recovering entries of pQ that do not contribute significantly to the

p-value for a particular score. Indeed, from [Keich, 2005] we know that the entries

of pQ that are most relevant to computing the p-value of IA = sA are centered

39

0

1

2

3

4

5

6

7

8

20 40 60 80 100 120 140 160

Runt

ime

(in se

cond

s)

N

Runtime comparison with varying N

bagFFTHirji without pruning

Figure 2.10: Runtime comparison of bagFFT and Hirji (without pruning)

The parameter values used in this comparison are the same as in Figure 2.9.

40

Given N,K, L, π,Q and sA, the algorithm:

1. Executes steps 1-4 of Figure 2.5 with s = sA/L

2. Computes qθ(j) =

u(j) eθ2N

P (X+=N)= pθ(j)M(θ) j = 0, . . . , Q− 1

0 j = Q, . . . , LQ− 1

.

3. For l = 0, 1, . . . , LQ− 1, computes y(l) = [(Dqθ)(l)]L, where D = DLQ.

4. Computes w = D−1y.

5. Computes p∗LQ (j) = w(j)e−θδj (or the logarithmic version).

6. Returns∑

j≥dsA/δ+LK/2e p∗LQ (j) and

∑j≥bsA/δ−LK/2c p

∗LQ (j) as the lower

and upper bounds respectively for the p-value (or the logarithmic version).

Figure 2.11: The bag-sFFT algorithm

about sA/L, suggesting the bag-sFFT algorithm summarized in Figure 2.11. The

runtime for this algorithm is O(QKN logN + LQ log(LQ)).

The following claim bounds the magnitude of the accumulated roundoff error

in our computation.

Claim 2.4.

|pQ∗L(j)− pQ∗L(j)| ≤ ε0

[L∆pθ

+ (L + 1)CF log(LQ)]e−θδj+L logM(θ) +

CpQ∗L(j)ε0 +O(ε20)

where C is a small universal constant and with ∆pθas in Claim 2.3.

Proof of Claim 2.4. By the same arguments as in Claim 2.2, with D = DLQ and

41

w = D−1y as in Figure 2.11,

‖w − w‖2 ≤ ‖D−1(y − y)‖2 + ‖(D−1 − D−1)y‖2

≤ 1√LQ‖y − y‖2 +

1√LQ

CF log2(LQ)ε0‖y‖2 +O(ε20).

(2.22)

Let y(l) = [(Dqθ)(l)]L and let qθ ≡ eθ2N

P (X+=N)u1[0,...,Q−1] ≡ M(θ)pθ1[0,...,Q−1] as in

Figure 2.11. By (2.11)

‖y‖∞ ≤ ‖Dqθ‖L∞ ≤ ‖qθ‖L1 = [M(θ)]L.

It follows that,

1√LQ‖y‖2 ≤ [M(θ)]L +

1√LQ‖y − y‖2 (2.23)

and since |(a+ h)L − aL| ≤ L|h||a|L−1 +O(|h|2), that

|y(l)− y(l)| ≤ LM(θ)L−1|(Dqθ)(l)− Dqθ(l)|+O(ε20).

Therefore,

‖y − y‖2 ≤ LM(θ)L−1‖Dqθ − Dqθ‖2 +O(ε20). (2.24)

As u ≡ D−1Q [ψK,•,θ,θ2(N)] it follows from (2.21) that

‖qθ − qθ‖2 ≤ ε0

[ ∆ψKe

θ2N

P (X+ = N)+ CFM(θ) log2Q

]+O(ε2

0)

≤ ε0∆pθM(θ) +O(ε2

0),

and since

‖qθ‖2 ≤ ‖qθ‖1 = M(θ),

it follows that

‖Dqθ − Dqθ‖2 ≤ ‖D(qθ − qθ)‖2 + ‖(D − D)qθ)‖2

≤√LQ [‖qθ − qθ‖2 + CF log2(LQ)ε0‖qθ‖2] +O(ε2

0)

≤√LQ [∆pθ

M(θ) + CF log2(LQ)M(θ)] ε0 +O(ε20).

(2.25)

42

Table 2.3: Range of parameters for testing bag-sFFT

Parameter Values

L 5, 10, 15, 30

N 5, 10, 15, 20, 50

π Uniform, Sloped, Blocked, Perturbed Uniform

s i21∗ L ∗ Imax i ∈ [1..20]

Here K = 4, Uniform refers to the case where πk = 1/4, Sloped refers toπk = k/10, Blocked refers to π = [0.2, 0.2, 0.3, 0.3] and Perturbed Uniformrefers to π = [0.2497, 0.2499, 0.2501, 0.2503].

Plugging (2.24), (2.25) and (2.23) back into (2.22) we get:

‖w − w‖2 ≤ LM(θ)L−1 [∆pθM(θ) + CF log2(LQ)M(θ)] ε0

+ CF log2(LQ)‖y‖∞ε0 +O(ε20)

≤ [L∆pθ+ (L+ 1)CF log2(LQ)]M(θ)Lε0 +O(ε2

0)

The proof is now immediate from p∗LQ (j) = w(j)e−θδj .

The reliability of this algorithm was tested by comparison to the numerically

stable, naive convolution based algorithm (NC) in [Hertz and Stormo, 1999] on a

typical range of parameters as described in Table 2.3. We found that in all 1600

cases the combination of bagFFT and sFFT is in agreement with the results from

NC to at least 11 decimal places and the theoretical bounds (from Claim 2.4 and

analogous to Corollary 2.1) guarantee accuracy to at least 5 decimal places.

43

2.8 Conclusion and Future Work

The bagFFT algorithm is asymptotically the fastest algorithm for computing the

exact p-value of the G2 statistic for goodness-of-fit tests. We complement the

algorithm with a rigorous analysis of the accumulation of roundoff errors in it.

Moreover, we show empirically that for a wide range of parameters these error

bounds are useful to guarantee the quality of the computed p-value. We demon-

strate the utility of our approach by combining bagFFT and sFFT to provide a

fast, new algorithm for estimating the significance of sequence motifs. The bagFFT

algorithm is available at http://www.cs.cornell.edu/˜niranjan/.

We are still working on certain algorithmic refinements to bagFFT. In particu-

lar, we wish to optimize bagFFT for computing a single p-value. This is motivated

by Hirji’s algorithm, which as a network algorithm, is optimized for computing

a single p-value based on pruning strategies described in [Hirji, 1997] (another

strategy is described in [Bejerano et al., 2004]). This pruning is one of the main

reasons Hirji’s algorithm is still faster than bagFFT for smaller N . We are cur-

rently working on providing similar runtime gains for bagFFT. Our future goals

include designing a “stitched” algorithm that can choose among a range of existing

algorithms so as to be optimal for any given set of parameter values and a desired

level of accuracy. We would also like to explore the applicability of bagFFT for

Pearson’s X2 and for log-linear models, as well as a generalization to contingency

tables, as is the case for Baglivo et al.’s algorithm [Baglivo et al., 1992].

The bagFFT algorithm serves as another demonstration of the effectiveness of

the shifted-FFT technique [Keich, 2005] to accurately compute vanishingly small

p-values. In recent work, we have studied the applicability of this method to non-

parametric tests such as the Mann-Whitney as well.

44

BIBLIOGRAPHY

[Baglivo et al., 1992] Baglivo,J., Olivier,D. and Pagano,M. (1992) Methods for ex-act goodness-of-fit tests. Journal of the American Statistical Association, 87(418), 464–469.

[Bailey and Elkan, 1994] Bailey,T. and Elkan,C. (1994) Fitting a mixture modelby expectation maximization to discover motifs in biopolymers. In Proceedingsof the Second International Conference on Intelligent Systems for MolecularBiology pp. 28–36 AAAI, Menlo Park, California.

[Bejerano et al., 2004] Bejerano,G., Friedman,N. and Tishby,N. (2004) Efficientexact p-value computation for small sample, sparse and surprising categoricaldata. J. Comput. Biol., 11, 867–886.

[Cressie and Read, 1984] Cressie,N. and Read,T. (1984) Multinomial goodness-of-fit tests. J. R. Statist. Soc. B, 46, 440–464.

[Cressie and Read, 1989] Cressie,N. and Read,T. (1989) Pearson’s χ2 and the log-likelihood ratio statistic g2: a comparative review. International Statistical Re-view, 57 (1), 19–43.

[Dembo and Zeitouni, 1998] Dembo,A. and Zeitouni,O. (1998) Large DeviationTechniques and Applications. Darmstadt, Germany: Springer Verlag.

[Hertz and Stormo, 1999] Hertz,G. and Stormo,G. (1999) Identifying DNA andprotein patterns with statistically significant alignments of multiple sequences.Bioinformatics, 15, 563–577.

[Hirji, 1997] Hirji,K. (1997) A comparison of algorithms for exact goodness-of-fit tests for multinomial data. Communications in Statistics-Simulation andComputations, 26 (3), 1197–1227.

[Hoeffding, 1965] Hoeffding,W. (1965) Asymptotically optimal tests for multino-mial distributions. Annals of Mathematical Statistics, 36, 369–408.

[Keich, 2005] Keich,U. (2005) Efficiently computing the p-value of the entropyscore. Journal of Computational Biology, 12 (4), 416–430.

[Mehta and Patel, 1983] Mehta,C.R. and Patel,N.R. (1983) A network algorithmfor performing fisher’s exact test in r × c contingency tables. Journal of theAmerican Statistical Association, 78 (382), 427–434.

45

[Press et al., 1992] Press,W., Teukolsky,S., Vetterling,W. and Flannery,B. (1992)Numerical recipes in C. The art of scientific computing. Second edition,, Cam-bridge University Press.

[Rahmann, 2003] Rahmann,S. (2003) Dynamic programming algorithms for twostatistical problems in computational biology. In Proceedings of the Third In-ternational Workshop on Algorithms in Bioinformatics (WABI-03), (Benson,G.and Page,R.D.M., eds), vol. 2812, of Lecture Notes in Computer Science pp.151–164 Springer, Budapest, Hungary.

[Sadreyev and Grishin, 2004] Sadreyev,R.I. and Grishin,N.V. (2004) Estimates ofstatistical significance for comparison of individual positions in multiple sequencealignments. BMC Bioinformatics, 5 (106).

[Siotani and Fujikoshi, 1984] Siotani,M. and Fujikoshi,Y. (1984) Asymptotic ap-proximations for the distributions of multinomial goodness-of-fit statistics. Hi-roshima Math. J., 14, 115–124.

[Stormo, 2000] Stormo,G. (2000) DNA binding sites: representation and discovery.Bioinformatics, 16 (1), 16–23.

[Tasche and Zeuner, 2001] Tasche,M. and Zeuner,H. (2001) Worst and averagecase roundoff error analysis for fft. BIT, 41 (3), 563–581.

CHAPTER 3

COMPUTING THE SIGNIFICANCE OF AN UNGAPPED LOCAL

ALIGNMENT

3.1 Introduction

Finding local similarities among a set of sequences is a common task in compu-

tational biology. For example, by finding similarities within a set of promoters

from coregulated genes, one hopes to recover transcription factor binding sites

that guide the genes’ expression patterns in vivo. Given a set of sequences,

motif finding algorithms such as MEME [Bailey and Elkan, 1994] and CONSEN-

SUS [Hertz and Stormo, 1999] return a number of possible alignments in some

order of potential biological relevance. A critical part of any such study is for a

researcher to discriminate between local alignments that are simply random arti-

facts of the sample, and local alignments that are so improbable by chance that

they are likely to be biologically relevant.

An ungapped local alignment of length L of sequences from an alphabet with A

letters is typically summarized by its information content, or entropy [Stormo, 2000]

as follows. Let nij denote the number of occurrences of the jth letter in the ith

column of the alignment, and let n be the number of sequences in the alignment.

The entropy score, or information content, of the alignment is defined as

I :=L∑

i=1

A∑

j=1

nij lognij/n

bj,

where bj is the background frequency of the jth letter (typically, bj is the frequency

of the jth letter in the entire sample).1 The entropy score for a given column i of

1Strictly speaking, relative entropy is defined as I/n.

46

47

the alignment is defined, similarly, as:

I(i) :=

A∑

j=1

nij lognij/n

bj.

While this score can be used to rank more than one alignment in a given sample,

it cannot provide any direct information about an alignment’s significance, and in

particular cannot be used to compare two alignments of varying L and n. To assess

the significance of an alignment with entropy score s0, we rely on the alignment’s

p-value, which is the probability of seeing an entropy score of s0 or better under

the assumption that each of the L columns has n letters independently sampled

according to the background distribution {b1, . . . , bA}. If the p-value is near 1 then

the columns in the alignment are too similar to the background for the pattern to

be interesting, but if the p-value is near 0 then the alignment suggests a functional

site.

Let p denote the probability mass function (pmf) of the column score I(i) under

the hypothesis that the column is noise—in the sense that it was sampled from the

multinomial distribution described by the background probabilities {b1, . . . , bA}.

Assuming that the entropy score for each of the L columns in the alignment is an

independent random variable, the pmf of the alignment’s total entropy score I is

given by the L-fold convolution of p:

p∗L(s) := p ∗ · · · ∗ p︸ ︷︷ ︸L

:=∑

(s1,...,sL):s1+···+sL=s

p(s1) . . . p(sL). (3.1)

The p-value of an alignment with score s0 is therefore F ∗L(s0) :=∑

s≥s0 p∗L(s).

Unfortunately, to naively compute this requires traversing all s ≥ s0, which is

prohibitively expensive in practice because of the large number of possible values

of s. As a result, multiple alignment programs rely on approximations to compute

48

the p-value, striving for a balance between the time spent computing and the

accuracy of the result.

To determine if approximating the p-value computation introduces errors in

practice, Jones and Keich [Jones and Keich, 2005] modified the source code of

MEME (version 3.0.3) and CONSENSUS (version 6c, April 2001) to score arbitrary

alignments, bypassing each algorithm’s motif finding step. In this way they were

able to compare p-value estimates from different algorithms on a variety of different

alignments. Figure 3.1 shows the results from their experiments where a point at

(x, y) is plotted for an alignment with CONSENSUS E-value of x and a MEME

E-value of y. The E-value of an alignment with score s0 is the expected number

of alignments in the sample with the same n and L and with entropy score greater

than or equal to s0. It can be obtained from the p-value by multiplying by the

number of possible alignments in the sample. As can be seen in the figure, the

MEME E-value is consistently larger than the CONSENSUS E-value (which is

reliable in this region) by roughly two orders of magnitude. Jones and Keich

found that in at least one case, the true E-value indicates an expectation of 10

alignments with comparable score existing in the sample, while MEME reports an

expectation of 5000 alignments; it is conceivable that a researcher would arrive at

two different conclusions about the significance of the same alignment by relying

on the two estimates. Furthermore, they found at least two alignments of the

same size that had inconsistent E-values according to MEME: one alignment had

a lower entropy and also a lower E-value than the other (entropy of 13.583, E-

value of 1.725× 107 compared to entropy of 13.617, E-value 4.1716× 107) which

is clearly a contradiction. Neither the approximation methods discussed in this

chapter, nor the methods proposed in [Hertz and Stormo, 1999] demonstrate this

49

-4

-2

0

2

4

6

8

10

12

-4 -2 0 2 4 6 8 10 12

log 10

(MEM

E E-

valu

e)

log10(Consensus E-value)

y=x

Figure 3.1: A comparison of MEME E-values to CONSENSUS E-values

The comparison was done for L = 15 and n = 20 where the sequences are oflength 1000 each. Since the CONSENSUS E-values are accurate over the rangeof scores considered here, MEME clearly overestimates the E-value in nearlyevery case; for alignments with E-values smaller than or equal to 1 according toCONSENSUS, MEME may report an E-value as large as 100.

instability.

It is important to note that the E-values from CONSENSUS were calculated

using an algorithm (LD; see below) that is fast but at times inaccurate. An example

of the ratio of CONSENSUS p-values to the true p-values (as calculated by the

slower but accurate NC algorithm; see below) is shown in Figure 3.2. Since the

CONSENSUS-reported estimates can be up to two orders of magnitude off, this

chapter introduces a compromise that achieves nearly the accuracy of NC, but at

speeds comparable to LD.

[Hertz and Stormo, 1999] suggest two possible approximation techniques for

50

398 400 402 404 406 408 410 412 414 416 418−1

−0.5

0

0.5

1

1.5

2

Score (s)

10lo

g (

LD(s

)/NC(

s))

Figure 3.2: Graph of log10(LD(s)/NC(s))

This graph demonstrates how far off the CONSENSUS-reported p-value may befrom the value it estimates. The parameters for this graph are n = 20, A = 4,L = 10 and b = [0.2497, 0.2499, 0.2501, 0.2503]. The gaps in the graph indicateareas of unattainable entropy values.

51

calculating the p-value. The first, NC, replaces I(i) with its latticed cousin Iδ(i) :=

bI(i)/δc. In this case, the L-fold convolution of pδ (the pmf of Iδ(i)) can be

done more efficiently than the L-fold convolution of p and is used to approximate

it. A naive algorithm for computing the L-fold convolution on a lattice requires

O(L2M2) time, where M is the size of the lattice. Hertz and Stormo note that using

the Fast Fourier Transform (FFT) to perform the convolution would decrease the

running time to O(LM log(LM)); however, the numerical instability of the FFT

algorithm tends to wreak havoc on the computation’s accuracy for small values,

which is exactly the region we are most interested in when searching for motifs. The

second method they suggest, LD, uses large deviation theory to estimate the tail

of an exponentially shifted probability distribution. In practice this approximation

scheme works quite well except for a range of values near the maximal (or minimal)

score where it may be off by an order of magnitude or more. Nevertheless, LD

is nearly 200 times faster to compute than NC for L = 10, A = 4, n = 100 and

M = 16384 and is therefore the method used in the popular CONSENSUS tool.

As an alternative to NC, [Keich, 2005] proposes the sFFT algorithm to over-

come the numerical instability of the FFT for the L-fold convolution step and also

delineates explicit bounds on the accuracy of the result. Though this method has

lower complexity than NC, it is still somewhat time consuming on large sample

sizes. In Section 3.2, we present improvements to sFFT that give rise to the fastest

known algorithm that has accuracy comparable to NC. We then describe an opti-

mization, the cyclic-shifted-FFT technique, to produce the csFFT algorithm which

is more efficient for the computation of a single p-value, with speed comparable to

LD.

52

3.2 Methods

Following the treatment in [Keich, 2005], we introduce the shifted-FFT (sFFT)

algorithm. The primary bottleneck of the algorithm presented in that paper is the

computation of the probability mass function of one column’s entropy score, which

we show here can be done much more efficiently.

3.2.1 The Shifted-FFT (sFFT) algorithm

The L-fold convolution of an arbitrary vector v ∈ CM , written v∗L, can be com-

puted as follows. Let N = ML and extend v to N dimensions by padding it with

zeros. Define w ∈ CN as w(k) = [(Dv)(k)]L (where D and D−1 are the DFT

operator and its inverse respectively as defined in Section 2.3). Then v∗L is given

by v∗L(l) := (D−1w)(l).

The straightforward implementations of D and D−1 require O(N 2) time, but

using a recursive divide-and-conquer strategy results in the Fast Fourier Transform

(FFT) which takes O(N logN) time. If D and D−1 are the respective implemen-

tations of D and D−1, then as shown in Section 2.4, due to numerical errors D

and D−1 are not exactly the linear and mutually inverse operators that D and

D−1 are. Correspondingly, this naive FFT based computation cannot recover v∗L

accurately.

To avoid the problem of roundoff errors in computing the L-fold convolution of

pδ, we can emphasize the values of pδ in the region surrounding s0 by applying an

appropriate exponential shift prior to performing the L-fold convolution. Let

pθ,δ(s) := pδ(s)eθs/Mδ(θ), (3.2)

where Mδ(θ) = E[eθIδ(i)

]is the moment generating function of the lattice score

53

for one column. This particular form of shifting commutes with the convolution

operator which makes it easy to convert between p∗Lθ,δ and p∗Lδ . Note that, as s is

latticed, pθ,δ is an M -dimensional vector. We will use the notation pθ,δ(j) to refer

to the jth entry in that vector, and pθ,δ(s) to refer to the value of pθ,δ for entropy

score s.

Since θ is a parameter, we can choose it in such a way that for a given alignment

score s0 we get the maximal ”resolving power” relative to noise due to numerical er-

ror in the DFT. Intuitively, the most significant contributions to the p-value should

come from values of p∗Lδ close to s0, so we choose to center the mean of the shifted

pmf for one column at s0/L so that p∗Lθ,δ is centered about s0. This can be satis-

fied, based on a standard large deviation procedure [Dembo and Zeitouni, 1998],

by setting

θ0 = argminθ [logMδ(θ)− θs0/L] . (3.3)

Of course, in order to proceed with the convolution, we need an estimate for

the pmf of a single column. This could be performed by naively enumerating

all possible empirical distributions for a column. While this approach has the

advantage of being the most accurate, it requires O(nA−1) time. For small values

of A (as is the case for nucleotide sequences) this algorithm is still computationally

tractable. However, in our experiments we found that even for small values of n,

with A = 4, this stage tends to dominate the runtime of the algorithm.

An algorithm with runtime O(AMn2) to calculate the pmf on a lattice was

proposed by Hirji [Hirji, 1997]2. This particular algorithm produces the pmf over

the entire range of possible values in one execution by using dynamic programming.

An improvement to the runtime of the algorithm can be obtained by noting that for

2It was later rediscovered by Hertz and Stormo [Hertz and Stormo, 1999].

54

small values of n, the number of non-zero lattice points in the intermediate stages

of the calculation is small, which allows one to employ a list-based data structure

to reduce the runtime to O(AM ′n log(n)) where M ′ is significantly smaller than

M in practice (< 10 for the parameters in Table 3.1). The resulting algorithm is

more efficient than the original but it still suffers from the overhead of computing

with log-values 3 in order to avoid underflows.

The underflow conditions in Hirji’s algorithm arise because it multiplies and

adds terms of the form ra(n′) = ba

n′

/n′! (where ba is the background distribution

for a ∈ [1..A] and n′ ∈ [0..n]) that are exponentially small in n′. These terms are

used to recursively compute the vector pδ,a,n′, where

pδ,a,n′(j) =n′∑

n′′=0

ra(n′′) · pδ,a−1,n′−n′′(j − ja(n′′)) (3.4)

pδ,1,n′(j) =

n! · r1(n′) if j = ja(n′) and n′ ∈ [0..n]

0 otherwise

and ja(n′) = round(δ−1n′ log(n

′/nba

)). As is shown in [Hertz and Stormo, 1999],

pδ(j) = pδ,A,n(j) and so this procedure recovers the pmf pδ.

In order to avoid the use of logarithms in these computations we design the

following procedure: instead of computing with the ra’s we shift them to get

r′a(n′) = ra(n

′)eδja(n′)+n′(log(n)−1)

and perform the recursion in (3.4) using r′a’s. Let the corresponding result be p′δ.

We can then recover pδ based on the following claim:

Claim 3.1. pδ(j) = p′δ(j)e−δj−n(log(n)−1)

3Addition of log-values in a C program, for example, was found to be more than10 times slower than regular addition.

55

0

0.5

1

1.5

2

2.5

3

0 50 100 150 200 250 300 350 400

Runt

ime

(in se

cs)

Number of sequences (n)

Runtime comparison of shifted-Hirji with log-Hirji and bagFFT

shifted-Hirjilog-HirjibagFFT

Figure 3.3: Runtime comparison for versions of Hirji’s algorithm and bagFFT

The paramters for this comparison are A = 4, M = 1024 and the uniformbackground distribution.

The proof of this claim is based on simple induction using (3.4) and is therefore

omitted.

The shifted computation described above (shifted-Hirji) avoids the underflow

conditions of Hirji’s original algorithm. This is because, even though ra(n′) de-

creases exponentially with respect to n′, r′a(n′) remains approximately 1√

2πn′. For

practical values of n this improvement to Hirji’s algorithm does not introduce any

numerical errors into the result, and in some cases it may be more accurate than

relying on logarithms (we refer to this version as log-Hirji). As can be seen from

Figure 3.3, it also substantially improves the runtime (by more than a factor of 10,

on average). Note that, as shown in Chapter 2, bagFFT is asymptotically more

56

0

1

2

3

4

5

6

7

8

9

20 40 60 80 100 120 140 160

Runt

ime

(in se

cs)

Number of sequences (n)

Runtime comparison of shifted-Hirji with bagFFT for A=20

shifted-HirjibagFFT

Figure 3.4: Runtime comparison of shifted-Hirji and bagFFT for A = 20

The paramters for this comparison are M = 16384 and the uniform backgrounddistribution.

efficient than Hirji’s algorithm and can be combined with sFFT to compute the

significance of motif scores. However, as illustrated by Figure 3.3, shifted-Hirji

can be more efficient than bagFFT for A = 4 and the values of n that we are

typically interested in for finding transcription factor binding sites. The practical

advantages of bagFFT over shifted-Hirji are more evident for larger A (as in the

case of protein alignments) and n as is suggested by Figure 3.4.

The modified sFFT algorithm is shown in Figure 3.5. It is important to note

that the proof of correctness for the original sFFT algorithm [Keich, 2005] is triv-

ially extended to this case where the complete enumeration of the pmf is replaced

with the shifted-Hirji algorithm4. Thus, this version of the sFFT algorithm is

4The original bounds on the p-value are now replaced with looser bounds

57

The input to sFFT is:

• n, the number of sequences

• L, the number of columns in the alignment

• b1, . . . , bA, the background frequencies of the A letters

• M , the size of the lattice

• s0, the observed score

Given the input, sFFT:

1. Computes pδ, an estimate of pδ by using the shifted-Hirji algorithm.

2. Finds θ0 by numerically solving (3.3).

3. Computes pθ0,δ(s) according to (3.2).

4. Computes p∗Lθ0,δ by applying the FFT-based convolution to pθ0,δ(s).

5. Computes p∗Lδ (j) = p∗Lθ0,δ(j)e−θ0jδ+L log fMδ(θ0) for j0 ≤ j ≤ jmax,

where j0 and jmax are the lattice indices corresponding to s0 and

the maximum score smax.

6. Returns sFFT(s0) :=∑jmax

j=j0p∗Lδ (j).

Figure 3.5: The sFFT algorithm

faster than the original, yet just as reliable.

3.2.2 The Cyclic Shifted-FFT (csFFT) algorithm

The sFFT algorithm above can compute p-values for a range of possible alignment

scores, which is wasteful when all we need is a single p-value. Fortunately, most

of the mass of the shifted probability mass function p∗Lθ,δ arises from a restricted

range of possible s-values as Figure 3.6(a) suggests.

58

0 5 10 15x 104

0

1

2

3

4

5

6x 10−3

j

p θ,δ

∗ L (j)

pθ,δ∗ L

(L−1)M

(a) An example where p∗Lθ,δ has its es-

sential support in a narrow interval

defined by [(L−1)M,LM ]; s0 = 405.

0 5 10 15x 104

0

1

2

3

4

5

6x 10−5

j

p θ,δ

∗ L (j)

pθ,δ∗ L

(L−4)M

(b) Here, p∗Lθ,δ has its essential sup-

port in an interval larger than M ,

defined by [(L−4)M,LM ]; s0 = 350.

0 5 10 15x 104

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5x 10−5

j

p θ,δ

∗ L (j)

pθ,δ∗ L

(L−7)M(L−2)M

L’M

(c) The support for p∗Lθ,δ is mainly in

an L′M interval for L′ = 5, but not of

the form [(L−L′)M,LM ]; s0 = 300.

Figure 3.6: The shifted pmf is 0 for much of the valid values of s

Here, M = 10000, n = 20, L = 15, b = [0.2499, 0.2501, 0.2497, 0.2503]. The“essential support” intervals here are described by indices for the latticed pmf.

59

We would like to avoid computing p∗Lθ,δ on those large intervals where it is practically

zero. To that end, consider the following cyclic sum of p∗Lθ,δ :

q(i) =∑

{j:j mod M=i}p∗Lθ,δ(j). (3.5)

In the example described in Figure 3.6(a), p∗Lθ,δ ≈ 0 for j /∈ [(L− 1)M,LM ]; there-

fore, it follows that q(i) (where i = j mod M) approximates p∗Lθ,δ(j) on the interval

j ∈ [(L − 1)M,LM ]. Since q is defined on a lattice of size M rather than on a

lattice of size LM , we can immediately save a factor of L, provided q is efficiently

computable. Since q is the cyclic convolution of pθ,δ it can be efficiently computed

by:

Claim 3.2. q = D−1M w, where w(k) =

[(DM pθ,δ

)(k)

]L.

The proof of this claim can be found in [Press et al., 1992]. The difference

between the formula above and its non-cyclic analog is the dimensionality of the

DFT operator: here it is M while the DFT operator previously had dimensionality

LM with pθ,δ appropriately padded with (L− 1)M zeros.

More generally, the essential support interval of p∗Lθ,δ may be of size L′M (for

example, see Figure 3.6(b)). Such an interval may also not be strictly of the form

[(L−L′)M,LM ]; instead being centered about s0 (for example, see Figure 3.6(c)).

In this case, rather than directly calculating

F ∗Lδ (j0) :=

j≥j0p∗Lδ (j)

we approximate it with

F ∗Lθ (j0) :=

j0≤j≤Jq(j)e−θ0jδ+L logM(θ0)

60

where J = min(j0 + L′M/2, jmax). This is justified by

j0≤j≤Jq(j)e−θ0jδ+L logM(θ0)

≈∑

j0≤j≤Jp∗Lθ,δ(j)e

−θ0jδ+L logM(θ0)

≈∑

j≥j0

p∗Lθ,δ(j)e−θ0jδ+L logM(θ0)

An appropriate choice of L′ would ensure that, say, 95% of the mass of p∗Lθ,δ

lies in the interval of size L′M centered about j0. This would be relatively easy

if we had an explicit function for p∗Lθ,δ, but this is exactly the function we are

trying to estimate. Instead, we rely on the following formula for L′, under the

assumption that p∗Lθ,δ is roughly normally distributed (an assumption made by the

LD algorithm):

L′ :=

⌈kσ√LσθM

⌉. (3.6)

Here, σ2θ := Var pθ,δ, where pθ,δ is the integer lattice version of (3.2) and the

variance is computed on the lattice indices. Note that kσ√Lσθ = kσ

√Var(p∗Lθ,δ) so

an interval of size L′M centered about j0 extends roughly kσ/2 standard deviations

on each side. Thus, if we arbitrarily set kσ := 4, then (3.6) roughly ensures the

desired 95% condition under the assumption of normality.

3.2.3 Boosting θ

As observed in [Hertz and Stormo, 1999], when s approaches smax, θ increases

while σθ decreases. Thus, for s0 close to smax, if we increase or boost θ beyond

the computed θ0 = θ(s0) from (3.3), we reduce σθ. Since L′ depends linearly on

σθ (from (3.6)), such boosting can effectively decrease the runtime by reducing L′.

Another reason to boost θ for s near smax is that it reduces the error introduced

by approximating sFFT with the cyclic sum in csFFT, as shown next.

61

Claim 3.3. Let d = L′M , J ′ = min (j0 + d− 1, jmax) and j ′ ≡ j mod d. Then

F ∗Lθ (j0)− F ∗L

δ (j0) ≤∑

j0≤j≤J

j′<j

p∗Lδ (j ′)e−θ0(j−j′)δ (3.7)

+∑

j0≤j≤J

j′>j

p∗Lδ (j ′)(e−θ0(j−j′)δ − 1) (3.8)

F ∗Lδ (j0)− F ∗L

θ (j0) ≤∑

j0+d/2<j≤J ′

j′>j

p∗Lδ (j ′) (3.9)

The proof of the claim is straightforward from the definitions and is therefore

omitted.

Suppose s is sufficiently close to smax so that j0 + d/2 > jmax. In that case the

right hand side of (3.9) vanishes leaving F ∗Lθ (j0) as an upper bound of the p-value,

F ∗Lδ (j0). Moreover, the term (3.8) vanishes as well and we are left with:

0 ≤ F ∗Lθ (j0)− F ∗L

δ (j0) (3.10)

≤∑

j0≤j≤jmax

j′<j

p∗Lδ (j ′)e−θ0(j−j′)δ. (3.11)

This upper bound on the error decreases as θ0 increases, which supports our as-

sertion that boosting θ is beneficial for s close to smax.

One might be tempted to boost θ by a large amount, but while this would indeed

reduce the error in (3.11) it would have the unfortunate side effect of increasing

the numerical errors in the FFT (discussed at length in [Keich, 2005]).

An intermediate solution is to boost θ by adding

θboost = log(109)/((jmax − j0)δ). (3.12)

This solution can boost θ significantly and bring corresponding savings in runtime,

as well as reduce the error in (3.11). It is also designed (based on some assumptions

about p∗Lθ,δ) to still preserve the important entries of p∗Lθ,δ (for computing the p-value)

during the FFT. Finally, while this solution is heuristic, it works well in practice,

as is shown in Sections 3.3.1 and 3.3.2.

62

The input to csFFT is:

• n, the number of sequences

• L, the number of columns in the alignment

• b1, . . . , bA, the background frequencies of the A letters

• M , the size of the lattice

• s0, the observed score

Given the input, csFFT:

1. Computes pδ, an estimate of pδ by using the shifted-Hirji algorithm.

2. Finds θ0 by numerically solving (3.3).

3. Computes L′ according to (3.6) and using the default kσ = 4.

4. Boosts θ0 by (3.12) if j0 + L′M/2 > jmax.

5. Computes pθ0,δ(s) according to (3.2).

6. Computes p∗Lθ0,δ by applying the FFT-based cyclic-convolution to pθ0,δ(s)

with period L′M .

7. Computes p∗Lδ (j) = p∗Lθ0,δ(j)e−θ0jδ+L log fMδ(θ0) for j0 ≤ j ≤ J .

8. Returns csFFT(s0) :=∑J

j=j0p∗Lδ (j).

Figure 3.7: The csFFT algorithm

63

The cyclic shifted FFT algorithm (csFFT) with boosting is shown in Figure 3.7.

For typical values of L, the csFFT algorithm is simultaneously more accurate than

and comparable in speed to LD.

3.3 Results

3.3.1 Runtime characterization

Assuming that the time-limiting step of sFFT is the calculation of the FFT itself,

csFFT is roughly L/L′ times faster than the sFFT algorithm described in the

previous section. Interestingly, the savings of L/L′ varies with s0: the speedup for

values of s0 near the center of the distribution is modest, while the best gains occur

near the ends of the range of possible s-values. This follows from the fact that as

s0 approaches smax (or smin), the corresponding σθ goes to 0 yielding a smaller L′

in (3.6). In any case, the complexity of csFFT is lower than that of sFFT: by (3.6)

the complexity of the FFT step is now O(√LM log(

√LM)).

We conducted tests to verify that csFFT is indeed more efficient than sFFT.

Since sFFT and csFFT differ mainly in the convolution step of the algorithm

where the running times are roughly linear in L and L′ respectively, we focus on

the growth of L′ in terms of L. Figure 3.8(a) demonstrates that if we take the

average value of L′ over the range of s values, it grows roughly as√L when all

other parameters are fixed. In addition, the average value of L′ is roughly constant

for different ba’s from Table 3.1 (based on a test with L = 10 and n = 10) and

decreases as n increases5 (see Figure 3.8(b)). Furthermore, we found that boosting,

when it is applicable, gives substantial runtime gains; halving the runtime in many

5In practice the runtime increases with n as we have to increase M proportion-ally to maintain the granularity of the lattice.

64

1.5

2

2.5

3

3.5

4

4.5

5

5.5

6

0 5 10 15 20 25 30 35 40 45 50

Ave

rage

L’

Length of alignment (L)

Average L’ as a function of L

Avg L’

(a) n = 10 and M = 16384.

2.4

2.45

2.5

2.55

2.6

2.65

2.7

2.75

2.8

2.85

2.9

5 10 15 20 25 30 35 40 45 50

Ave

rage

L’

Number of sequences (n)

Average L’ as a function of n

Avg L’

(b) L = 10 and M = 16384.

Figure 3.8: Average values of L′ versus L and N

The results were obtained using the perturbed Uniform ba and the averages aretaken over 100 evenly spaced values of s.

cases. Finally, since csFFT relies on a substantially faster convolution than sFFT,

we found that for tests with large n, small L and s close to smax the runtime of the

algorithm is no longer dominated by the time for the convolution. For example, in

a test case with n = 20, L = 15, ba = Uniform, M = 16384 and s = 380, csFFT

takes 0.09s to compute the answer, of which 0.01s is spent in the shifted-Hirji

step, 0.07s is spent computing the shift, and 0.01s is required for the cyclic-FFT

(L′ = 2). New techniques to reduce the time spent in computing the shift could

be a useful addition to the algorithm.

3.3.2 Error analysis

For each combination of parameter values in Table 3.1 we tested 20 roughly evenly

spaced values for s and separately another set of 100 points lying in the tail of the

pmf. Because we have latticed s, the p-value of s has an inherent lattice error, as

65

Table 3.1: Range of test parameters

Parameter Values

L 5, 10, 15, 30

n 2, 5, 10, 15, 20, 50

ba Uniform, Sloped, Blocked,

Perturbed Uniform

Uniform refers to the case where b = [0.25, 0.25, 0.25, 0.25], Sloped refers tob = [0.1, 0.2, 0.3, 0.4], Blocked refers to b = [0.2, 0.2, 0.3, 0.3] and PerturbedUniform refers to b = [0.2497, 0.2499, 0.2501, 0.2503].

discussed in [Keich, 2005]. For any given value of s, sFFT(s) and csFFT(s) fall

within a small range; the true value falls somewhere in between the minimum and

maximum values in that range. The bounds for the p-values computed by csFFT

was then compared to the provably reliable bounds from sFFT. In all cases that

we tested, we found that the bounds agreed to more than 1 decimal place. The

cases with the worst disagreement were usually found to be for values of s close to

the average of the pmf where the p-values are large and therefore not very relevant

in most applications.

3.3.3 Stitching LD and csFFT

The csFFT algorithm is simultaneously more accurate than and comparable in

speed to LD. For example, for L = 10, n = 100, ba = Uniform, M = 16384 and

s = 380, CONSENSUS’s p-value computation required 0.32s, while csFFT required

0.20s with L′ = 3. Admittedly, this is a somewhat biased example as n = 100 is

likely larger than typical problems. For the example in the previous section, on the

66

Table 3.2: Runtime comparison between csFFT and LD

n L s Runtime for csFFT Runtime for LD

(in seconds) (in seconds)

40 5 200 0.04 0.06

15 5 100 0.01 0.01

15 30 600 0.05 0.01

40 30 600 0.12 0.06

40 5 260 0.02 0.06

The comparisons were made using the Uniform ba (see Table 3.1) and with Q setto 16384.

other hand, LD is faster by a factor of 4. We present a few more examples in Table

3.2. In general, LD is faster than csFFT for small n and large L and also for values

of s that are away from the tail with larger L′. We can exploit this by designing a

heuristic rule that switches to LD for appropriate values of n and L. In designing

a switching criterion we also need to consider the approximation errors inherent

to LD; an example is given in the introduction in which LD gives a very poor

approximation. [Hertz and Stormo, 1999] present an empirical test that can be

used to gauge the reliability of the LD-based normal approximation. Essentially, if

s is less than 3 standard deviations (of the shifted pmf) from smax then the normal

approximation is no longer reliable. We calculated the observed error of the LD

method in the range defined by this test for the set of parameters in Table 3.1 and

found that in all cases the error ratio was less than 1.24, corresponding to less than

24% error. We can therefore use this test in conjunction with csFFT to yield an

algorithm that is efficient and accurate over a larger range of n and L values.

67

3.4 Conclusion

Accurate methods for estimating the p-value of an alignment score are critical

in aiding the discovery of biologically meaningful signals from sets of related se-

quences. While existing tools provide estimates, it is clear that some estimates are

better than others. The method employed by MEME is overly pessimistic about an

alignment, which could conceivably lead to missed signals. While the method used

by CONSENSUS is more accurate, it can still improperly estimate the p-value.

Two methods were presented in this chapter that work well in practice for DNA

motifs. While the first (sFFT) is not quite as fast as LD, it is significantly faster

than NC, has bounded error estimates, and returns p-values for a range of entropy

scores. The second, csFFT, is comparable in speed to LD and is empirically more

accurate, but like LD returns a p-value only for a single entropy score.

The algorithms described in this chapter provide a general method for the

computation of p-values for ungapped alignments. Extending these methods to

account for gapped alignments is, however, an important and interesting topic

for future research. The methods described in this chapter can also be used for

applications other than motif finding. These tools may be helpful wherever a sta-

tistical significance of a multiple alignment is desired; for example, in the problem

of profile-profile alignment or in the analysis of protein families.

68

BIBLIOGRAPHY

[Baglivo et al., 1992] Baglivo,J., Olivier,D. and Pagano,M. (1992) Methods for ex-act goodness-of-fit tests. Journal of the American Statistical Association, 87(418), 464–469.

[Bailey and Elkan, 1994] Bailey,T. and Elkan,C. (1994) Fitting a mixture modelby expectation maximization to discover motifs in biopolymers. In Proceedingsof the Second International Conference on Intelligent Systems for MolecularBiology pp. 28–36 AAAI, Menlo Park, California.

[Dembo and Zeitouni, 1998] Dembo,A. and Zeitouni,O. (1998) Large DeviationTechniques and Applications Second edition,, Springer-Verlag, NY, USA.

[Jones and Keich, 2005] Jones, N.C., and Keich, U. (2005) Personal Communica-tion.

[Hertz and Stormo, 1999] Hertz,G. and Stormo,G. (1999) Identifying DNA andprotein patterns with statistically significant alignments of multiple sequences.Bioinformatics, 15, 563–577.

[Hirji, 1997] Hirji,K. (1997) A comparison of algorithms for exact goodness-of-fit tests for multinomial data. Communications in Statistics-Simulation andComputations, 26 (3), 1197–1227.

[Keich, 2005] Keich,U. (2005) Efficiently computing the p-value of the entropyscore. Journal of Computational Biology, 12 (4), 416–430.

[Keich and Nagarajan, 2004] Keich,U. and Nagarajan,N. (2004) A faster reliablealgorithm to estimate the p-value of the multinomial llr statistic. In Proceedingsof the fourth Workshop on Algorithms in Bioinformatics (WABI-04).

[Press et al., 1992] Press,W., Teukolsky,S., Vetterling,W. and Flannery,B. (1992)Numerical recipes in C. The art of scientific computing. Second edition,, Cam-bridge University Press.

[Stormo, 2000] Stormo,G. (2000) DNA binding sites: representation and discovery.Bioinformatics, 16 (1), 16–23.

CHAPTER 4

REFINING MOTIF FINDERS WITH E-VALUE CALCULATIONS

4.1 Introduction

The problem of motif finding can be summarized as scanning a given set of se-

quences for short, well-conserved ungapped alignments. Most of the interest in this

problem comes from its application to identification of transcription factor binding

sites, and of cis-regulatory elements in general. These in turn are important to

the fundamental problem of understanding the regulation of gene expression. This

motivated the design of several popular motif finding tools that search for short

sequence motifs given only an input set of sequences (see [Tompa et al., 2005] for

a recent comparative review).

Most existing motif finders can be divided into two classes depending on whether

they model a motif with a consensus sequence or with a position weight matrix

(PWM or profile). Commonly used motif finders that fall in this latter category

include MEME [Bailey and Elkan, 1994], CONSENSUS [Hertz and Stormo, 1999]

and the various approaches to Gibbs sampling (for example [Lawrence et al., 1993,

Neuwald et al., 1995, Hughes et al., 2000]). This chapter concentrates on improv-

ing this popular class of finders.

Profile-based motif finding algorithms typically try to optimize the entropy

score, or information content of the reported alignment (as defined in Chapter

3). In order to assign statistical significance to the reported motifs as well as to

be able to compare alignments of different widths and depths Hertz and Stormo

introduced the notion of a motif E-value. Introduced originally in this context as

the “expected frequency” [Hertz and Stormo, 1999], the E-value is the expected

69

70

number of random alignments of the same dimension that would exhibit an entropy

score that is at least as high as the score of the given alignment. When the E-value

is high, one can have little confidence in the motif prediction, and conversely when

the E-value is low, one can have more confidence in the prediction. It is computed

by multiplying the number of possible alignments by the p-value of the alignment

(which is the subject of Chapter 3). The latter is defined as the probability that

a single given random alignment would have an entropy score ≥ the observed

alignment score.

While the E-value is the chosen figure-of-merit for evaluating motifs in popular

motif finders such as MEME and CONSENSUS it is not directly optimized for.

For example, in MEME E-values are only computed after the EM-algorithm com-

pletes its optimization and are only used for significance evaluation and possibly

for comparing motifs of different widths. Similarly, when CONSENSUS looks to

extend a sub-alignment (matrix) in its greedy search strategy, it chooses the one

that optimizes the entropy rather than the E-value1. One of the main reasons for

this separation between optimization and significance analysis is that E-values are

significantly more expensive to compute than entropy scores. Even the relatively

fast (and potentially inaccurate as shown in Chapter 3) large-deviation method

that CONSENSUS employs for computing the E-value can tax an optimization

procedure at an unacceptable level.

The discussion above raises two questions:

• Cost aside, can a more direct optimization of the E-value improve our results?

• Can we compute the E-values efficiently so that they can be optimized for?

1These two approaches would generally differ if the lengths of the sequences arenot identical.

71

This chapter lays out arguments advocating a positive answer for both questions.

We begin by describing a new technique, memo-sFFT (based on the techniques

in Chapter 3), that allows us to accurately and efficiently compute multiple E-

values. We then present the Conspv program that uses the memo-sFFT system

to implement a CONSENSUS style motif finder that directly optimizes E-values.

The Conspv program generalizes readily to the problem of finding motifs of un-

known widths and is functionally equivalent to a combination of CONSENSUS

and WCONSENSUS [Hertz and Stormo, 1999]. We show based on experiments

on synthetic data that Conspv can significantly improve over WCONSENSUS

for finding motifs of unknown widths. As further evidence to the advantage of

a more direct optimization of the E-values, we describe the Gibbspv algorithm

[Ng and Keich, 2006]. This new variant of the Gibbs-sampling algorithm is es-

pecially effective when searching for motifs of unknown width by incorporating

memo-sFFT to efficiently consider E-values in its optimization procedure. In our

experiments on synthetic datasets, Gibbspv clearly outperforms other motif finders

for finding motifs of unknown width.

It should be noted that GLAM [Frith et al., 2004] is conceptually quite similar

to Gibbspv as both rely on a Gibbs sampling procedure to optimize an overall mea-

surement of statistical significance. However GLAM uses a different significance

analysis and as we show below in our tests it is less successful than both Conspv

and Gibbspv.

4.2 Efficiently computing E-values

In a typical application of CONSENSUS in the experiments described in Section 4.6

about 108 alignments are compared. CONSENSUS compares them using entropy

72

scores that can be computed in O(wn + wA) time from scratch, where w is the

width of the motif, n is the number of sequences and A is the alphabet size (in this

chapter a DNA alphabet of 4 letters). Note that the typical case in CONSENSUS

is actually when the score is updated while extending a sub-alignment and this

takes O(w) time. In comparison, computing E-values reliably can be many orders

of magnitude more expensive if done naively. An efficient algorithm for reliably

computing a single p-value (a crucial time-limiting step for computing E-values,

see [Hertz and Stormo, 1999]) can typically take ≈ 0.01s for the test sets in Section

4.6. This can be prohibitively expensive if incorporated into Conspv (see Table

4.1).

A partial solution to this problem is to memoize the results. However, we can

do even better by relying on algorithms that can compute p-values for a range

of scores ([Hertz and Stormo, 1999], [Keich, 2005]). While a single application of

these algorithms can be more than 10 times slower, this is compensated for by

the fact that they compute a range of p-values that can be stored and reused.

We exploit this feature to extend the sFFT algorithm in [Keich, 2005]2 to the

memo-sFFT algorithm shown in Figure 4.1.

In addition we also implemented the following optimizations to memo-sFFT

for its use in Conspv and Gibbspv:

• sFFT computes an array pδ (the pmf of a single column) as the first step in

its calculations and this array is independent of the value of w. We utilize

this fact and modify sFFT to save and reuse this array across runs.

• The sFFT algorithm requires a lattice size Q (or equivalently a step size δ)

2As shown there, the sFFT algorithm is much more efficient than the numericalmethod in [Hertz and Stormo, 1999].

73

memo-sFFT(n, w, I)

1 if accuracy[n][w][I] < B

2 then (pvalue sFFT , accuracy sFFT )← sFFT(n, w, I)

3 for each I

4 do if accuracy[n][w][I] < accuracy sFFT [I]

5 then pvalue[n][w][I]← pvalue sFFT [I]

6 accuracy[n][w][I]← accuracy sFFT [I]

7

8 return pvalue[n][w][I]

Figure 4.1: The memo-sFFT algorithm

Here I is the latticized entropy score [Keich, 2005], B is a desired upper-boundon the relative error (that we set to 10−2) and each entry of array accuracy isinitialized to a value ≥ B. Note that we use the term accuracy here to refer tothe rigorous bound on the relative roundoff error that can be computed forp-values computed using sFFT [Keich, 2005].

74

that acts as a knob to trade accuracy for speed. We found that setting δ

to 0.02 provides good accuracy3 while being efficient for the experiments in

Section 4.6.

• As observed in [Keich, 2005] the sFFT algorithm can typically be used to

recover the entire range of p-values (for a given n and w) in a small number

(≤ 3) of invocations. In particular, we found that a single well-chosen call

to sFFT (θ = 1) can provide a good starting point for memo-sFFT and we

implemented this as part of our system.

As can be seen from the results in Table 4.1, Conspv based on memo-sFFT is

indeed much more efficient than a version that computes E-values based on the

large-deviation method in CONSENSUS. For the sets described in Section 4.6, we

found that less than half a minute is spent in pre-computing p-values in Conspv

and the amortized cost of a call to memo-sFFT is essentially that of a table-

lookup. The memo-sFFT system therefore opens up the possibility of designing

better motif finders that directly optimize the E-value and we present two such

algorithms in the next two sections.

4.3 Optimizing for E-values - Conspv

The Conspv program in its simplest form adapts the CONSENSUS algorithm with

the difference being that it uses E-values rather than entropy scores to compare

alignments. More specifically, we implemented a version of the CONSENSUS

algorithm under the OOPS model [Bailey and Elkan, 1995] and the -pr2 option

(save the best alignment extension for each alignment). We also employed the

3Note that the p-value is computed as the geometric mean of the bounds re-turned by sFFT.

75

Table 4.1: The advantage of using memo-sFFT

Experiment memo-sFFT CONSENSUS

CRP-100 3.0 7.5

CRP-500 3.5 32.7

CRP-1000 4.2 65.1

CRP-5000 9.5 316.6

The columns memo-sFFT and CONSENSUS report the runtime (in seconds) forConspv implemented with memo-sFFT and the large-deviation method inCONSENSUS respectively, for the various test sets. The CRP-X sets contain 18sequences of length 108 and X specifies the number of alignments saved byConspv in its beam search (corresponding to the -q option for CONSENSUS).

memo-sFFT system described in Section 4.2 to compute E-values. While the cost

of computing E-values using this system is essentially a constant, this can still be

a significant time penalty for Conspv. We therefore optimized its running time

further by not computing the E-value for alignments that have too low an entropy

score to be worthy of consideration. This is determined by keeping a lower bound

for the entropy score based on alignments that do not make it into the list of best

alignments4.

When run on a set of sequences that have identical lengths the CONSENSUS

algorithm (using the entropy score to compare alignments) can be seen as a greedy

algorithm to optimize the E-value. However, on a set of sequences of varying

lengths this is no longer the case. For such a set, CONSENSUS only optimizes the

E-value indirectly. To test if this makes a difference to the performance of CON-

SENSUS, we compared it to Conspv on some of the test sets described in Section

4Note that CONSENSUS is a beam search algorithm that maintains a list ofbest alignments seen so far.

76

Table 4.2: Tests on sequences of varied length

Experiment CONSENSUS TPs Conspv TPs

COMBO1 174 195

COMBO2 146 153

FIFTY1 62 145

FIFTY2 30 41

The values reported here are the number of tests in which the reported motif hasa significant overlap with the implanted motif (see Section 4.6 for details) out ofa total of 200 tests.

4.6. As can be seen from the results presented in Table 4.2, Conspv can significantly

improve on the results of CONSENSUS. The improvement is most pronounced on

the sets COMBO1 and FIFTY1 corresponding to sets where the sequence lengths are

more diverged.

A major advantage of Conspv is that it lends itself naturally for searching over

multiple motif widths. Since alignments are compared using E-values there is no

need for heuristics such as the one in WCONSENSUS [Hertz and Stormo, 1999].

To exploit this, we implemented a version of Conspv that takes a range of widths

to search over as input5. The single width version of Conspv is then extended as

follows: instead of ranking an alignment by its E-value for a given fixed width, we

now rank by the optimal E-value for widths in the given range. A naive alternative

approach to this (that we refer to as WECons) is to run CONSENSUS for each of

the widths in the given range and choose the motif with the optimal E-value.

The advantage of Conspv over WECons derives from the fact that the cost

of running Conspv for r different widths (where the largest width is wmax) is

5A generalization to a set of allowed widths can also be easily implemented.

77

much less than the cost of r runs of CONSENSUS. This is essentially because a

majority of the running time of CONSENSUS is spent in evaluating extensions

to alignments and this can be done in O(wmax) time in both CONSENSUS and

Conspv. In practice, Conspv is a bit more than twice slower compared to a single

run of CONSENSUS. This improved runtime is exploited by Conspv as follows:

when searching for the best motif CONSENSUS maintains a list (of size q specified

by the user) of the best motifs seen so far. When searching over w different widths,

CONSENSUS would maintain w such lists independently. In Conspv a single, much

larger list can be maintained for the same total runtime. This enables Conspv to

devote more time looking at the promising motifs regardless of their width and

thus do a better search of the motif space.

To assess the relative performance of Conspv, WECons and WCONSENSUS6

we compared them over several synthetic datasets (see Section 4.6 for details).

The results were qualitatively similar across the datasets and a couple of them

are presented in Table 4.3. As can be seen in Table 4.3, Conspv can improve

substantially over WCONSENSUS and WECons in finding motifs that overlap the

implanted motifs in our datasets. This is also uniformly true across the various

overlap scores that we measured and for various thresholds of overlap (as indicated

by Figure 4.2).

4.4 E-value based improvements of the Gibbs sampler

Having established that consideration of E-values can improve the performance of

CONSENSUS we next look at the Gibbs sampler. In particular we look at the prob-

6Since WCONSENSUS does not allow the user to directly specify a rangeof widths we instead varied the bias parameter (the -s option) over the range{0.5, 1, 1.5, 2.0} (as suggested in [Hertz and Stormo, 1999]).

78

−0.2 0 0.2 0.4 0.6 0.8 1 1.20

50

100

150

num

ber o

f dat

aset

s

overlap−coverage

conspvWEConsWConsensus

−0.2 0 0.2 0.4 0.6 0.8 1 1.20

20

40

60

80

100

120

140

160

overlap−accuracy

num

ber o

f dat

aset

s

conspvWEConsWConsensus

Figure 4.2: Performance of CONSENSUS based motif finders

The histogram here shows the number of datasets as a function of the overlapscore for the COMBO3 experiment and the various motif finders in Table 4.3.

79

Table 4.3: Comparison of CONSENSUS based motif finders

Experiment Finders TPs Cov Acc

WCONSENSUS 52 38 29

COMBO3 WECons 49 40 42

Conspv 89 76 74

WCONSENSUS 57 46 43

GAP1 WECons 46 39 38

Conspv 74 63 60

In the “TPs” column we report the number of tests where there is significantoverlap with the implanted motif out of a total of 200 tests. Also, the “Cov” and“Acc” columns report the number of tests in which the overlap is a substantialfraction of the implanted (overlap-coverage) and reported (overlap-accuracy)motifs respectively. For details on the experiments and the overlap scores seeSection 4.6.

lem of unknown motif width. Lawrence et al. [Lawrence et al., 1993] considered

several criteria for choosing the right width from multiple runs of their sampler, a

run for each possible width. The criterion they eventually recommend is termed the

“information per parameter” which is the incomplete-data log-probability ratio (22

in [Lawrence et al., 1993]) divided by the number of free parameters ((A− 1)w).

Below we refer to this version of Gibbs as WGibbs.

An obvious alternative to WGibbs in the spirit of WECons, which we call

WEGibbs, is to choose the run with the width that optimizes the E-value instead

of the original information per parameter. Using again the tests described in

Section 4.6 we found that WEGibbs does a significantly better job than WGibbs

at detecting the implanted motifs (see Table 4.4 and Figure 4.3). The next logical

step is to ask whether a Gibbs analogue of Conspv that would more intimately

80

Table 4.4: Comparison of Gibbs samplers

Experiment Finders TPs Cov Acc

WGibbs 20 15 17

COMBO3 WEGibbs 125 117 118

Gibbspv 146 137 136

WGibbs 17 13 11

GAP1 WEGibbs 77 64 60

Gibbspv 95 82 79

The comments following Table 4.3 apply here as well.

link the E-values to the optimization procedure can further improve these results.

Gibbspv [Ng and Keich, 2006], a new variant of the Gibbs sampling procedure, is

an attempt to answer this question.

The original Gibbs-sampling motif finder begins each run by picking a random

starting position in each sequence in the data set. The algorithm then sequentially

applies the following two-step procedure to each of the sample sequences. The

predictive update step computes a motif model Θ based on the current chosen set

of starting positions7. The sampling step in turn randomly selects new candidate

starting positions in the current sequence with probability proportional to the

likelihood ratio of the position given the current model Θ. Each iteration of the

Gibbs sampler consists of applying the aforementioned two-step procedure once to

each of the input sequences.

7The model Θ is inferred from the starting positions by the rule Θij =cij+bi

N−1+P

j bj,

where cij is the count of letter j in the i-th sequence of the alignment and bj is ana priori chosen pseudocount to avoid 0 probabilities.

81

−0.2 0 0.2 0.4 0.6 0.8 1 1.20

20

40

60

80

100

120

140

160

180

overlap−coverage

num

ber o

f dat

aset

s

WEGibbsgibbspvWGibbs

−0.2 0 0.2 0.4 0.6 0.8 1 1.20

20

40

60

80

100

120

140

160

180

overlap−accuracy

num

ber o

f dat

aset

s

WEGibbsgibbspvWGibbs

Figure 4.3: Performance of Gibbs samplers

The histogram here shows the number of datasets as a function of the overlapscore for the COMBO3 experiment and the various Gibbs samplers in Table 4.4.

82

Gibbspv cycles through a user specified number of iterations (default is -C=2)

at the end of which five E-values are computed corresponding to the following five

alignments8:

• The alignment of the currently chosen sites (width w)

• The alignments of first/last w − 1 columns of the currently chosen sites

• The alignment generated by adding the column to the right/left of the cur-

rently chosen sites (width w + 1)

The algorithm then chooses the alignment with the best (smallest) E-values and

continues as before. As in the original Gibbs sampler, if no improvement to the

entropy score is detected in a specified number of iterations (-L) the program starts

a new run.

Table 4.4 and Figure 4.3 confirm that by incorporating the E-values into its

sampling strategy Gibbspv is better at detecting the implanted motifs in our ex-

periments. In addition, as can be seen from the results in Table 4.5, Gibbspv can

be substantially better than existing algorithms such as GLAM [Frith et al., 2004]

and MEME [Bailey and Elkan, 1994] for finding motifs with unknown width.

4.5 Conclusion

In this chapter we have demonstrated the utility of E-value calculations for design-

ing better motif finders. For this purpose, memo-sFFT can serve as an accurate tool

for efficiently computing a large number of E-values. In particular, for finding mo-

tifs of unknown width, the memo-sFFT based gibbs sampler, Gibbspv can outper-

8Subject to the condition that the considered alignment is well defined andwithin the specified range of widths.

83

Table 4.5: Comparison of Gibbspv with MEME and GLAM

Experiment Finders TPs Cov Acc

Gibbspv 129 124 123

COMBO4 GLAM 77 73 72

MEME 33 21 23

Gibbspv 125 116 116

GAP2 GLAM 88 82 81

MEME 61 51 49

The comments following Table 4.3 apply here as well.

form several existing motif finders. Exploring the use of E-values for designing bet-

ter motif finder for other motif models (such as ZOOPS [Bailey and Elkan, 1995])

can be a fruitful avenue for future research.

4.6 Methods

To test the various motif finders we constructed synthetic datasets with implanted

motifs as follows: independent sequences with the specified lengths were sampled

by choosing symbols at random from the four letter DNA alphabet according to

a uniform, independent background frequency. A position was chosen uniformly

at random from each sequence and an instance of a given profile Θ, generated

as described below, was inserted in that position. The profiles used (see Table

4.6) are represented as a position weight matrix, a 4× w array of numbers where

Θij denotes the frequency of letter i in column j in all aligned instances of Θ.

Since we wanted to have control over the implanted motifs the instances were

84

essentially generated by permuting the columns of the alignment. Each column of

the alignment matched the corresponding column of the profile up to discretizing

effects.

For each of the experiments that we conducted, 200 datasets were generated for

a given profile. The various motif finders were then run with parameter settings

that allowed them to take from 9-10 minutes, to place them on an equal footing.

Note that we were unable to do this for MEME as it does not employ any pa-

rameters that allow the control of running time. In all the experiments, MEME

ran for much less than 9 minutes. This factor should be taken into account when

judging the generally poor performance of MEME compared to the other motif

finders. The details for the experiments that we conducted can be found in Table

4.8. Also, the various profiles used are shown in Table 4.6.

In general, the length of the sequences and the implanted profiles were cho-

sen such that the motif finders we considered would have a non-trivial percentage

of failures (i.e. datasets where they pick motifs with no overlap with the im-

plants). These hard motif finding problems provide good test sets for discrim-

inating between the various motif finders. Finally, an estimate of overlap for

each data set and for each motif finder was computed in the following manner:

Let an be the position of the implanted motif instance in the nth sequence, let

an be the position of the motif reported by a motif finder and let w and w be

the respective widths of the motifs. Then we define the following overlap scores:

overlap-coverage = overlap-x(a, a, w), overlap-accuracy = overlap-x(a, a, w) and

overlap = min{overlap-coverage, overlap-accuracy} where

overlap-x(a, a, x) = max|i|<x

2

{x− |i|x· | {n : an = an + i} |

N

}(4.1)

and N is the number of sequences in the dataset. To report a significant overlap

85

between the implanted and the reported motif (true positives or TPs) we used

a threshold of 0.1 for the overlap score. Also, for overlap-coverage and overlap-

accuracy (corresponding to the columns “Cov” and “Acc” in Tables 4.3, 4.4 and

4.5) we used a threshold of 0.3.

86

Table 4.6: The profiles used in our experiments

COMBO FIFTY GAP

A C G T A C G T A C G T

1 0.95 0.00 0.00 0.05 0.50 0.00 0.00 0.50 0.70 0.10 0.10 0.10

2 0.00 0.50 0.50 0.00 0.00 0.50 0.50 0.00 0.00 0.70 0.30 0.00

3 0.70 0.10 0.10 0.10 0.50 0.50 0.00 0.00 0.10 0.00 0.90 0.00

4 0.00 0.70 0.30 0.00 0.50 0.00 0.50 0.00 0.10 0.10 0.10 0.70

5 0.50 0.00 0.00 0.50 0.50 0.50 0.00 0.00 0.00 0.70 0.00 0.30

6 0.25 0.25 0.25 0.25 0.00 0.50 0.50 0.00 0.30 0.20 0.30 0.20

7 0.95 0.00 0.00 0.05 0.00 0.50 0.00 0.50 0.25 0.25 0.20 0.30

8 0.25 0.25 0.25 0.25 0.00 0.50 0.00 0.50 0.00 0.50 0.50 0.00

9 0.70 0.10 0.10 0.10 0.50 0.00 0.50 0.00 0.10 0.10 0.70 0.10

10 0.00 0.50 0.00 0.50 0.00 0.50 0.50 0.00 0.00 0.70 0.30 0.00

11 0.00 0.70 0.00 0.30 0.50 0.50 0.00 0.00 0.10 0.10 0.10 0.70

12 0.70 0.10 0.10 0.10 0.00 0.50 0.50 0.00 0.00 0.90 0.10 0.00

13 0.00 0.50 0.50 0.00 0.00 0.50 0.00 0.50 0.30 0.00 0.70 0.00

87

Table 4.7: The parameter sets used in our experiments

Parameter Set Finder Parameters

SINGLE-WIDTH CONSENSUS -L 13 -c0 -q 4000

Conspv 13 6000

WCONSENSUS -c0 -q 200

WECons -c0 -q 120

Conspv 4000

MULTI-WIDTH WGibbs -d -n -t80 -L150

WEGibbs -d -n -t80 -L150

Gibbspv -t350 -L400

GLAM -1 -n10000 -r55 -z -a9 -b17

OTHER MEME -mod oops -nmotifs 1 -dna -minw 9

-maxw 17 -text -maxsize 1000000

Gibbspv -t250 -L400

For the MULTI-WIDTH and OTHER tests the motif finders were used to search formotifs with widths in the range [9, 17].

88

Table 4.8: Experiment details

Experiment Profile Parameter Set Sequences

COMBO1 COMBO SINGLE-WIDTH 20 of length 500 & 20 of length 2500

COMBO2 COMBO SINGLE-WIDTH 20 of length 1000 & 20 of length 2000

FIFTY1 FIFTY SINGLE-WIDTH 20 of length 500 & 20 of length 2500

FIFTY2 FIFTY SINGLE-WIDTH 20 of length 1000 & 20 of length 2000

COMBO3 COMBO MULTI-WIDTH 30 of length 1000

GAP1 GAP MULTI-WIDTH 30 of length 1000

COMBO4 COMBO OTHER 40 of length 1500

GAP2 GAP OTHER 40 of length 1500

See Table 4.6 and 4.7 for details about the profiles and parameter sets used.

89

BIBLIOGRAPHY

[Bailey and Elkan, 1994] Bailey,T. and Elkan,C. (1994) Fitting a mixture modelby expectation maximization to discover motifs in biopolymers. In Proceedingsof the Second International Conference on Intelligent Systems for MolecularBiology pp. 28–36, Menlo Park, California.

[Bailey and Elkan, 1995] Bailey,T. and Elkan,C. (1995) The value of prior knowl-edge in discovering motifs with meme. In Proceedings of the Third InternationalConference on Intelligent Systems for Molecular Biology pp. 21–29 AAAI Press,Menlo Park, California.

[Frith et al., 2004] Frith,M.C., Hansen,U., Spouge,J.L. and Weng,Z. (2004) Find-ing functional sequence elements by multiple local alignment. Nucleic Acids Res,32 (1), 189–200.

[Hertz and Stormo, 1999] Hertz,G. and Stormo,G. (1999) Identifying DNA andprotein patterns with statistically significant alignments of multiple sequences.Bioinformatics, 15 (7-8), 563–77.

[Hughes et al., 2000] Hughes,J., Estep,P., Tavazoie,S. and Church,G. (2000) Com-putational identification of cis-regulatory elements associated with groups offunctionally related genes in Saccharomyces cerevisiae. J Mol Biol, 296 (5),1205–14.

[Keich, 2005] Keich,U. (2005) Efficiently computing the p-value of the entropyscore. J Comput Biol, 12 (4).

[Lawrence et al., 1993] Lawrence,C., Altschul,S., Boguski,M., Liu,J., Neuwald,A.and Wootton,J. (1993) Detecting subtle sequence signals: a Gibbs samplingstrategy for multiple alignment. Science, 262 (5131), 208–14.

[Nagarajan et al., 2005] Nagarajan,N., Jones,N. and Keich,U. (2005) Computingthe P-value of the information content from an alignment of multiple sequences.Bioinformatics, 21 Suppl 1 (ISMB 2005), i311–i318.

[Ng and Keich, 2006] Ng, P., and Keich, U. (2006) Personal Communication.

[Neuwald et al., 1995] Neuwald,A., Liu,J. and Lawrence,C. (1995) Gibbs motifsampling: detection of bacterial outer membrane protein repeats. Protein Sci,4 (8), 1618–32.

[Stormo, 2000] Stormo,G. (2000) DNA binding sites: representation and discovery.Bioinformatics, 16 (1), 16–23.

90

[Tompa et al., 2005] Tompa,M. et al. (2005) Assessing computational tools for thediscovery of transcription factor binding sites. Nat Biotechnol, 23 (1), 137–44.

CHAPTER 5

SEQUENCE-BASED DOMAIN PREDICTION

5.1 Background

One of the first steps in analyzing proteins is to detect the constituent domains or

the domain structure of the protein. A domain is considered as the fundamen-

tal unit of protein structure, folding, function, evolution and design [Rose 1979,

Lesk & Rose 1981, Holm & Sander 1994]. It combines several secondary structure

elements and motifs, not necessarily contiguous, which are packed in a compact

globular structure. It is commonly believed that a domain can fold independently

into a stable three dimensional structure and that it has a specific function. A pro-

tein may be comprised of a single domain or several different domains, or several

copies of the same domain. It is the domain structure of a protein that determines

its function, the biological pathways in which it is involved and the molecules it

interacts with.

Detecting the domain structure of a protein is a challenging problem. Given

the protein sequence there are no clear signals or signs that indicate when one

domain ends and another begins. Structural information can help in detect-

ing the domain structure of a protein. Domain delineation based on structure

is currently best done manually by experts and the SCOP domain classification

[Murzin et al. 1995], which is based on extensive expert knowledge, is an excellent

example. However, structural information is available for only a small portion of

the protein space. Therefore, there is a strong interest in detecting the domain

structure of a protein directly from the sequence.

In our study we define a domain to be a continuous sequence that corre-

91

92

sponds to an elemental building block of protein folds - a subsequence that is

likely to be stable as an independent folding unit. As such we believe that this

building block was first formed as an independent protein with a specific acquired

function. In the course of evolution, the domain might have been combined with

additional domains to perform other, possibly more complex, functions. However,

if the domain indeed existed at some point as an independent unit then it is likely

that traces of the autonomous unit might exist in other database sequences, pos-

sibly in lower organisms. Thus a database search can sometimes provide us with

ample information on the domain structure of a protein. For example, the his-

togram and profile of sequence matches one can obtain from a database search

may help to detect domain boundaries [Yona & Levitt 2000b, Kuroda et al. 2000,

George & Heringa 2002]. However, one should be cautious in analysing database

matches in search for such signals. One possible difficulty arises from the fact that

pairs of sequence domains may appear in many related sequences, thus hinder-

ing the ability to discern the two apart. Furthermore, mutations, insertions and

deletions blur domain boundaries and make it hard to distinguish a signal from

background noise.

5.1.1 Related studies

Previous methods for sequence-based domain detection could be roughly classified

into five categories: (i) Methods based on the use of similarity searches and knowl-

edge of sequence endpoints to delineate domain boundaries using heuristics. Meth-

ods like MKDOM [Gouzy et al. 1999], Domainer [Sonnhammer & Kahn 1994], DI-

VCLUS [Park & Teichmann 1998] and DOMO [Gracy & Argos 1998] fall in this

category. These methods were designed to partition all the proteins in a database

93

into domains but they are in general less accurate due to their heuristic na-

ture. (ii) Methods that rely on expert knowledge of protein families to con-

struct models like HMMs and Artificial Neural Networks to identify other mem-

bers of the family. Some of the methods that fall in this category include PFam A

[Sonnhammer et al. 1997, Bateman et al. 1999], Murvai et al [Murvai et al. 2001],

TigrFam [Haft et al. 2001] and SMART [Ponting et al. 1999]. These methods are

considerably more accurate but are restricted by their ability to make predictions

only for well studied families. (iii) Methods that try to infer domain boundaries

by using sequence information to predict tertiary structure first. SnapDragon

[George & Heringa 2002] and Rigden’s covariance analysis [Rigden 2002] are ex-

amples of this approach. These methods use novel sources of information but

are computationally expensive. (iv) Methods that use multiple alignments to

predict domain boundaries such as PASS [Kuroda et al. 2000] and Domination

[George & Heringa 2002]. (v) Other methods, that do not fall into any of the pre-

vious categories (clustering sequence alignments [Guan & Du 1998], Miyazaki et

al [Miyazaki et al. 2002] and domain guess by size [Wheelan et al. 2000]). A more

detailed description of the five categories follows.

5.1.1.1 Methods based on similarity search

Of the similarity search based algorithms, MKDOM is conceptually the simplest

and most efficient and is currently employed in the generation of the ProDom

database. The algorithm works on the assumption that the smallest repeat-free

sequence fragment in a database is likely to correspond to a single domain (all

fragments smaller than a threshold are automatically removed from the database.)

Significant matches with the fragment are extracted from all sequences in the

94

database and the process is repeated on the new database until no more fragments

remain. The Domainer algorithm works by doing an all-vs-all blast search to

identify segment pairs with high degree of homology. These segment pairs are then

iteratively merged based on overlap measures to form Homologous Segment Sets

(HSSs) and links are maintained between HSSs that have fragments that follow

each other sequentially in a protein sequence. The resulting HSS graph is then

partitioned into domains (sets of HSSs) using sequence endpoints and information

about cycles in the graph as domain transition signals. The DIVCLUS program

starts with an all-vs-all search as well but it uses SSEARCH or FASTA to get

gapped alignments. The resulting pairs are then clustered using single linkage

clustering. Finally, DIVCLUS attempts to split the clusters into smaller clusters

using various measures of overlap between sequences in combination with some

thresholds (for example overlap of at least 30 amino acids that covers at least 70%

of the shorter of the two sequences.) The DOMO algorithm clusters sequences into

groups by comparing their amino acid and dipeptide composition. Each cluster

is represented by one sequence and the representatives are compiled into a suffix

tree. This tree is self-compared to detect ungapped local sequence similarities.

The resulting pairs form the seed anchors which are intersected with other anchors

based on either the presence of a significantly overlapping common subsequence

or common position relative to another anchor. The anchor merging process is

accompanied by a controlled interval intersection process which finally determines

the domain boundaries for the proteins.

95

5.1.1.2 Methods based on expert knowledge

The PFam database [Bateman et al. 1999] combines manual and automatic ap-

proaches to classify proteins into domain families. The database is split into two

parts, PFam A, that is composed of families generated from high quality multi-

ple alignments and verified using structural and functional information with sub-

stantial manual involvement and PFam B that is generated using the Domainer

algorithm on the rest of the sequence database. No specific rules are used to de-

fine domain boundaries other than the judgment of human experts and structural

information (when available) from SCOP about the domain structure of proteins.

The SMART classification is similar to PFam A in that it is based on HMMs

constructed from manually-checked, high-quality multiple alignments with the dif-

ference being that SMART focuses on domains occurring in signaling proteins.

The TigrFam database is constructed using the same methodology as in PFam A

and SMART but is geared towards the identification of functionally similar subse-

quences rather than domains. Instead of using HMMs to learn models for domain

families the work by Murvai et al is based on the use of artificial neural networks

for this purpose. The data used to construct the models is in the form of statistics

gathered from BLAST comparisons with members and non-members of the various

domain families.

5.1.1.3 Methods that use predicted 3D information

Recent studies on sequence based domain delineation have also explored other

sources of information to detect domain boundaries. The SnapDragon method

works by first generating many ab-initio 3D model structures of a protein, using

the hydrophobicity information in multiple alignments and predicted secondary

96

structure information in Monte-Carlo folding simulations. Domain boundaries for

each of these 3D models are then computed based on structural considerations as

described in [Taylor 1999] and finally the consistency between the definitions for

the various models is used to partition the protein into domains. Rigden’s paper on

covariance analysis uses information from the calculation of correlated mutation

values for alignment columns to predict contacts in a protein. The predicted

contact information is then used to construct a contact profile where local minimas

in the profile are used to predict domain boundaries.

5.1.1.4 Methods based on multiple alignments

Domination and Pass are multiple alignment based algorithms. Domination is an

iterative algorithm that uses PSI-Blast to do a database search and generate an

initial pairwise alignment based multiple alignment. The distribution of N and C

termini in the alignment are then used to identify potential domains. The putative

domains are possibly merged if there is high correlation between the participating

sequences and then used to generate true multiple alignments. Profiles based on

these alignments are used with PSI-BLAST for the next round of database search

and this process is iterated to convergence to get domain definitions. Pass uses

profiles of sequence counts to locate positions where there is a substantial change

in sequence participation. These positions are then paired up to define domains.

5.1.1.5 Other methods

CSA (Clustering Sequence Alignments) represents sequences as 0-1 vectors based

on whether or not they are similar to the sequences in the databases. The sequences

are then clustered by constructing an MST on the all-vs-all graph. This method

97

does not give explicit domain definitions but may indicate possible domain families.

In the work by Miyazaki et al, the amino acid composition of the protein sequence

for a window of positions is used as input to train a neural network to detect

linker sequences in proteins. The DGS system uses domain size distribution and

architecture of previously characterized proteins to make the most likely guess for

a protein based solely on the length of the protein.

5.1.2 The current status

5.1.2.1 Methodology

Despite the large number of studies, the task of constructing an accurate and

efficient general-purpose domain detection system that works solely on sequence

information is still an open problem. While methods like SMART and TigrFam

are accurate, they require careful manual inspection and provide predictions for a

small subset of the sequence database. On the other side of the spectrum, methods

like DOMO and ProDom are fully automatic and give predictions for nearly all

proteins in the sequence database, but are less accurate. In this chapter we suggest

a novel approach that incorporates many of the salient features of earlier systems

into a probabilistic framework that is extensible and is based on rigorous analysis

of information sources in order to predict domain boundaries with high accuracy

and coverage.

5.1.2.2 Evaluation

There is no fixed, universally accepted set of rules for partitioning a protein into

its constituent domains. Therefore it is hard to assess the quality of domain pre-

dictions by any of the above algorithms. In the absence of a common framework

98

for analyzing the quality of domain predictions, the various works that we have

mentioned above have relied on a variety of qualitative and quantitative evalu-

ation criteria, external resources and manual analysis to verify domain bound-

aries and study the capabilities of their systems. For example, the quality of

domain predictions in DOMO is analyzed by taking domain annotations in PIR

[George et al. 1996] and SwissProt [Bairoch & Apweiler 1999] as being the stan-

dards of truth and by comparing the predictions to ProDom predictions. However,

their analysis is based only on a few selected examples. Others, such as Domination

and Rigden’s covariance analysis, run a more extensive evaluation based on com-

parisons with structure-based domain definitions as in SCOP [Hubbard et al. 1999]

but they did not evaluate the capabilities of other methods with this setup.

The diversity of evaluation criteria has made it impossible to objectively com-

pare the various methods for domain prediction. Here we propose and use a com-

mon framework to evaluate the various methods. This framework is based on using

definitions from the SCOP database and as a more rigorous subset, its intersec-

tion with the CATH database [Orengo et al. 1997] as the standard of truth. In

addition we devise scores that can be used in a uniform and unbiased fashion to

evaluate the accuracy and coverage of the various methods.

This chapter is organized as follows. We first describe the data set, scores

and our learning methodology in detail. We then present the results of testing

our method on a large collection of proteins with known structures and compare

our predictions to structure based domain definitions as well as to other sequence

based domain partitioning methods. We conclude with a few examples where our

predicted domains seem to suggest a plausible alternative to manual classification.

99

5.2 Methods

Given a query sequence, our algorithm starts by searching a large sequence database

and generating a multiple alignment of all significant hits. The columns of the mul-

tiple alignment are analyzed using a variety of sources to define scores that reflect

the domain-information-content of alignment columns. Information theory based

principles are employed to maximize the information content. These scores are then

combined using a neural network to label single columns as core-domain or bound-

ary positions with high accuracy. The output of the artificial neural network is

then post-processed to smooth and refine predictions while considering local infor-

mation from multiple columns. Finally, we introduce the domain-generator model

that uses global information about the distribution of domain sizes and sequence

divergence to test multiple hypotheses, filter out positions that are incorrectly pre-

dicted as boundary positions and output the most likely partition. An overview

of our method is depicted in Figure 5.1. We now turn to describe our method in

detail.

5.2.1 The data sets

5.2.1.1 The query data set

In the absence of general rules or principles that define domain boundaries, one

must rely on existing knowledge of protein domains to devise a reliable and ac-

curate methods for automatic domain detection. This knowledge, in the form of

complete protein chains and their partition into individual domains, can be used

to both train and test our method. One of the most extensive collections of pro-

tein domains is the one provided by the SCOP classification of protein structures

100

Multiple AlignmentSequence Termination

Correlation

Contact Profile

Entropy

Secondary Structure

Physio−Chemical Properties

Neural Network

11111111111111011111111010001111110100100000001000011000111111111111111

post−processing

Final Predictions

hypothesis evaluation (domain generator model)

Putative Predictions

Exon Boundaries

Seed Sequenceblast search

blast searchIntron Exon

Protein Data

DNA Data

Figure 5.1: Overview of our domain prediction system

101

[Hubbard et al. 1999]. This classification has a complicated hierarchy with 7 fold

classes, several hundred folds and more than one thousand protein families. It

is built by the careful manual curation of Dr. Alexei Murzin. The domains in

this database are defined from PDB records [Westbrook et al. 2002]. Each PDB

structure is manually partitioned into the component domains, based on their

compactness, the contact area with other parts of the protein and resemblance to

existing domains and then classified into families, superfamilies, folds and classes.

To train and test our method we selected complete protein chains from PDB,

searched the database and generated multiple alignments. About half of these

alignments with their corresponding domain structure as defined by SCOP were

used for training. The other half was used for testing.

Our initial dataset was the set of protein sequences in the PDB database as of

May 2002 with 35,184 protein chains, and 11,969 non-identical sequence entries.

All sequences shorter than 40 amino acids and fragments of longer sequences were

eliminated leaving 11294 sequences. Of sequences that are more than 95% identical

only a single representative was retained, yielding a total of 4,810 valid queries.

5.2.1.2 Alignments

Each one of the 4810 queries was searched against a composite non-redundant

database that contains 933,075 unique sequence entries. The database is composed

from 96 different databases among which are SwissProt, TrEMBL, PIR, PDB, DBJ,

GenBank, REF, PATAA, PRF and the complete genomes of 78 organisms. All en-

tries that are documented as fragments (according to at least one source database)

were eliminated, leaving a total of 693,912 non-fragmented entries. The alignment

was created in two phases. First, the query was searched against the non-redundant

102

database using BLAST [Altschul et al. 1997] and the related sequences were com-

piled into a database (a different database for each query sequence). In the second

phase, the query was searched against this smaller database, using PSI-BLAST

[Altschul et al. 1997] until convergence. Of these alignments, fragmented queries

were eliminated and only alignments with more than 20 hits were kept. Finally,

the query sequences were grouped into clusters (using the ProtoMap clustering

algorithm [Yona et al. 1999] with a conservative E-value threshold) and from each

group only one representative was selected (the one with the maximal number

of database aligned sequences). The final set of queries consisted of 3,140 PDB

sequences, with their corresponding alignments. Alignments are represented as a

sequence of alignment columns with each one being associated with one position

in the seed sequence (insertions with respect to the seed sequence are processed as

described in Section 5.2.2.3).

It is important to note that we did not try to refine the alignments by applying

other multiple alignment algorithms. Our goal was to develop a tool that can take

the output from a database search and immediately partition the query sequence

into domains, based on this information, while tolerating noise and misaligned

regions. However, an application of more sophisticated alignment algorithms can

help in refining the alignment and improving the quality of the predictions.

5.2.1.3 Domain definitions

The domain definitions were retrieved from the SCOP database, version 1.57 as

of May 2002. Of the 11969 unique entries in PDB, 9479 are listed in SCOP. After

removing inconsistent entries (identical chains with different domain definitions or

inconsistent lengths) we were left with 9185 entries. Of the 3,140 PDB queries,

103

IDELIQVMFTQQGVKLKKFGHFGLVMTKVVRWRVV

SCOP Domains

Boundary PositionsDomain Positions Domain Positions

x x

Figure 5.2: Domain and boundary positions

3,039 were documented in this list, with the number of domains ranging from 1 to

7. In a final pruning step, protein chains that are less than 90% covered by SCOP

domains are eliminated. In the final data set we retained all of the 605 multi-

domain proteins and 576 single domain proteins (one-fourth of all single domain

proteins) to ensure an equal representation of both.

For each protein chain we defined the domain positions to be the positions

that are at least x residues apart from a domain boundary. Domain boundaries

are obtained from SCOP definitions where for a SCOP definition of the form

(start1, end1)..(startn, endn) the domain boundaries are set to (endi + starti+1)/2

as in Figure 5.2. All positions that are within x residues from domain boundaries

are considered boundary positions. This process allows us to classify all the

positions in the proteins being considered as domain or boundary positions.

5.2.2 The domain-information of an alignment column

To quantify the likelihood that a sequence position is part of a domain, or at the

boundary of a domain we defined several measures based on the multiple alignment

that we believe reflect structural properties of proteins and would therefore be

informative of the domain structure of the seed protein. While some of these

measures are more directly related to structural properties than others, none of

104

these measures actually rely on structural information, as our goal was to devise

a novel technique that can suggest domain delineation from sequence information

alone.

5.2.2.1 Conservation measures

Multiple alignments of protein families can expose the core positions along the

backbone that are crucial to stabilize the protein structure, or play an important

functional role (as in the active site or in an interaction site). These positions

tend to be more conserved than others and strongly favor amino acids with similar

and very specific physio-chemical properties, because of structural and functional

constraints.

Amino acid entropy: One possible measure of the conservation of an alignment

column is given by the entropy of the corresponding distribution (Figure 5.3).

For a given probability distribution P over the set A of the 20 amino acids P =

(p1, p2, . . . , p20)t, the entropy is defined as

Ea(P) = −20∑

i=1

pi log2 pi

This is a measure of the disorder or uncertainty we have about the type of amino

acid in each position. In information theory terms, the entropy is the average

number of bits needed to encode an arbitrary member of A. For a given alignment

column, the probability distribution P is defined from the empirical counts, after

adding pseudo counts as described in [Henikoff & Henikoff 1996].

Class entropy: Quite frequently one may observe positions in protein families

that have a preference for a class of amino acids, all of which have similar physio-

chemical properties. The amino acid entropy measure is not effective in such cases

since it ignores amino acid similarities. An entropy measure based on suitably

105

Low Entropy High Entropy

Figure 5.3: Consistency measures

defined classes may capture positions with subtle preferences towards classes of

amino acids. We tried two different classifications that are motivated by different

considerations. The first classification was adopted from [Ferran et al. 1994] and

is based on clustering residues according to similarity scores from a statistical

score matrix. The classes that are define are hydrophobic (MILV), hydrophobic

aromatic (FWY), neutral and weakly hydrophobic (PAGST), hydrophilic acidic

(NQED), hydrophilic basic (KRH) and cysteine (C). The second classification is

basically an attempt to group the amino acids into small chemically similar groups

(Linda Nicholson, personal communication). The classes obtained as a result are

sulfur (CM), simple aliphatic (AL), side-chain restrictive aliphatic (IV), aromatic

(FWY), hydroxyl (ST), amide (NQ), acidic (ED), basic (KRH), proline (P) and

glycine (G). This classification worked better than the first and therefore was

chosen as the underlying classification for our class entropy measure.

Given the set C of amino acid classes and the empirical probabilities (with

pseudo counts) P the class entropy is defined in a similar way to the amino acid

entropy

Ec(P) = −∑

i∈Cpi log2 pi

Evolutionary pressure: The class entropy measure is one possible solution to the

106

aforementioned problem. However, it does not utilize all the prior information we

have about amino acid similarities. A better entropy measure would consider the

mutual information (similarity) of the amino acids. To the best of our knowledge,

this problem has never been addressed directly before. A possible extension may

generalize upon the results of Csiszr [Csiszr]. Alternatively, we suggest the use

of a measure that estimates the evolutionary pressure in an alignment column by

calculating the evolutionary span, approximated by the sum of pairwise similarities

of amino acids in a column. Specifically, if the number of sequences participating

in an alignment column k is n then the span of this column is defined as

Span(k) =2

n(n− 1)

n∑

i=1

j<i

s(aik, ajk)

where aik is the amino acid in position k of sequence i and s(a, b) is the similarity

score of amino acids a and b according to a scoring matrix such as BLOSUM50

[Henikoff & Henikoff 1992].

5.2.2.2 Consistency and correlation measures

Since protein domains are believed to be stable building blocks of protein folds, it

is reasonable to assume that all appearances of a domain in database sequences

will maintain the domain’s integrity. However, domains may be coupled with

other domains and therefore a simple pairwise sequence alignment (or multiple

pairwise alignments) will not be informative. Integrating the information from

multiple sequences can generate a strong signal, indicative of domain boundaries

by detecting changes in sequence participation and evolutionary divergence. We

tested several different measures. These measures quantify the correlation and

consistency of neighboring columns in an alignment.

107

High Correlation Low Correlation

Figure 5.4: Correlation measures

Consistency: This simple coarse-grained measure is based on sequence counts.

The measure is defined as the difference in the number of sequences in a column

and the average of the surrounding columns in a window of size w. If ck is the

sequence count in position k then

Consistency(k) = |ck −1

2w

i6=k,|i−k|≤wci|

Asymmetric correlation: This is a more refined measure that considers the

consistency of individual sequences and sums their contributions. To measure

the correlation of two columns we first transform each alignment column into a

binary vector of dimension n (the number of sequences in the alignment) with 1’s

signifying aligned residues and 0’s for gaps. Given two binary vectors ~u and ~v their

asymmetric1 correlation (bitwise AND) is defined as

Corra(~u,~v) =< ~u,~v >=

n∑

i=1

ui · vi

High correlation values reflect consistent sequence participation while low correla-

tion values signal a region of ambiguous sequence participation and possible domain

boundaries (see Figure 5.4).

1Note that this measure is asymmetric in how it deals with gaps and residues.

108

Symmetric correlation: the asymmetric correlation measure does not reward for

sequences that are missing from both positions. However, these may reinforce a

weak signal based only on participating sequences. The symmetric correlation mea-

sure corrects this by using bitwise XNOR when comparing two alignment columns,

i.e.

Corrs(~u,~v) =

n∑

i=1

δ(ui, vi)

where δ is the delta function δ(x, y) = 1 ⇐⇒ x = y

To enhance the signal and smooth random fluctuations the contributions of

all positions in a local neighborhood around a sequence position are added, and

all correlation measures for an alignment column are calculated as the average

correlation over a window of size w centered at the column (the parameter w is

optimized, as described in Section 5.2.4).

Sequence termination: sequence termination is a strong signal of a domain

boundary. However, in a multiple alignment it is not necessarily indicative of a

true sequence termination. Although we eliminated all sequences that are doc-

umented as fragments from our database, the sequence may still be a fragment

of a longer sequence without being documented as such. Moreover, the termi-

nation may be premature as end loops are often loosely constrained and tend to

diverge more than core domain positions. These diverged subsequences may be

omitted from the alignment if they decrease the overall similarity score. Therefore

the sequence termination signal may be misleading if used simple-mindedly. To

reduce the sensitivity to sparse signals due to the aforementioned problems with

sequence termination, we consider all participating sequences in a position with

their E-values (that indirectly indicate alignment’s reliability). For every position

we calculate right and left termination scores, based on sequences that terminate

109

and originate from that position respectively, by taking the sum of the log of the

corresponding E-values. For example if an alignment position has n sequences, of

which c terminate at that position and the E-values of the corresponding align-

ments are e1, e2, . . . , ec then the left termination score is defined as

Eleft termination = log(e1 · e2 · · · · · ec)

Finally the left and right termination scores are smoothed over a window and

then combined through multiplication (joint termination) and addition (combined

termination) to get two different sequence termination based scores (our experi-

ments showed that these scores did better than the use of left and right termination

scores for neural network training).
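A hedged sketch of these termination scores in Python follows (the E-value lists and the smoothing window are toy values; in the system the window is optimized as described in Section 5.2.4):

    import numpy as np

    def termination_score(e_values):
        # Sum of log E-values of the alignments that terminate (or originate)
        # at this position; many significant (small) E-values give a strongly
        # negative score, i.e. a strong termination signal.
        return float(np.sum(np.log(e_values))) if len(e_values) else 0.0

    def smooth(scores, w):
        # Average over a window of size w centered at each position.
        s = np.asarray(scores, dtype=float)
        return np.array([s[max(0, k - w):k + w + 1].mean() for k in range(len(s))])

    # Toy per-position E-value lists for sequences ending (left) / starting (right).
    left_ev  = [[], [1e-30, 1e-12], [], [1e-5], []]
    right_ev = [[], [], [1e-20], [], [1e-8, 1e-3]]
    left  = smooth([termination_score(e) for e in left_ev], w=1)
    right = smooth([termination_score(e) for e in right_ev], w=1)
    joint    = left * right    # combined through multiplication
    combined = left + right    # combined through addition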

5.2.2.3 Measures of structural flexibility

Regions of substantial structural flexibility in a protein often correspond to domain

boundaries where the structure is usually exposed and less constrained. We define

two different measures that may help us quantify this aspect.

Indel entropy: In a multiple alignment of related sequences, positions with

indels with respect to the seed sequence indicate regions where there is a certain

level of structural flexibility. The larger the number of insertions and the more

prominent the variability in the indel length at a position the more flexible we

would expect the structure to be in that region. We define the indel entropy based

on the distribution of indel lengths as

$$E_g(P) = -\sum_i p_i \log_2 p_i$$

where $p_i$ is the observed fraction of indels of length $i$ at that position.
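For concreteness, a small Python sketch of the indel entropy under this reading (the indel-length list is a toy example):

    import numpy as np
    from collections import Counter

    def indel_entropy(indel_lengths):
        # Shannon entropy (bits) of the empirical distribution of indel lengths
        # observed at one alignment position.
        counts = np.array(list(Counter(indel_lengths).values()), dtype=float)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    print(indel_entropy([1, 1, 2, 3, 3, 3]))  # mixed lengths -> higher entropy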


Correlated mutations: Another source of information about the structural flex-

ibility of a position can be obtained from the profile of predicted contacts in a pro-

tein. For each sequence position we count the number of pairwise contacts between

residues that reside on opposite sides of that position (see also [Rigden 2002]). Minima in the profile correspond to regions where fewer interactions occur across these

sequence positions, implying relatively higher structural flexibility and suggesting

a domain boundary.

Contacts between residues in a protein are usually predicted based on correlated

mutations. The correlated mutation score between two columns is defined as in

[Pazos et al. 1997]. Specifically, the correlation coefficient for two positions k and

l is defined as

$$\mathrm{Corr}_m(k, l) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{(s(a_{ik}, a_{jk}) - \langle s_k \rangle)(s(a_{il}, a_{jl}) - \langle s_l \rangle)}{\sigma_k \cdot \sigma_l}$$

where $a_{ik}$ is the amino acid in position k of sequence i and $s(a, b)$ is the similarity score of amino acids a and b according to the scoring matrix. The term $\langle s_k \rangle$ is the average similarity in position k and $\sigma_k$ is the standard deviation. Here n is

the number of sequences that participate in both columns.

To predict a contact based on a correlated mutation score one needs a reliable

statistical significance measure to discern true correlations from random coinci-

dental regularities. To assess the statistical significance of correlated mutation

scores we calculated the correlation score for a large collection of random align-

ment columns². Based on the distribution of the random scores we associate a z-score with each correlated mutation score. If the average correlated mutation score for random columns is $\mu$ and the standard deviation is $\sigma$, then the z-score of a correlated mutation score $r$ is defined as

$$\mathrm{zscore}(r) = \frac{r - \mu}{\sigma}$$

²Random columns are generated by choosing a root residue at random and mutating it according to transition probabilities, derived from the BLOSUM50 matrix, to generate the other residues in the column.

Figure 5.5: Predicted contact profile (contact profile z-score plotted against sequence position; lines mark domain boundaries)

We used the correlated mutation information to design two types of scores.

In the first case we considered correlated mutation values that were larger than

those in the random distribution as indicating contacts. The number of contacts

across every position is then normalized by the total number of possible contacts

to generate a contact profile. The other score was based on considering all the

values as contacts but weighting them by the z-score to get a weighted profile. An

example of a contact profile is given in Figure 5.5.

Beyond structural integrity, correlated mutations provide another source of

evidence for the domain structure of a protein from an evolutionary point of view.

Positions that are strongly correlated through evolution imply that the sequence

in between must have evolved in a coordinated manner as one piece. As such,

the sequence qualifies as a building block and it is less likely to observe a domain

boundary in between.


Calculating all correlated mutations is prohibitive for large alignments³. We

experimented with sampling of columns in an attempt to reduce the computation

time but noticed that the resulting profile can be qualitatively very inaccurate.

The sampling of rows, on the other hand, seems to have a marginal effect on the

correlated mutation calculations and so we imposed a limit of 100 sequences for

the columns, resorting to uniform sampling when columns contain more sequences.

5.2.2.4 Residue type based measures

Physico-chemical properties of proteins may also help in predicting domain bound-

aries since they tend to have different characteristics around domain transition

points than in domain core positions. For example, hydrophobic residues tend to

cluster inside domain cores with hydrophilic residues occupying more exposed loca-

tions in a protein structure and therefore more likely to be in inter-domain regions.

Similarly, certain amino acids such as cysteines and prolines are crucial in defin-

ing protein structure and therefore tend to occur in different frequencies in core

domain and inter-domain regions of a protein. The value of considering residue

composition in detecting domain boundaries is also demonstrated in the work done

by Miyazaki et al [Miyazaki et al. 2002]. In order to exploit these sources of infor-

mation we defined several measures: for hydrophobicity, molecular weight and for

the amino acids cysteine, valine, proline and glycine, all believed to be instrumen-

tal in defining protein structure. In addition we also used the Rasmol classification

of amino-acids to create a set of non-redundant classes that we use as measures

(acyclic [ARNDCEQGILKMSTV], aliphatic [AGILV], aromatic [HFWY], buried

³Calculating all-vs-all correlated mutations is an $O((mn)^2)$ task for an alignment of length m with n sequences. For a typical alignment of length 200 with 500 sequences this means on the order of $(200 \cdot 500)^2 = 10^{10}$ computations. This takes roughly three hours for our implementation on a Pentium III 1GHz machine.


[ACILMFWV], hydrophobic [AGILMFPWYV], large [REQHILKMFWY], nega-

tive [DE], positive [RHK] and small [AGS]). For each measure, the score of an

alignment column is defined as the average of all residue scores, where residue

scores are defined in the range 0 to 1. Hydrophobicity and molecular weight

residue scores were adopted from [Black & Mould 1991] and class scores were sim-

ply defined by the presence (score 1) or absence (score 0) of the residue in the

class.

5.2.2.5 Predicted secondary structure information

Protein structure is often studied at the level of secondary structure. Most inter-

domain regions are composed of loops while beta strands tend to form sheets that

constitute the core of protein domains. Alpha helices and beta sheets in proteins

are relatively rigid units and therefore domain boundaries rarely split these sec-

ondary structure elements. Indeed, in the study by [Sowdhamini & Blundell 1995]

a domain delineation algorithm was developed that was based on the clustering

of secondary structure units. This algorithm was applied to proteins of known

structure and used the available structural information to define the secondary

structure elements. However, useful information regarding the secondary struc-

ture of a protein can be obtained even when the structure is unknown. We used

the neural network based program PSIPRED [McGuffin et al. 2000] to predict the

secondary structure of the seed protein. The neural network confidence values in

the range 0-1 were then used as alpha helix (alpha), beta strand (beta) and coiled

region (coil) measures.


5.2.2.6 Intron-exon data

It is well known that the alternative splicing mechanism is used extensively in

higher organisms to generate multiple mRNA and protein products from the same

DNA strand. This mechanism raises an interesting combinatorial problem. By

sampling (and sometimes shuffling) the set of exons encoded in a DNA sequence,

the cell generates different proteins that share different numbers of exons.

Intron-exon data at the DNA level is believed to be correlated with domain

boundaries [Gilbert & Glynias 1993, Gilbert et al. 1997]. As building blocks, do-

mains are believed to have evolved independently. Therefore it is likely that each

domain has a well defined set of exons associated with it. If the product pro-

tein is a multi-domain protein we expect exon boundaries to coincide with domain

boundaries.

The intron-exon data was derived from the EID database [Saxonov et al. 2000].

Only genes that were experimentally determined (based on the header information)

were included in our analysis (a total of 25,130 sequences, and 21,042 entries af-

ter eliminating redundancy). Each seed sequence was compared with all the EID

sequences, and all significant ungapped matches were recorded. To quantify the

likelihood of an exon boundary we use an equation similar to the one for sequence termina-

tion. Specifically, if an alignment position has n sequences, of which c coincide with

exon boundaries and the E-values of the corresponding alignments are e1, e2, . . . , ec

then the exon termination score is defined as

$$E_{\text{exon}} = \log(e_1 \cdot e_2 \cdots e_c)$$


5.2.3 Score refinement and normalization

Two additional steps are executed before the measures are fed into the neural net-

work. First, they are smoothed to eliminate random local fluctuations and improve

the discrimination power of the measure. The scores are smoothed by calculat-

ing the average over a window of size w (the smoothing factor). This parameter

is optimized to maximize the separation between the two types of positions, as

described in the next section.

Second, they are normalized to a single scale. Since the different scores are measured in different units, a straightforward combination of scores may intro-

duce a strong bias towards one or a few of them. Moreover, one would like to

have comparable values for different proteins. Therefore a proper normalization

is essential. To scale all measures to the same units we transformed every score

to a z-score based on the distribution of scores along all alignment positions. The

normalization is invoked separately for each alignment. The z-score not only serves as a universal scale but also provides a measure of statistical significance for each position in the alignment, helping to locate atypical positions.

In the case of sequence termination based scores, the intron score and the

consistency score we found that the distribution of scores is far from normal making

the use of z-score normalization inappropriate. In such cases we used a threshold

and linear scaling to map scores to the range [0,10].
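A minimal sketch of the two normalization modes in Python (the clipping bounds for the linear scaling are illustrative; the dissertation does not specify them):

    import numpy as np

    def zscore_normalize(scores):
        # Normalize one measure over all positions of a single alignment.
        s = np.asarray(scores, dtype=float)
        sd = s.std()
        return (s - s.mean()) / sd if sd > 0 else np.zeros_like(s)

    def threshold_linear_scale(scores, lo, hi):
        # For far-from-normal scores (termination, intron, consistency):
        # clip to [lo, hi] and map linearly onto [0, 10].
        s = np.clip(np.asarray(scores, dtype=float), lo, hi)
        return 10.0 * (s - lo) / (hi - lo)

    print(zscore_normalize([1.0, 2.0, 3.0, 10.0]))
    print(threshold_linear_scale([-80.0, -20.0, 0.0], lo=-100.0, hi=0.0))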

5.2.4 Maximizing the information content of scores

To improve domain recognition, the distributions of domain positions and bound-

ary positions (according to each of the domain-information-content measures sug-

gested above) must be well separated. However, it is hardly ever the case that


Table 5.1: Jensen-Shannon (JS) divergence for top ten scores

                                λ = 0.5                  λ = CB ratio
    Score                   Smoothing  JS            Smoothing  JS
                            window     divergence    window     divergence
    Combined Termination        7        0.073          10        0.018
    Joint Termination           7        0.055          10        0.014
    Symmetric Correlation      10        0.055          10        0.014
    Proline                    10        0.048          10        0.011
    Mutation Profile            8        0.034           7        0.006
    Class Entropy              10        0.024          10        0.004
    Coil                       10        0.024          10        0.005
    Introns                    10        0.020           8        0.005
    Glycine                    10        0.015           7        0.003
    Small                       8        0.010           8        0.002

Divergence values are computed using λ = 0.5 (equal prior) and λ = core/boundary (CB) ratio. The JS divergence for identical distributions is 0.

the two distributions are completely disjoint and the parameters introduced before

(the boundary window size x and the smoothing factor w) may greatly affect the

separation of these distributions.

To define the best set of parameters we measured the statistical similarity of the

two probability distributions for different sets of parameters, and selected the one

that maximized separation. To measure statistical similarity we used the Jensen-

Shannon (JS) divergence between probability distributions [Lin 1991]. This is a variation of the KL divergence measure [Kullback 1959] that is both symmetric

and bounded (unlike the KL divergence). Formally, given two (empirical) proba-

bility distributions p and q, for every 0 ≤ λ ≤ 1, the λ-JS divergence is defined

as

$$D_{JS}^{\lambda}[p \| q] = \lambda D_{KL}[p \| r] + (1 - \lambda) D_{KL}[q \| r]$$

where $D_{KL}[p \| q] = \sum_i p_i \log_2 \frac{p_i}{q_i}$ is the KL divergence and $r = \lambda p + (1 - \lambda) q$ can be

considered as the most likely common source distribution of both distributions p

and q, with λ as a prior weight. The parameter λ reflects the a priori information.

In our case, the priors for in-domain positions p and boundary positions q differ

markedly and λ is set to the prior probability of in-domain positions. We call the

corresponding measure the divergence score and denote it by DJS. This mea-

sure is symmetric and ranges between 0 and 1, where the divergence for identical

distributions is 0.
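The divergence score is straightforward to compute; a sketch in Python (the two toy distributions are illustrative):

    import numpy as np

    def kl(p, r):
        # KL divergence in bits; assumes r > 0 wherever p > 0.
        mask = p > 0
        return float(np.sum(p[mask] * np.log2(p[mask] / r[mask])))

    def js_divergence(p, q, lam=0.5):
        # lambda-weighted Jensen-Shannon divergence; symmetric and bounded.
        p, q = np.asarray(p, float), np.asarray(q, float)
        r = lam * p + (1 - lam) * q
        return lam * kl(p, r) + (1 - lam) * kl(q, r)

    p = np.array([0.7, 0.2, 0.1])   # e.g. in-domain score distribution
    q = np.array([0.1, 0.3, 0.6])   # e.g. boundary score distribution
    print(js_divergence(p, q))            # equal prior (lambda = 0.5)
    print(js_divergence(p, q, lam=0.9))   # prior weighted towards in-domain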

Two examples of score distributions are given in Figure 5.6. Even measures

with near-identical distributions may be informative in a multi-variate model where

higher level correlations can generate an effective boundary surface. Despite the

low information content of some of the constituent measures, the total information content can exceed the sum of the individual contributions, as the correlations between measures are often weak. The optimal complex decision boundary is learned

by training a neural network as described next. The top ten measures and their

Jensen-Shannon divergence are given in Table 5.1. Although better separation

was obtained with individual boundary windows, the final boundary window was

uniformly set to x = 10 (experiments with smaller window sizes decreased final

prediction accuracy) and the smoothing window w was set individually for each

score based on the optimization of the Jensen-Shannon divergence.

It should be noted that not all measures are independent of each other, and


Figure 5.6: Distributions of scores. Probability vs. z-score distributions of domain positions and boundary positions, shown for the symmetric correlation scores and the aliphatic residue scores.


Table 5.2: Most correlated score pairs.

Scores Correlation

Hydrophobicity and Buried 0.704

Small and Glycine 0.646

Aliphatic and Buried 0.619

Joint and Combined Termination 0.607

Hydrophobicity and Aliphatic 0.528

Coil and Proline 0.500

Aliphatic and Small 0.455

Molecular Weight and Positive 0.450

Aliphatic and Acyclic 0.430

Aliphatic and Glycine 0.416

as expected some are highly correlated. It is interesting to analyze the correla-

tion between pairs of measures. The most correlated and anti-correlated pairs of

measures are listed in Tables 5.2 and 5.3.

Some of these correlations are in support of what is known about sequence-

structure relations in proteins. For example, Proline residues enable extended

chain conformations and are more likely to be seen in coiled regions. Similarly the

negative correlation between buried residues and those in coils is along expected

lines. In addition we also see reassuring examples like the correlation between

intron and joint termination scores and the negative correlation between alpha

helix regions and insertion entropy that provide support for the relevance and

correctness of our scores.


Table 5.3: Most anti-correlated score pairs.

Scores Correlation

Molecular Weight and Small -0.767

Beta and Alpha -0.747

Alpha and Coil -0.634

Molecular Weight and Aliphatic -0.628

Molecular Weight and Glycine -0.589

Acyclic and Proline -0.540

Buried and Coil -0.487

Hydrophobicity and Positive -0.469

Molecular Weight and Acyclic -0.392

Positive and Aliphatic -0.313

5.2.5 The learning model

Each one of the measures we described in Section 5.2.2 captures some aspects

or properties of domain transition signals. In many cases one or two measures

will be significant enough to indicate a domain boundary (see examples below).

However, usually none of them is significant enough and it is only their combination

that reveals the subtle signal. To find the optimal combination we trained a neural

network over the domain information content scores. A neural network is capable of

learning complex non-linear decision boundaries between categories and therefore

seems to be most suited for this task (an alternative model to try would be SVMs).

The inputs used were the individual scores in a position and the output learnt is

a number between 0 and 1, where 0 corresponds to a transition point and 1 to a

domain. We trained networks using the Matlab neural network toolbox, on a training set of 484 proteins with a validation set of 237 proteins and a test set of 460 proteins.

We opted for a commonly used framework for neural network training: feed-forward

networks trained using the resilient back-propagation algorithm (trainrp under

Matlab) with a tangent sigmoid activation function.
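The training itself was done in Matlab; as a rough, hedged analogue, here is a Python sketch using scikit-learn (an assumption: scikit-learn has no resilient back-propagation, so its Adam solver stands in for trainrp, and the data below is synthetic):

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Synthetic stand-in: rows = positions, columns = the 22 measure z-scores;
    # target is 1 for core-domain positions and 0 for boundary positions.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 22))
    y = (rng.random(1000) > 0.3).astype(float)

    # Two hidden layers (e.g. 25 and 5 units) and a tanh activation mirror the
    # feed-forward / tangent-sigmoid setup described above.
    net = MLPRegressor(hidden_layer_sizes=(25, 5), activation='tanh',
                       solver='adam', max_iter=500, random_state=0)
    net.fit(X, y)
    output = net.predict(X)   # ~1 near core positions, ~0 near boundaries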

There are various parameters that can influence the performance of the neural

networks. Firstly, since our training set is composed largely of core-domain posi-

tions the neural network is biased towards learning these positions well. In order to

circumvent this bias we used only a sampling of the core-domain positions. Various

choices of the ratio of core to boundary columns in the training set give various

tradeoffs in the predictive power of core and boundary positions in a test set and

so we experimented with this ratio as a parameter in our system. Secondly, since

a domain transition point is not singular we also tried to learn more complex net-

works that map multiple inputs (several positions along the sequence) to multiple

outputs. Our preliminary investigations showed that using multiple outputs always

decreased performance and so we restricted ourselves to varying the input window

size. Thirdly, while in theory using all the measures that we designed to train the

network should be optimal, in practice a smaller set of inputs can decrease the

search space for the neural network training system and thus improve performance

by decreasing the chances of being trapped in local minima. The choice of the

number of features to use was therefore another parameter that we optimized for.

Finally, network architecture affects the expressive power of the network and can

play a crucial role in how well it learns a function. We restricted ourselves to net-

works with two hidden layers (as in theory this is enough to model any function)

and varied the sizes of the first and second hidden layers of the network.

We varied the above set of parameters in the ranges specified in Table 5.4.


Table 5.4: Ranges for parameters in network training

Parameter Values

Core-Boundary ratio 0.4, 0.8, 1.2, 1.6

Number of features 1, 2, 4, 7, 10, 15, 22

Input window size 1, 5, 9, 13, 17

Size of first layer 0, 5, 10, 15, 20, 25, 30

Size of second layer 0, 5, 10, 15, 20, 25, 30

In choosing the features for the network we tried two different strategies. In the

first case we sorted the set of 22 measures in the order of their Jensen-Shannon

divergence score (largest to smallest) and chose the various measures as features in

that order. This framework allows us to select the best individual features but is

not guaranteed to produce the set that would be optimal when combined together.

As an alternative we took the approach of selecting the principal components of

the vector space defined by the measures⁴, sorted in the order of their eigenvalues

(largest to smallest) as features in that order. This approach has the advantage

that addition of more components is expected to improve the performance of the

system in a predictable manner. However the drawback here is that since the

vector space that we are dealing with has high intrinsic dimensionality the first

few components do not describe the space adequately. As a result they are not as

informative as say the best measures used in the first approach.

Overall we trained more than 3000 networks for each of these approaches. As

can be seen from Figure 5.7 both these approaches lead to a similar set of results.

In general, our choice of values for the core-boundary ratio provides a reasonably

⁴Each alignment column is represented by a vector of measures.


Figure 5.7: Performance of networks as a function of the features used. Each panel plots the fraction of correct predictions for boundary positions against the fraction of correct predictions for core positions: (a) trained using the scores as features; (b) trained using principal components as features.

smooth tradeoff curve between prediction accuracy on core and boundary columns

and defines distinct regions of the curve as seen in Figure 5.8. Increasing the

number of features seems to improve the overall performance of the networks but

after the top 10 measures have been used the improvement is negligible (similar

behavior is seen when we use the principal components as features). Increasing

the input window size does not lead to an overall increase in performance. In fact

the performance seems to decrease slightly with larger window sizes (leading to

networks with higher accuracy on core positions but lower accuracy for boundary

positions). Finally the results seem to be remarkably independent, in an overall

sense, of the size of the network as can be seen in Figure 5.8.

The predictions of the neural network in our system are further post-processed

(see Section 5.2.6) to produce the final predictions. As a result the choice of the

network that will optimize the overall performance of the system is not obvious. In

addition there is a tradeoff between the accuracy and coverage of domain boundary


Figure 5.8: Performance of networks as a function of various parameters. Each panel plots the fraction of correct predictions for boundary positions against that for core positions, varying: (a) the ratio of core to boundary columns (0.4, 0.8, 1.2, 1.6); (b) the number of measures used as features; (c) the input window size; (d) the network size (small, medium, large).


predictions (see Section 5.3). To resolve the question of which neural network

to use we start by pruning our set of networks to only those networks that are

not strictly dominated by any other network in terms of network performance

(this corresponds to the points on the outer boundary of the curve in Figure 5.7).

Since the performance of the principal component based networks is similar to

the performance of the networks that use the scores as features, we retain only the

142 networks that are trained on the scores. Some of the representative points in

this set are presented in Table 5.5. We continue the discussion of the appropriate

network to choose in Section 5.3.

5.2.6 Hypothesis evaluation

The neural networks that we trained do not take into account the predictions for

neighboring positions (and for the protein as a whole) while making a prediction for

a position⁵. Thus, despite the high rate of accurate predictions for single positions,

the final predictions may overly fragment proteins into domains.

To refine the initial predictions of the neural-net the following three steps are

employed. First, to eliminate spurious transition points the curve is smoothed.

This way, a position is predicted as a candidate transition point only if a signifi-

cant fraction of the positions around it are predicted as transition points by the

neural network (this fraction can be altered as a threshold parameter to give differ-

ent levels of accuracy and sensitivity as is described in Section 5.3). Secondly, for

regions below the threshold all the minima are predicted as candidate transition

points (see Figure 5.9). The third step is the most important one. Each possible

⁵Attempts to learn the mapping from local neighborhoods in the input space to local neighborhoods in the output space failed to improve the performance.


Table 5.5: A sample from the set of selected networks

    Number of  Input window  Core-Boundary  Size of      Size of       % correct         % correct
    features   size          ratio          first layer  second layer  core predictions  boundary positions
    10          9            0.4            20           15            0.04              0.98
    4           3            0.4            25           30            0.23              0.95
    15          1            0.4            20           30            0.39              0.90
    4           9            0.8            15            5            0.56              0.80
    10          3            0.8             5           30            0.63              0.76
    15          1            0.8            30           25            0.70              0.70
    22          3            1.6            25            5            0.81              0.57
    10          5            1.6             5           10            0.88              0.43
    2          17            1.6             5           30            0.91              0.30
    7           5            0.4            25           30            0.96              0.20

Figure 5.9: Selecting candidate transition points. The network output is plotted against the alignment columns with a threshold line; minima below the threshold (marked A-E) define candidate transition points. Each subset of these points forms a candidate set ({}, {A}, {A,D}, {C,D}, ..., {A,B,C,D,E}), each scored by its log posterior probability. The initial predictions are smoothed and a set of candidate transition points is defined. This set is processed and a final set of transition points is predicted. Note that the network output is shown as 0-1 only for schematic purposes.

combination of candidate transition points is a possible partitioning of the protein

into domains (see Figure 5.9). Given multiple hypotheses, i.e. alternative parti-

tions of the query sequence into domains, we would like to find the most likely

one. We experiment with two post-processing setups: the simple model and the

domain-generator model. Both methods take the output of the neural network

and consider all minima of the smoothed curve as suspected domain boundaries,

in search for the best hypothesis (partition). We now turn to describe the two

models in detail.


5.2.6.1 The domain-generator model

The domain-generator model assumes a random generator that moves repeatedly

between a domain state and a linker state and emits one domain or transition at a

time according to different source probability distributions. Thus the probability

of a sequence of domains is given by the product of domain-emission probabilities

and the transition probabilities.

Formally, we are given a protein sequence and a multiple alignment S of length

L and a possible partition D of S into n domains D = D1, D2, . . . , Dn of lengths

l1, l2, . . . , ln (as suggested by the output of the neural-net). Our goal is to find the

most likely model, i.e. the partition that maximizes the posterior probability of the

model given the data P (D|S). Our algorithm enumerates all possible combinations

of these positions and the one that maximizes the posterior probability is selected.

Note that while this could be computationally expensive, for most proteins the

number of candidate transition points is less than 15 (as is the case for all proteins

in our test set) thus making this process feasible.

To compute the posterior probability we first estimate the prior and the likeli-

hood of the data given the partition P (S|D), based on the precalculated measures

described in Section 5.2.2. By Bayes formula we can then estimate the posterior

probability

$$P(D|S) = \frac{P(S|D)\,P(D)}{P(S)}$$

The denominator is fixed for all hypotheses and so we are looking for the partition

that will maximize the product of the likelihood P(S|D) and the prior P(D).

Computing the prior: To calculate the prior P (D) we have to estimate the

probability that an arbitrary protein sequence of length L will consist of d domains


of the specific lengths $l_1, l_2, \ldots, l_n$. What we need to calculate then is

$$P(D) = P\left((D_1, l_1)(D_2, l_2) \ldots (D_n, l_n) \ \text{s.t.}\ l_1 + l_2 + \cdots + l_n = L\right)$$

This can be estimated from the data by considering known domain partitions of

proteins of length L. However, the amount of data available is not enough to

accurately estimate these probabilities for all possible partitions. We approximate

this probability by using a simplified model; given the length of the protein, the

generator selects the number of domains first and then selects the length of one

domain at a time, considering the domains that were already generated. For

a partition into n domains there are n! possible orderings of the domains and

therefore the prior probability of the partition is approximated by

$$P(D) \simeq \mathrm{Prob}(n|L) \cdot \sum_{\pi(l_1, l_2, \ldots, l_n)} P_0(l_1|L)\, P_0(l_2|L - l_1) \cdots P_0\!\left(l_{n-1}\,\middle|\, L - \sum_{i=1}^{n-2} l_i\right)$$

where Prob(n|L) is the prior probability that a sequence of length L consists of

n domains and P0(li|L) is the prior probability to emit a domain of length li given

a sequence of length L. The term π(l1, l2, . . . , ln) denotes all possible permutations

of l1, l2, . . . , ln.

The prior probabilities P0(li|L) are approximated by P0(li), normalized to the

relevant range [0..L], and are estimated from the empirical distribution of domain

lengths in the SCOP database⁶. The empirical distribution is very noisy, sparse

for domains longer than 600 amino acids and biased due to uneven sampling of the

protein space, even after eliminating redundancy (see Figure 5.10a). To overcome

⁶Ideally, we would like to use $P_0(l_i|L)$. However, the SCOP data set is very noisy and the resulting distributions are heavily biased towards the domain definitions in SCOP.


Figure 5.10: Distributions of domain lengths (probability vs. length). (a) Before and after eliminating bias (original vs. unbiased). (b) After smoothing (empirical vs. EVD).


the bias we retain only one entry of the same length from each protein family

(Figure 5.10a). Noise and sparse sampling for domains longer than 600 amino acids

are handled by running a few smoothing cycles that resulted in the distribution

plotted in Figure 5.10b. Interestingly, the obtained distribution follows closely the

extreme value distribution (see Section 5.3.6 for discussion).

The second term, Prob(n|L), is given by $\mathrm{Prob}(n|L) = \mathrm{Prob}(n, L)/P(L)$ where $\mathrm{Prob}(n, L)$ is estimated by the $(n-1)$th order sum

$$\mathrm{Prob}(n, L) = \sum_{x_1=1}^{L} P_0(x_1) \sum_{x_2=1}^{L} P_0(x_2) \cdots \sum_{x_{n-1}=1}^{L} P_0(x_{n-1}) \cdot P_0(L - x_1 - x_2 - \cdots - x_{n-1})$$

and $P(L)$ is simply given by the complete probability formula

$$P(L) = \sum_{i=1}^{L} \mathrm{Prob}(i, L)$$
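Prob(n, L) is an n-fold convolution of P0 with itself, which is what the dynamic programming mentioned in the caption of Figure 5.11 computes. A sketch in Python (the toy length prior below is illustrative, not the smoothed SCOP distribution):

    import numpy as np

    def domain_number_priors(p0, max_domains=20):
        # prob[n][L] = probability that a length L is the sum of n i.i.d.
        # domain lengths drawn from p0 (index = length in residues).
        Lmax = len(p0)
        prob = np.zeros((max_domains + 1, Lmax))
        prob[1] = p0
        for n in range(2, max_domains + 1):
            prob[n] = np.convolve(prob[n - 1], p0)[:Lmax]
        return prob

    p0 = np.zeros(1000)          # toy stand-in for the smoothed length prior
    p0[30:300] = 1.0
    p0 /= p0.sum()
    prob = domain_number_priors(p0)
    L = 250
    p_n_given_L = prob[1:, L] / prob[1:, L].sum()   # Prob(n | L)
    print(p_n_given_L[:4])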

The extrapolated distributions for n = 1..7 are plotted in Figure 5.11a. It should be

noted that the empirical distributions differ quite markedly from these extrapolated

distributions (Figure 5.11b). However, since the data is noisy, sparse and possibly

biased, we consider the extrapolated distributions to be more reliable than the

empirical ones. For one, note that the empirical probability for a protein to be a

single domain dominates all other scenarios up to proteins of length 400(!), while

the curves meet much earlier (around 200) in the extrapolated distributions. Our

observation is also supported by the quite different distributions observed in the

CATH database, further undermining the reliability of the empirical distributions.

The impact of the extrapolated distributions is indeed evident in our results (see

Section 5.3). Our model tends to predict more domains than SCOP, and in many

cases refines SCOP partitions into more compact substructures.


Figure 5.11: Distributions of the number of domains (probability vs. length, for 1-7 domains). (a) Extrapolated. (b) Empirical. The extrapolated distributions are normalized assuming that the maximal number of domains is 7 (the maximal number observed in SCOP). In our calculations we considered up to 20 domains. These probabilities can be precalculated using a dynamic programming algorithm.


Computing the likelihood: To calculate the likelihood of the data given the

model P (S|D) we use the probabilities of the observed scores given the domain

structure as predicted by the neural-net. We consider the individual domains

and the transitions between domains (the linkers) as two different sources. Each

source induces a unique probability distribution over the domain-information con-

tent scores (see Section 5.2.2). Specifically, given the model D that partitions the

sequence S into n domains and n− 1 transitions D1, T1, D2, T2, . . . , Tn−1, Dn that

correspond to the subsequences s1, t1, s2, t2, . . . , tn−1, sn we estimate the likelihood

by

$$P(S|D) = P(S|D_1, T_1, D_2, T_2, \ldots, T_{n-1}, D_n) = P(s_1|D_1)\,P(t_1|T_1)\,P(s_2|D_2)\,P(t_2|T_2) \cdots P(t_{n-1}|T_{n-1})\,P(s_n|D_n)$$

where we already employed the assumption that the domains are independent of

each other (see Section 5.2.6.3 for discussion). Each one of the terms P (si|Di) and

$P(t_j|T_j)$ is a product over the probabilities of the individual positions. The probability of an individual position j in domain i is estimated by the joint probability distribution of the k features that are used in our system

$$P(s_{ij}|D_i) = P(f_1, f_2, \ldots, f_k | D_i)$$

However, estimating this probability is impractical given the amount of data we

have. On the other hand, given the correlation between scores (see Section 5.3.2)

the independence assumption for the individual scores does not hold. Therefore

we adopt an intermediate approach. We start by writing the exact formulation of

the joint probability distribution of k random variables $X_1, X_2, \ldots, X_k$ using the expansion

$$P(X_1, X_2, \ldots, X_k) = P(X_1)\,P(X_2|X_1)\,P(X_3|X_1, X_2) \cdots P(X_k|X_1, X_2, \ldots, X_{k-1})$$

where the random variables can be ordered in an arbitrary order. We then derive an

approximation to these probabilities using first-order dependencies⁷ and a heuristic

expansion. The methodology is as follows: for each pair of random variables X, Y

we calculate the distance between the joint probability distribution and the product

of the marginal probability distributions

$$\mathrm{DEPEN}(X, Y) \equiv \mathrm{Dist}(P_{XY}, P_X P_Y)$$

This distance (measured either using the l1 norm or the JS divergence measure) is

a measure of the dependency between the two variables. The larger it is, the more

dependent are the variables (one might also consider using the mutual information

measure instead).

We sort all pairs based on their distance and pick the most dependent one first

(denoted by Y1 and Y2) to start the expansion

$$P(X_1, X_2, \ldots, X_k) = P(Y_1)\,P(Y_2|Y_1) \cdots$$

The next terms are selected based on their strongest dependency with variables

that are already used in the expansion. Thus

$$Y_3 = \arg\max_{Y}\ \max\{\mathrm{DEPEN}(Y, Y_1), \mathrm{DEPEN}(Y, Y_2)\}$$

Denote by Z = PILLAR(Y ) the random variable that Y is most dependent on

(of the random variables that are already in the expansion), then of all possible

⁷Pair statistics can be calculated quite reliably from our data set, but the data is too sparse to derive reliable estimates of higher order statistics.


dependencies involving Y3 we pick P (Y3|PILLAR(Y3)) and add it to the expansion

$$P(X_1, X_2, \ldots, X_k) = P(Y_1)\,P(Y_2|Y_1) \cdot P(Y_3|\mathrm{PILLAR}(Y_3)) \cdots$$

The procedure continues until all variables are accounted for. This heuristic at-

tempts to minimize the errors that are introduced by relaxing the dependency as-

sumption to a first order dependency by maximizing the support for each random

variable we introduce in the expansion. Thus, highly correlated variables affect the

total probability only marginally, while under the independence assumption they

might introduce a substantial error (other, alternative methods for approximat-

ing the joint probability distribution from the marginal distributions are described

in [Ireland & Kullback 1968] and [Pearl 1997]). Note that the expansion for do-

main regions can be different from the expansion for linker regions, as the source

distributions differ.

However, once the two expansions (for domains and linkers) are defined based

on the pair statistics, the same two expansions are used for all domains and all

linkers.
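A sketch of this greedy expansion in Python, given a precomputed matrix of pairwise DEPEN values (the matrix here is random toy data):

    import numpy as np

    def expansion_order(depen):
        # Start from the most dependent pair, then repeatedly add the variable
        # with the strongest dependency on any variable already included; that
        # variable becomes its "pillar" in the first-order factorization.
        k = depen.shape[0]
        d = depen.copy()
        np.fill_diagonal(d, -np.inf)
        i, j = np.unravel_index(np.argmax(d), d.shape)
        order, pillar = [i, j], {i: None, j: i}
        remaining = set(range(k)) - {i, j}
        while remaining:
            y = max(remaining, key=lambda v: max(depen[v, z] for z in order))
            pillar[y] = max(order, key=lambda z: depen[y, z])
            order.append(y)
            remaining.remove(y)
        # Factorization: P(Y1) * prod over later Y of P(Y | PILLAR(Y)).
        return order, pillar

    depen = np.random.default_rng(1).random((5, 5))
    depen = (depen + depen.T) / 2     # toy symmetric dependency matrix
    print(expansion_order(depen))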

Hypothesis selection: Given a set of N candidate transition points (the minima of the neural network output), our algorithm enumerates all possible combinations

of transition points to form 2N possible partitions (hypotheses). For each partition

we calculate the posterior probability (using our domain-generator model) and

eventually output the most likely one. The whole calculation is very fast. For

example, for a protein of length L = 300 and a set of N = 10 possible transition

points, the algorithm will output the most probable hypothesis in a matter of

minutes.
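A sketch of the enumeration in Python (the scoring function here is a toy stand-in for the posterior described above):

    from itertools import combinations

    def best_partition(candidates, log_posterior):
        # Enumerate all 2^N subsets of candidate transition points and keep
        # the one with the highest (log) posterior probability.
        best, best_lp = None, float('-inf')
        for r in range(len(candidates) + 1):
            for subset in combinations(candidates, r):
                lp = log_posterior(subset)
                if lp > best_lp:
                    best, best_lp = subset, lp
        return best, best_lp

    # Toy posterior: favors boundaries near positions 181 and 354.
    toy_lp = lambda s: (-sum(min(abs(x - t) for t in (181, 354)) for x in s)
                        - 5 * abs(len(s) - 2))
    print(best_partition([50, 181, 200, 354, 400], toy_lp))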


5.2.6.2 The simple model

In the simple model, the candidate transition points are listed in decreasing order of

reliability (as measured by the depth of the corresponding minima in the smoothed

curve) and considered in this order. Once a minimum is selected all minima that are within a window of k amino acids around it are rejected (where k is a function of

the protein length). This is a greedy approach that seems to work pretty well for

many proteins. The depth of the minimum is a good approximation of the overall

posterior probability of the transition points P (Ti|ti), as the network essentially

assigns a value O(i) that indicates the network’s confidence in this position as being

an in-domain position. Thus 1 − O(i) (the depth of the minimum) is the probability

that this position is a boundary position.

5.2.6.3 The independence index

Both our models explicitly or implicitly assume that the domains across transition

points are independent. However, when searching for the best model one should

also consider the validity of this assumption and the “quality” of the predicted

transition points. Not only should they indicate domain boundaries, but they

should also justify the independence assumption over neighboring domains that

we employed above.

We define the following confidence or independence index for each transition

point. This index estimates the likelihood that the domains on both sides of the

transition point are independent of each other. This likelihood is estimated as

follows: if indeed the two domains were formed independently then the patterns of

sequence divergence should be different. By comparing the divergence patterns one

can indirectly measure the statistical similarity of the sources that generated the


two domains. The divergence pattern is given by the distribution of evolutionary

distances of sequences in the alignment of each domain (using the subset of n

common sequences). For each sequence we approximate its evolutionary distance

from the query seed sequence by counting the number of point mutations per

100 amino acids. The specific divergence pattern (the vector of n − 1 distances)

is a reflection of the statistical source that generated the domain. To quantify

the likelihood that the source distributions are unique we compute the Pearson correlation between the two divergence patterns and this gives us our independence

index. Zero correlation indicates two unique sources (independent domains).
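A minimal sketch of the independence index in Python (the distance vectors are toy data; in the system they are mutation counts per 100 amino acids over the shared sequences):

    import numpy as np

    def independence_index(dist_left, dist_right):
        # Pearson correlation between the divergence patterns of the domains
        # on either side of a candidate transition point; values near 0
        # suggest two independent domains.
        x = np.asarray(dist_left, dtype=float)
        y = np.asarray(dist_right, dtype=float)
        return float(np.corrcoef(x, y)[0, 1])

    left  = [12.0, 40.5, 33.1, 8.2, 25.0]
    right = [30.0, 11.2, 44.7, 19.9, 5.3]
    print(independence_index(left, right))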

To assess the quality of each individual transition point we compute the in-

dependence index, and report its statistical significance in terms of its z-score

(computed based on the background distribution of independence indices over a

large set of randomly selected positions). These numbers are reported for each

transition point in the final prediction. Thus, the user can evaluate not only the

plausibility of the overall partition but also of each individual transition. For com-

parison, the average independence index for random positions is 0.79 (standard

deviation of 0.26), while for true transition points the average is 0.68 (standard

deviation of 0.34). In other words, true transition points partition proteins into

less correlated domains, as desired.

5.3 Results

To test our approach we ran our system on a subset of 460 proteins that were

excluded from the training set. The test set was well balanced in terms of the

number of multi-domain proteins with 222 single domain and 238 multi-domain

proteins (of which 179 are two-domain, 43 are three-domain, 13 are four-domain


and 3 are five-domain proteins). For each of these proteins the prediction was

compared to that of SMART [Ponting et al. 1999], Tigr [Haft et al. 2001], Pfam

[Bateman et al. 1999] and ProDom [Sonnhammer & Kahn 1994], based on the in-

formation provided by InterPro [Apweiler et al. 2001] as well as predictions from

DOMO [Gracy & Argos 1998] obtained by running BLAST searches against the

DOMO database. Interpro predictions for ProDom are limited to a curated subset

of ProDom and so we also present results predicted directly by ProDom for pro-

teins in the test set that can be matched (based on their accession numbers) to

the complete ProDom database.

Since the predictions obtained from other systems are often incomplete for the

seed proteins in our test set, we needed to design an evaluation procedure that

would have different scores for accuracy and coverage. In addition, the predictions

may disagree with SCOP on the number of domains in the seed protein. Therefore

one needs to define a procedure for associating predicted transition points with

their most probable SCOP counterparts and vice versa. The simplest choice is

to assign every transition point that is being considered to the closest reference

transition point. Here we adopt this model and define the following four measures:

Distance accuracy. This measure evaluates predictions by using SCOP transi-

tion points as reference. For each seed protein we calculate the average distance of

the predicted transitions from their associated SCOP transition points. The final

value that is reported is the average distance over all proteins in the test set.

Distance sensitivity. This measure assesses the sensitivity in detecting true

domain boundaries by using the predicted transitions as reference. The average

distance of SCOP transitions from the associated predicted transitions is calculated

for each protein, with the value reported being the average of this distance over


all proteins in the test set.

Selectivity. For this measure we consider predictions that are within x = 10

residues of a SCOP transition as being correct with the final value reported being

the percentage of predictions that are considered correct for the entire set.

Coverage. Analogous to accuracy, SCOP transitions that are associated with

a predicted transition point within x = 10 residues are considered successfully

predicted. The percentage of correctly predicted SCOP transitions for the entire

set is reported.

Using these measures we evaluated the results of post-processing the network

output for the final set of 142 optimal networks (see Section 5.2.5) using both the

simple model and the domain-generator model. As can be seen in Figure 5.12,

even though the performance of none of these networks dominates that of the

other, the performance after post-processing may do so. We also observe various

tradeoffs for selectivity vs. coverage based on which network we use. The choice of

which network to use should depend on the application that we have in mind (and

therefore the tradeoff that we would like to work with). For example, application

of this method for structural genomics purposes might require high selectivity to

avoid fragments that cannot fold independently. On the other hand domain family

classification programs may prefer high coverage to generate accurate sub-domain

families that can then be merged to get the final domain families. For the purpose

of evaluation we chose a single network for each model, as described in Figure 5.12.

The tradeoff curves in Figure 5.12 are not very smooth and changing the trade-

off requires us to change the network and the inputs used. This setup is therefore

not amenable for the construction of a flexible system where we can easily move

on the tradeoff curve. We can however get a smooth tradeoff curve similar to that


Figure 5.12: Coverage vs. Selectivity for the final set of networks. (a) Post-processing with the simple model. (b) Post-processing with the domain-generator model. For each model we select a single network to work with, marked with a box. These networks are selected such that no other network dominates them. They are located at the cusps of a sudden fall in performance. Interestingly, both these points correspond to the same network that uses all 22 features, an input window size of 1, a core-boundary ratio of 1.6 and hidden layers of size 25 and 5 respectively.


Figure 5.13: Coverage vs. Selectivity tradeoff while varying the threshold, for the simple model and the domain-generator model, alongside HMMPfam, HMMSmart, HMMTigr, BlastProDom and BlastDomo.

seen in Figure 5.13, for any fixed network that we choose, by varying the threshold

parameter (see Section 5.2.6) for the network output. This gives us the flexibility

of changing the performance of the system by altering a single parameter. The

curves seen in Figure 5.13 both have gentle cusps towards the top of the curves.

Both these points correspond to a threshold of 0.5. The results reported next are

obtained when setting the threshold parameter to that value.

First we evaluated our two post-processing methods. The results are sum-

marized in Table 5.6. Both methods perform almost the same, as measured by

the four performance indices described above. Nevertheless, the domain-generator

model has some advantages over the simple model. First, as opposed to the greedy

approach of the simple model, the domain-generator model considers all possi-

ble hypotheses. Moreover, it provides us with a critical statistical framework for

assessing alternative, competing hypotheses. The model can be used to assign

a confidence value to each hypothesis and by comparing these confidence values

(between the best hypothesis and the next best hypothesis or the set of all other


Table 5.6: Performance evaluation results for the two post-processing methods

                          Number of    Accuracy/      Selectivity/
                          Predictions  Sensitivity    Coverage
                                       (in residues)  (percentages)
    simple model              460         40/24          35/45
    domain-generator          460         48/19          27/51

The number of predictions is the total number of proteins in the test set for which predictions were made. For each protein several transition points may be predicted. Performance measures (accuracy, sensitivity, selectivity and coverage) are based on the complete set of predicted transition points.

hypotheses) one can define a significance measure and associate it with the out-

put hypothesis. In cases where the differences between competing hypotheses are

insignificant, one might also want to consider the alternative domain partitions.

A summary of the evaluation results for our method and other sequence based

methods is presented in Table 5.7. Our method significantly improved over all

other automatic methods, outperformed only by the manually calibrated Pfam

(see next section for discussion). Note that the criterion used to compute the

coverage and selectivity is very strict (the agreement must be within 10 residues).

One can relax this criterion by increasing the window size. This would result in

a 5-10% increase in performance for both measures when using a window of 15

residues, for example.

We also evaluated the overall consistency of the different methods. Specifically,

we ask how many proteins are predicted completely correctly, both in terms of the

total number of domains, and their exact locations. The results are summarized

in Table 5.8. Again, our method performed well compared to all other automatic


Table 5.7: Performance evaluation results for sequence based methods

    Method                      Number of    Accuracy/      Selectivity/
                                Predictions  Sensitivity    Coverage
                                             (in residues)  (percentages)
    Our method                     460          40/24          35/45
    HMMPfam                        441          29/14          43/65
    BlastDomo                      252          17/70          22/12
    BlastProDom (Complete)         218          29/45          19/27
    HMMSmart                       172          12/73          27/17
    BlastProDom (Interpro)         123           8/90          30/6
    HMMTigr                         51           2/96          33/1

The relatively good accuracy values for HMMTigr, ProDom, HMMSmart and Domo are the result of the small number of predictions these methods make. The selectivity and coverage values are more indicative of the overall performance of each method.


methods. Moreover, while other methods performed well mostly over single domain

proteins, our method performs well on many multiple domain proteins as well.

5.3.1 Inclusion of structural information in prediction

When evaluating the results one has to keep in mind that incorporation of struc-

tural information, when available, can improve the quality of predictions. Indeed,

the PFam database uses this information explicitly by defining domains using the

SCOP database. It is not surprising, therefore, that the manually calibrated PFam performed better on the test set. Its performance, however, may not be as good

over an independent data set. In order to correct this bias, one would ideally like

to generate a totally independent test set. However, since Pfam is in the process

of integrating all of SCOP definitions to determine their domain definitions it is

hard or almost impossible to generate such a set.

Instead, we tested the effect of incorporation of structural information on our

predictions. We repeated the process, this time including the SCOP sequences

in the database. Thus the alignments that we generate might contain SCOP

amino acid sequences of structural domains. However, these sequences are not

used arbitrarily in our system to chop the proteins into domains. Rather, they

add to the overall signal in each one of the constituent measures and it is the

cumulative contribution that is detected by our learning system. As a result, both

sequences of unknown structures and sequences of known structures can affect

the predictions. In other words, our learning system does not explicitly use the

structural information and it processes alignments that contain SCOP sequences

exactly the same way it processes alignments which are based purely on sequences

of unknown structures. The results of this procedure are summarized in Tables


Table 5.8: Global consistency results

    Method             Number       Correct number  Completely correct  Correct predictions  Correct predictions
                       predictions  of domains      predictions         (single domain)      (multi-domain)
    Our method            460          267              205                 134                  71 (35%)
    HMMPfam               441          309              276                 178                  98 (36%)
    BlastDomo             252          148              118                  98                  20 (17%)
    BlastProDom (C)       218           94               83                  51                  32 (39%)
    HMMSmart              172          112               91                  70                  21 (23%)
    BlastProDom (I)       123           83               75                  63                  12 (16%)
    HMMTigr                51           23               21                  20                   1 (5%)

The two results for ProDom are those obtained using the complete definitions (C) and the Interpro subset (I). The percentages in the last column are the percentages of correct predictions of multi-domain proteins out of all correct predictions. Among the multi-domain proteins the percentages of correctly predicted two-domain proteins, three-domain proteins etc. remain roughly the same as their proportions in the test set.

Table 5.9: Performance evaluation results when structural information is used

                  Number of    Accuracy/      Selectivity/
                  Predictions  Sensitivity    Coverage
                               (in residues)  (percentages)
    Our method       460          27/6           63/83
    HMMPfam          441          29/14          43/65

PFam explicitly uses the structural information available from SCOP domains. To test the effect of these sequences on the predictions we included them in the alignments, and used those alignments as input for our system. Thus, under this setup, our system uses the structural information implicitly.

Table 5.10: Global consistency results when structural information is used

                  Number       Correct number  Number of completely
                  predictions  of domains      correct predictions
    Our method       460          318              308
    HMMPfam          441          309              276

5.9 and 5.10. Note the significant improvement in performance for our method, especially in coverage and selectivity.

5.3.2 Examples

The overall performance of our method shows that the model is capable of learning

even subtle signals that indicate domain boundaries. Our first example is a three

domain protein that was predicted accurately for all its domains. This is the PDB

protein 1qpb (chain B), 563 residues long. The protein is partitioned by SCOP

into three domains that correspond to positions 2-181, 182-360 and 361-556. Our


prediction suggests transition points at positions 181 and 354 (see Figure 5.14)

within 6 residues from SCOP definitions. These positions are correlated with

strong combined termination and insertion entropy signals. In addition there is

an abundance of proline residues around positions 180 and 360 and there are class

entropy spikes around positions 110, 180, 360 and 500. For comparison, PFam

predicts three thiamine pyrophosphate enzyme domains at positions 2-180, 197-

348 and 361-538. No predictions were available from ProDom, DOMO, SMART

or Tigr.

Another example where our method correctly predicted all the domain transi-

tion points is for the protein 1gh8 (chain B). However, in this case none of the other

sequence-based predictions (including PFam) were able to partition the protein cor-

rectly. This protein is 511 amino acids long and according to SCOP it consists

of three independent domains, between positions 2-168, 169-389 and 390-511 (see

Figure 5.15). Our prediction locates domain boundaries at positions 165 and 392,

within three residues from the SCOP definition. In PDB, 1gh8 is annotated as

an archaeal translation elongation factor. However a HMM search using PFam

reports the main domain being an ATP-sulfurylase between positions 72-392. A

look at the structure of the protein clearly shows that this is an unsatisfactory

domain definition. Similarly Prodom (Interpro) predicts a domain between posi-

tions 37 and 393. Both Domo and Tigr make similar predictions (1-396 and 4-386)

that merge the first and second domains into one large domain. No predictions are

available from SMART. Detailed analysis of our system in this case reveals com-

bined termination signals at positions 80, 180, 290 and 390 and weighted mutation

profile troughs at positions 120 and 390. Peaks in insertion entropy are also seen

at positions 140, 160 and 250 and an abundance of proline residues is seen around


Figure 5.14: Domain definitions for 1qpb

Our method predicts three domains. The transition points are marked by their residue numbers.


positions 260 and 390.

5.3.3 Suggested novel partitions

The list of proteins on which our method failed to correctly predict domain bound-

aries as defined by SCOP revealed interesting cases. Many of them raise serious

questions about the validity of SCOP definitions. For example, PDB protein 1acc

(735 amino acids long) is defined as a single domain in SCOP. Our analysis suggests

three domains at positions 1-160, 161-586 and 587-735 (see Figure 5.16). As the

figure illustrates, this partition seems to better satisfy the definition of a domain as

a compact, independent foldable unit. Moreover, given the distribution of domain

sizes in proteins (see Section 5.2.6.1), it is quite unlikely for a protein domain

to be longer than 700 amino acids, thus further supporting our hypothesis. For

comparison, Pfam detects one domain at positions 103-544 (PF03495 Clostridial

Binary exotoxin B) and Domo predicts two domains at positions 1-647 and 648-735.

No predictions are available from ProDom (Interpro), SMART or Tigr.

In this case we get a clean and strong joint termination signal at positions 160

and 590, and a remarkably consistent alignment between positions 170 and 580.

This signal is reinforced by other measures: the hydrophobic curve has three major

troughs at 170, 290 and 570, insertion entropy has major peaks at 180, 310 and

560 and correlation is quite low around 200, 280 and 590.

Another interesting example is the PDB protein 1ffv (chain E) that is 803

residues long and partitioned by SCOP into two domains defined by the positions

7-146 and 147-803. Our method predicts four domains at positions 1-141, 142-426,

427-591 and 592-803 (see Figure 5.17). While our prediction agrees with SCOP in

defining the first domain, it further partitions the second domain into three subunits.

Figure 5.15: Domain definitions for 1gh8

In this case a mosaic of signals (combined termination, weighted mutation profile, insertion entropy and proline) is integrated by our system into two predictions (three domains) that are in good agreement with SCOP's structural definition.

Figure 5.16: Domain definitions for 1acc

SCOP defines this protein as a single domain. Our analysis suggests three compact units.

Analysis of the protein structure indicates that the second domain predicted

by our method does define a distinct, reasonably compact structural domain. In

addition, while the third and fourth domains are intertwined in space, there seems

to be a clear symmetry in their construction suggesting the possibility that they

arose as a result of duplication. Interestingly, CATH also partitions the protein

into four domains though the definitions are much more complicated (domain1:

7-141 and 210-306, domain2: 142-209 and 307-383, domain3: 484-649, domain4:

440-483 and 650-803). The signals that our method gets for predicting the addi-

tional domain boundaries at positions 426 and 591 are quite strong. In addition to

a strong neural-network output we also observe strong sequence termination and

class entropy signals around all three positions.

In both cases, SCOP definitions might be inaccurate because of the lack of

structural information to support the existence of these domains. SCOP domains

are defined as recurrent structural subunits and in the absence of other copies of

these domains the proteins are left untouched. Our analysis indicates that, had

the structures of related proteins been resolved, such evidence would have become

available. In the presence of such strong signals based on sequence information

it is clear that the domain structure of proteins cannot be determined based on

structural information alone.

5.3.4 Analysis of errors

Our method does fail in cases where signals are misleading. This usually seems to

happen when the domain definition for the protein is complicated by the unusual

structure and topology of the protein. One such case is for the beta-barrel protein

1qkc. It is classified by SCOP as a single domain protein while our method predicts three domains defined by the positions 1-256, 257-394 and 395-725 (see Figure 5.18).

Figure 5.17: Domain definitions for 1ffv

Our analysis of this protein helps us to identify two likely domain boundaries that are missed by SCOP and that help partition the protein into more compact domains (domains are rotated for visual clarity).

In comparison, PFam predicts a domain between positions 615 and 725 and Domo

predicts two domains at positions 21-337 and 338-725. In general, beta-barrel pro-

teins are considered hard test cases, even for structural domain classifiers. While

our predictions clash with the standard definition of classifying the entire barrel

structure as one domain it is interesting to note that both boundary predictions

made by our method are in looped regions, even though it is much more likely

that a prediction lies in a beta strand region (based on the beta to loop ratio). In

addition, while it is not clear if the domains predicted by our method are the cor-

rect pieces, it seems quite plausible that the beta-barrel structure evolved by the

fusion of two or more barrel pieces. The domain boundary predicted by DOMO

also lends some support to our prediction. Further investigation from a biological

perspective of the pieces that we identify as domains may help prove or disprove

this hypothesis.

Another unusual case is the PDB protein 1i6v that is 1118 residues long. SCOP

classifies this protein as a single domain protein. Our method partitions the protein

into four domains defined by the positions 1-220, 221-513, 514-830 and 831-1118

(see Figure 5.19). As can be seen from the RasMol ribbons image, this protein is

highly unstructured and has a complicated topology. The domains defined by our

method do not partition the protein into clean, structurally distinct units. However,

they do indicate that 1i6v is probably not a single domain protein. Our predictions

are supported by significant confidence index values (see Section 5.2.6.3) as well.

The length of the protein is another factor that suggests that this protein is multi-

domain. It is possible that some of the domains in 1i6v are non-continuous, further

complicating domain prediction.

Figure 5.18: Domain definitions for 1qkc

Example of a beta-barrel protein where our method predicts component domains that need further investigation in order to be validated.

Figure 5.19: Domain definitions for 1i6v

We believe that many of the “errors” will be resolved as more structures are

solved and SCOP definitions are refined. In some cases, the situation will require a

more precise definition of what a domain is. Finally, increases in sequence data and the

design of more sophisticated measures employing additional sources of information

will help to improve predictions.

5.3.5 Consistency of domain predictions

Our gold standard so far was the SCOP database of protein domains. The domains

in this database are defined manually based on visual inspection of protein struc-

tures; however, there is no assurance that the definitions are indeed accurate and

correspond to the “true” definitions. Since no quantitative rules or principles are

used, different points of view might lead to somewhat different domain definitions.

To assess the stability and accuracy of our domain prediction algorithm we

tested it on CATH [Orengo et al. 1997] which is another structure-based domain

classification system. CATH combines sequence analysis with structure comparison algorithms to determine structural domains.

Table 5.11: Performance evaluation results using domain definitions in CATH

                            Number of     Accuracy/Sensitivity   Selectivity/Coverage
                            Predictions   (in residues)          (percentages)
  SCOP                      220           14/13                  74/76
  simple model              220           37/27                  34/42
  domain-generator model    220           46/23                  27/47
  HMMPfam                   209           32/24                  36/52
  BlastDomo                 125           17/65                  24/14
  BlastProDom (Complete)    104           32/48                  15/23
  HMMSmart                  80            11/75                  22/12
  BlastProDom (Interpro)    62            8/86                   31/7
  HMMTigr                   22            3/92                   20/1

Of the 238 multi-domain proteins in

our test set we were able to map 158 proteins to release 2.4 of CATH (see footnote 8). Of the

222 single domain proteins in the test set we were able to map almost all (217)

to CATH. Of the 158 multi-domain proteins, 48 contained discontinuous domains

(according to CATH) that cannot be predicted with our method (see discussion

below) and therefore were eliminated. To keep the numbers of single and multi-

domain proteins balanced we sampled 110 proteins from the list of single domain

proteins to get a new test set of 220 proteins.

Footnote 8: Based on the PDB identifiers we were able to map most of the proteins (197 out of 238), but since CATH uses the ATOMRES records while we use the SEQRES records of the PDB files, there were some discrepancies (gaps, and length mismatches between ATOMRES and SEQRES records) that rendered some files unusable for testing.

We repeated our performance evaluations over this set of 220 proteins using

the CATH definitions as the standard of truth. The results are given in Table

5.11. As can be seen from the first line of the table, while CATH and SCOP are

in generally good agreement, they do differ in some cases. Based on a comparison with

the results in Table 5.7 we can see that the performance of our method is stable

across CATH and SCOP. The stability of our results therefore indicates that our

methodology learns a more general concept of domains. In contrast, we see that

the performance of PFam on CATH is not as good as on SCOP. This could be

explained by the fact that PFam definitions are often guided by SCOP definitions.

We studied example cases where our predictions were different from those of

CATH. We found that in general in such cases CATH differs from our method (as

well as from SCOP) because of its tendency to assign small structural fragments from

one sequence domain to another based on structural compactness considerations.

An example of such a situation is the protein 1ekx (chain A) that is 311 residues

long. SCOP defines two domains, the first one between positions 2-151 and the

second between positions 152-311 (see Figure 5.20). Our method predicts one

transition point at position 151, in excellent agreement with the SCOP definition.

The predictions from PFam (8-150, 153-305) and Prodom (7-150, 157-306) also

agree with this definition. CATH defines the first domain as a combination of

two fragments 1-133 and 292-310 and a second domain at positions 134-291. This

results in a fragment of an alpha helix being assigned to the first domain based on

compactness considerations alone.

The inconsistency with our method is not surprising, as our definition of a

domain is evolutionarily motivated. Our model assumes that protein domains are

ancient and evolutionarily conserved sequence fragments that have emerged as protein building blocks during evolution.

Figure 5.20: Domain definitions for 1ekx

The differently colored segments on the top left and bottom right define the two domains of the protein. CATH assigns the fragment between positions 292-310 (in blue) to the domain on the top while our method and SCOP assign it to the domain on the bottom.

This does not cover all possible domain

definitions. Multiple studies have shown that in practice the structural arrangement of

proteins can form compact substructures that are sequence discontinuous. How-

ever, such sequence discontinuous domains need accurate structural information to

delineate them correctly, and it is not clear if it is possible to detect these domains

based on sequence information alone. In the absence of clear evolutionary evidence

supporting this assignment, it is also not clear how to translate such definitions to

our domain definitions. Moreover, the signals, if they exist, might be different from

those for continuous domains, and to learn these signals would require designing

a different learning system. These issues make the identification of discontinuous

domains a harder and possibly orthogonal problem to the one that we tried to

solve in this study.

5.3.6 The distribution of domain lengths

We were intrigued by the fact that the distribution of domain lengths follows

closely the extreme value distribution (EVD), as in Figure 5.10b. This distribu-

tion has been studied extensively, in particular in the context of sequence similarity

[Karlin & Altschul 1990, Dembo & Karlin 1991] and has been used by packages

such as BLAST [Altschul et al. 1997] and FASTA [Pearson & Lipman 1988] to as-

sociate statistical significance measures (E-values) with similarity scores. However,

its appearance in the context of domain lengths is surprising and deserves further

study.
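For reference, a standard parameterization of the (Gumbel-type) extreme value density is

f(x) = \frac{1}{\beta}\,\exp\left(-\frac{x-\mu}{\beta} - e^{-(x-\mu)/\beta}\right),

where the location parameter \mu and scale parameter \beta would have to be fitted to the empirical domain length histogram; this is the textbook form of the distribution, not a parameterization taken from our experiments.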


5.4 Discussion

In this chapter we presented a novel method for detecting the domain structure of

a protein from sequence information alone. Our method utilizes the information

in sequence databases and starts by comparing the query sequence with all the

sequences in the database. The search generates a multiple alignment and the

alignment is processed fully automatically in search of domain transition signals.
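In outline, this pipeline can be summarized in a few lines of Python; here search, align and scan are hypothetical callables standing in for the database search, multiple alignment construction and signal processing stages (the names are illustrative and do not come from our implementation):

    def predict_domains(query, database, search, align, scan):
        # End-to-end sketch of the pipeline described above; `search`,
        # `align` and `scan` are hypothetical callables for the database
        # search, multiple alignment construction and automatic signal
        # processing stages, respectively.
        hits = search(query, database)   # compare the query with all sequences
        alignment = align(query, hits)   # generate a multiple alignment
        return scan(alignment)           # search for domain transition signals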

There are several novel elements in our method. First, our method uses multi-

ple scores. Some of the scores we designed are variations on measures that were

suggested in earlier studies (e.g. sequence participation and correlation scores were

used in DOMO, ProDom and PASS and correlated mutations were used in Rig-

den’s work). However, we introduce many novel scores based on the analysis of

basic sequence properties or predicted properties, scores that are calculated from

multiple alignments and scores that are extracted from external resources such

as intron-exon data. Secondly, we use information theory principles to optimize

the scores and select the subset that maximizes the domain information content.

Thirdly, a neural network is trained to learn a non-linear mapping from the original

scores to a single output. Finally, a probabilistic domain-generator model is devel-

oped to assess multiple hypotheses and predict the most likely one. Unlike local or

heuristic methods that employ a greedy search through the hypothesis space, our

model exhaustively enumerates all possible partitions of the protein into domains,

until it finds the optimal one. This multi-stage system is not only robust to align-

ment inaccuracies, but it can also tolerate partial information. It can be extended

and generalized to include other types of scores. Most importantly, our method

suggests for the first time a rigorous model that can test all possible hypotheses

and output the one that is most consistent with the data. We also developed an evaluation framework that hopefully will provide a clearer understanding of the

strengths and weaknesses of the algorithms that have been designed so far and thus

aid in the design of better algorithms. Moreover, our domain-generator model can

associate a statistical significance score for every hypothesis, thus enabling us to

compare different hypotheses by the same method or even different hypotheses by

several different methods.
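To make the enumeration over partitions concrete, the following minimal Python sketch finds an optimal segmentation of a protein into contiguous domains by dynamic programming, which implicitly considers every possible partition; the function domain_score(i, j) is a hypothetical stand-in for the log-likelihood that the domain-generator model would assign to a single domain spanning positions i to j-1:

    from functools import lru_cache

    def best_partition(n, domain_score, min_len=30):
        # Highest-scoring partition of a protein of length n into contiguous
        # domains. domain_score(i, j) is a hypothetical stand-in for the
        # log-likelihood of a single domain covering positions i..j-1; the
        # dynamic program implicitly covers every possible partition.
        @lru_cache(maxsize=None)
        def best(i):
            if i == n:
                return 0.0, ()
            best_val, best_cut = float("-inf"), None
            for j in range(i + min_len, n + 1):
                tail_val, tail_cut = best(j)
                val = domain_score(i, j) + tail_val
                if val > best_val:
                    best_val, best_cut = val, ((i, j),) + tail_cut
            return best_val, best_cut
        return best(0)

    # Toy score favoring ~150-residue domains; a 450-residue protein is
    # split into three equal domains: (0.0, ((0, 150), (150, 300), (300, 450)))
    print(best_partition(450, lambda i, j: -abs((j - i) - 150)))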

We trained and tested our method on what is considered to be the gold stan-

dard in protein structure classification, the SCOP database of protein domains.

Our method performed very well compared to all other methods currently available

while being fully automatic. One should keep in mind that SCOP is a man-made

classification and the definitions of domains do not necessarily conform with “na-

ture's definitions". Indeed, many of our supposed errors seem to make sense when

inspected visually. Moreover, SCOP might be inaccurate near domain boundaries,

as the selection of the actual transition point is quite arbitrary. Our method pro-

vides a rigorous and accurate way to predict not only the domain structure but also

the most likely transition points and can be used to augment or guide predictions

based on structural data.

The utility of our tool goes beyond simple structural analysis of proteins. It

can help in predicting the complete 3D structure of a protein, as the task can be

divided into smaller tasks, given the predicted domain structure of the protein. It

can have significant impact on structural genomics efforts. The high throughput

structural determination of proteins is more likely to succeed when the proteins are

broken into smaller, structurally stable units. Using our model to predict domain

boundaries can help in that aspect too. Finally, it is essential for the study of

proteins’ building blocks and for functional analysis.


There are several variations to the model described here that we consider intro-

ducing in the future. Although our algorithm is not overly sensitive to alignment

accuracy, obviously better multiple alignment algorithms are expected to improve

the performance. Since the system uses the domain-generator model to process

hypotheses, it is less sensitive to the exact details of the learning system; however,

replacing the neural network with another learning system (such as SVMs) might

also improve performance slightly. Another possible improvement is the integra-

tion of a weighting scheme into the multiple alignment. Currently all sequences

are weighted equally. However, due to the biased representation of protein families

in sequence databases and the nature of sequence comparison algorithms, diverged

sequences that might provide us with crucial information about domain bound-

aries are usually underrepresented in these alignments. To eliminate this bias one

should decrease the weight of highly similar sequences and increase the weight of

highly diverged sequences. Preliminary attempts in that direction (implementing

the schema described in [Henikoff & Henikoff 1994]) did not show a significant im-

provement; however, the results are not conclusive. Hopefully these variations will

further fine-tune the performance of our system.
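For concreteness, a minimal Python sketch of the position-based weighting scheme of [Henikoff & Henikoff 1994] mentioned above is given below; the alignment is assumed to be a list of equal-length strings, and gap characters are treated as ordinary symbols, a simplification that a production implementation would refine:

    def henikoff_weights(alignment):
        # Position-based sequence weights (Henikoff & Henikoff, 1994).
        # Each column contributes 1/(r*s) to a sequence's weight, where r is
        # the number of distinct residue types in the column and s the count
        # of that sequence's residue in the column.
        nseq, ncol = len(alignment), len(alignment[0])
        weights = [0.0] * nseq
        for c in range(ncol):
            column = [seq[c] for seq in alignment]
            counts = {res: column.count(res) for res in set(column)}
            r = len(counts)
            for i, res in enumerate(column):
                weights[i] += 1.0 / (r * counts[res])
        total = sum(weights)
        return [w / total for w in weights]

    # The diverged third sequence receives the largest weight:
    print(henikoff_weights(["ACDEF", "ACDEF", "AWYPF"]))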

Finally, our method can be easily extended to include structural information to

aid in the process of domain prediction. All it takes is to include these sequences

in the alignment. If the learning system recognizes a strong signal (e.g. sequence

termination) that is consistent with other sequences of unknown structure, a pre-

diction will be made that is in agreement with the structural information. This

approach can help in unifying manual expert-based approaches with more rigor-

ous information-content based methods, to produce more reliable predictions of

domains.


5.5 Acknowledgements

This work was done in collaboration with and under the guidance of Dr. Golan

Yona.


BIBLIOGRAPHY

[Murvai et al. 2001] Murvai, J., Vlahovicek, K., Szepesvari, C. & Pongor, S. (2001). Prediction of Protein Functional Domains from Sequences Using Artificial Neural Networks. Genome Res. 11, 1410-1417.

[Miyazaki et al. 2002] Miyazaki, S., Kuroda, Y. & Yokoyama, S. (2002). Characterization and prediction of linker sequences of multi-domain proteins by a neural network. J. Structural and Functional Genomics 15, 37-51.

[Altschul et al. 1997] Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25, 3389-3402.

[Apweiler et al. 2001] Apweiler, R. et al. (2001). The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucl. Acids Res. 29, 37-40.

[Bairoch & Apweiler 1999] Bairoch, A. & Apweiler, R. (1999). The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucl. Acids Res. 27, 49-54.

[Bateman et al. 1999] Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Finn, R. D. & Sonnhammer, E. L. (1999). Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucl. Acids Res. 27, 260-262.

[Black & Mould 1991] Black, S. D. & Mould, D. R. (1991). Development of Hydrophobicity Parameters to Analyze Proteins Which Bear Post or Cotranslational Modifications. Anal. Biochem. 193, 72-82.

[Csiszár] Csiszár, I. Information Theoretic Methods in Probability and Statistics. From citeseer.nj.nec.com.

[Dembo & Karlin 1991] Dembo, A. & Karlin, S. (1991). Strong limit theorems of empirical functionals for large exceedances of partial sums of i.i.d. variables. Ann. Prob. 19, 1737-1755.

[Ferran et al. 1994] Ferran, E. A., Pflugfelder, B. & Ferrara, P. (1994). Self-Organized Neural Maps of Human Protein Sequences. Protein Sci. 3, 507-521.

[George & Heringa 2002] George, R. A. & Heringa, J. (2002). Protein domain identification and improved sequence similarity searching using PSI-BLAST. Proteins 48, 672-681.

[George & Heringa 2002] George, R. A. & Heringa, J. (2002). SnapDRAGON: a method to delineate protein structural domains from sequence data. J. Mol. Biol. 316, 839-851.

[George et al. 1996] George, D. G., Barker, W. C., Mewes, H. W., Pfeiffer, F. & Tsugita, A. (1996). The PIR-International protein sequence database. Nucl. Acids Res. 24, 17-20.

[Gilbert & Glynias 1993] Gilbert, W. & Glynias, M. (1993). On the ancient nature of introns. Gene 135, 137-144.

[Gilbert et al. 1997] Gilbert, W., de Souza, S. J. & Long, M. (1997). Origin of genes. Proc. Natl Acad. Sci. USA 94, 7698-7703.

[Gouzy et al. 1999] Gouzy, J., Corpet, F. & Kahn, D. (1999). Whole genome protein domain analysis using a new method for domain clustering. Comput. Chem. 23, 333-340.

[Gracy & Argos 1998] Gracy, J. & Argos, P. (1998). Automated protein sequence database classification. I. Integration of compositional similarity search, local similarity search and multiple sequence alignment. II. Delineation of domain boundaries from sequence similarity. Bioinformatics 14:2, 164-187.

[Guan & Du 1998] Guan, X. & Du, L. (1998). Domain identification by clustering sequence alignments. Bioinformatics 14, 783-788.

[Ireland & Kullback 1968] Ireland, C. T. & Kullback, S. (1968). Contingency tables with given marginals. Biometrika 55, 179-189.

[Haft et al. 2001] Haft, D. H., Loftus, B. J., Richardson, D. L., Yang, F., Eisen, J. A., Paulsen, I. T. & White, O. (2001). TIGRFAMs: a protein family resource for the functional identification of proteins. Nucl. Acids Res. 29, 41-43.

[Henikoff & Henikoff 1992] Henikoff, S. & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA 89, 10915-10919.

[Henikoff & Henikoff 1994] Henikoff, S. & Henikoff, J. G. (1994). Position-based sequence weights. J. Mol. Biol. 243, 574-578.

[Henikoff & Henikoff 1996] Henikoff, J. G. & Henikoff, S. (1996). Using substitution probabilities to improve position-specific scoring matrices. Comp. App. Biosci. 12:2, 135-143.

[Holm & Sander 1994] Holm, L. & Sander, C. (1994). Parser for protein folding units. Proteins 19, 256-268.

[Hubbard et al. 1999] Hubbard, T. J., Ailey, B., Brenner, S. E., Murzin, A. G. & Chothia, C. (1999). SCOP: a Structural Classification of Proteins database. Nucl. Acids Res. 27, 254-256.

[Karlin & Altschul 1990] Karlin, S. & Altschul, S. F. (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl Acad. Sci. USA 87, 2264-2268.

[Kullback 1959] Kullback, S. (1959). "Information theory and statistics". John Wiley and Sons, New York.

[Kuroda et al. 2000] Kuroda, Y., Tani, K., Matsuo, Y. & Yokoyama, S. (2000). Automated search of natively folded protein fragments for high-throughput structure determination in structural genomics. Protein Sci. 9, 2313-2321.

[Lesk & Rose 1981] Lesk, A. M. & Rose, G. D. (1981). Folding units in globular proteins. Proc. Natl. Acad. Sci. USA 78, 4304-4308.

[Lin 1991] Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Trans. Info. Theory 37:1, 145-151.

[McGuffin et al. 2000] McGuffin, L. J., Bryson, K. & Jones, D. T. (2000). The PSIPRED protein structure prediction server. Bioinformatics 16, 404-405.

[Murzin et al. 1995] Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540.

[Orengo et al. 1997] Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. M. (1997). CATH - a hierarchic classification of protein domain structures. Structure 5, 1093-1108.

[Park & Teichmann 1998] Park, J. & Teichmann, S. A. (1998). DIVCLUS: an automatic method in the GEANFAMMER package that finds homologous domains in single- and multi-domain proteins. Bioinformatics 14:2, 144-150.

[Pazos et al. 1997] Pazos, F., Helmer-Citterich, M., Ausiello, G. & Valencia, A. (1997). Correlated mutations contain information about protein-protein interaction. J. Mol. Biol. 271, 511-523.

[Pearl 1997] Pearl, J. (1997). "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference." Morgan Kaufmann Publishers Inc., San Mateo, California.

[Pearson & Lipman 1988] Pearson, W. R. & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85, 2444-2448.

[Ponting et al. 1999] Ponting, C. P., Schultz, J., Milpetz, F. & Bork, P. (1999). SMART: identification and annotation of domains from signalling and extracellular protein sequences. Nucl. Acids Res. 27, 229-232.

[Rigden 2002] Rigden, D. J. (2002). Use of covariance analysis for the prediction of structural domain boundaries from multiple protein sequence alignments. Protein Eng. 15, 65-77.

[Rose 1979] Rose, G. D. (1979). Hierarchic organization of domains in globular proteins. J. Mol. Biol. 134, 447-470.

[Saxonov et al. 2000] Saxonov, S., Daizadeh, I., Fedorov, A. & Gilbert, W. (2000). EID: the Exon-Intron Database - an exhaustive database of protein-coding intron-containing genes. Nucl. Acids Res. 28, 185-190.

[Sonnhammer & Kahn 1994] Sonnhammer, E. L. L. & Kahn, D. (1994). Modular arrangement of proteins as inferred from analysis of homology. Protein Sci. 3, 482-492.

[Sonnhammer et al. 1997] Sonnhammer, E. L., Eddy, S. R. & Durbin, R. (1997). Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 28, 405-420.

[Sowdhamini & Blundell 1995] Sowdhamini, R. & Blundell, T. L. (1995). An automatic method involving cluster analysis of secondary structures for the identification of domains in proteins. Protein Sci. 4, 506-520.

[Taylor 1999] Taylor, W. R. (1999). Protein structural domain identification. Protein Eng. 12, 203-216.

[Westbrook et al. 2002] Westbrook, J., Feng, Z., Jain, S. et al. (2002). The Protein Data Bank: unifying the archive. Nucl. Acids Res. 30, 245-248.

[Wheelan et al. 2000] Wheelan, S. J., Marchler-Bauer, A. & Bryant, S. H. (2000). Domain size distributions can predict domain boundaries. Bioinformatics 16, 613-618.

[Yona & Levitt 2000b] Yona, G. & Levitt, M. (2000). Towards a complete map of the protein space based on a unified sequence and structure analysis of all known proteins. In the proceedings of ISMB 2000, 395-406, AAAI Press, Menlo Park.

[Yona et al. 1999] Yona, G., Linial, N. & Linial, M. (1999). ProtoMap: Automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space. Proteins 37, 360-378.

CHAPTER 6

FUTURE WORK

6.1 Extensions to the bagFFT algorithm

While the bagFFT algorithm is asymptotically the fastest known algorithm for

computing the p-value of the G^2 statistic, it can be slower than Hirji's algorithm

for small n and K. A possible improvement to bagFFT to remedy this can be

based on the csFFT technique described in Chapter 3. Also, extending bagFFT

for Pearson's X^2 statistic and for log-linear models, as well as a generalization to

two-column contingency tables are natural directions for future research in this

area [Baglivo et al., 1992].
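For reference, in standard textbook notation (observed counts n_1, ..., n_K summing to n, with null probabilities p_1, ..., p_K) the two statistics mentioned above are

G^2 = 2 \sum_{i=1}^{K} n_i \ln\frac{n_i}{n p_i}, \qquad X^2 = \sum_{i=1}^{K} \frac{(n_i - n p_i)^2}{n p_i}.

These are the generic forms; the exact conventions used earlier in this thesis may differ in minor details.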

6.2 Alignment significance in alternate models

An important assumption used in sFFT and csFFT is that alignment columns are

independently generated under the null hypothesis. This assumption is however

typically not borne out in genomic DNA that we would consider “random” (non-

coding sequences far away from regulatory regions). To correct for this many motif

finders use a higher-order Markov model for the null hypothesis [Liu et al., 2001,

Thijs et al., 2001, Bailey and Elkan, 1994]. The significance of the motif is then

evaluated by sampling from the distribution of motif scores obtained by the motif

finder on random sequences from the null model. The sampling process is however

very slow (it requires a call to the motif finder for every sample) and is not suitable

for the kind of optimization we do in Chapter 4. Extending the techniques in

Chapter 3 or designing new algorithms for this problem is an interesting open

problem.
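As a concrete illustration of the null model in question, the following minimal Python sketch scores a sequence under a k-th order Markov background; init_prob and cond_prob are hypothetical inputs that would be estimated from presumed-neutral genomic DNA, not parameters of any method described in this thesis:

    import math

    def markov_log_prob(seq, k, init_prob, cond_prob):
        # Log-probability of a DNA sequence under a k-th order Markov
        # background. init_prob maps each k-mer to the probability of the
        # sequence starting with it; cond_prob maps (context, base) pairs
        # to P(base | context). Both are hypothetical inputs that would be
        # estimated from presumed-neutral genomic DNA.
        logp = math.log(init_prob[seq[:k]])
        for i in range(k, len(seq)):
            logp += math.log(cond_prob[(seq[i - k:i], seq[i])])
        return logp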


Two related problems where techniques similar to those in Chapter 3 may

apply arise in the context of motif scanning. Here the motif model is known and

we wish to scan genomic DNA to find significant matches. While there exist

efficient solutions when the columns of the motif are independent, solving the

problem in the important case where motif columns are correlated is still an open

problem. With the availability of multiple genomes, recent work has explored the

use of phylogenetic models to simultaneously search for motif matches upstream of

orthologous genes [Moses et al., 2004]. The question of efficiently estimating the

significance of motifs under these phylogenetic models is also an interesting avenue

for future research.

6.3 Improvements to Conspv and Gibbspv

The motif finders in Chapter 4 were restricted to the assumption that every se-

quence has exactly one occurrence of the motif of interest. This is, however, an

unrealistic assumption, and in practice input sequences may have zero or multiple

occurrences of the motif. We hope to explore ideas similar to those in Chapter 4 in

this more general framework as part of our future work.

6.4 Improved protein domain delineation

In recent years there have been several studies on the subject of domain delineation

[Kim et al., 2005, Tanaka et al., 2006, Miyazaki et al., 2006, Liu and Rost, 2004,

Sim et al., 2005, Dumontier et al., 2005, Gewehr and Zimmer, 2006] and some of

them have followed our framework of using neural networks to analyze multiple

alignments for predicting protein domains [Liu and Rost, 2004, Sim et al., 2005].


While these methods typically report performance improvements over first genera-

tion tools such as Prodom, on an absolute scale the results are still unsatisfactory

and far from the goal of reliable domain delineation that is important for tasks such

as protein classification and predicting domain interactions. We believe that the

next generation of tools can be developed based on a combination of the following

ideas:

• Constructing multiple alignments in conjunction with domain delineation:

The method described in Chapter 5 currently works by constructing a multi-

ple alignment and then using the alignment to delineate domains. However,

the alignment process itself could greatly benefit from knowledge of where

the domain boundaries are. In the present setup, errors in the multiple-

alignment could propagate to errors in domain delineation, with no scope for

correction of the alignment in the presence of conflicting information from

the domain delineation step. Two obvious solutions that could work are:

1. Iteratively use the domain definitions from our method to improve the

multiple alignment and then use the new multiple alignment to get new

domain definitions until the process converges (see the sketch after this list).

2. Modify the scoring scheme in a progressive multiple alignment tool to

use domain delineation signals (such as the output from the neural

network in our method) from subalignments.

• Phylogenetic analysis of alignment columns: An important source of infor-

mation that is missing in the analysis done in our method is the evolutionary

tree that connects the sequences in the multiple alignment. For many of our

scores, taking the phylogeny into account would be valuable to weight the information obtained from various sequences. Typically, however, the phy-

logeny of the sequences in our alignments is unknown and we would need to

infer it computationally. While this is a difficult problem in itself, techniques

to integrate over phylogenies can help us cope with the uncertainty in the

phylogeny [Jin et al., 2006, Kosiol et al., 2006].

• Learning multiple models: Bagging and boosting are two commonly used

techniques in machine learning to improve the performance of a classifier

[Schwenk and Bengio, 2000]. These techniques could also be valuable in our

method if it is indeed the case that different domain families have different

sets of rules that define their domain boundaries.
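The sketch below, referenced in idea 1 above, shows one way the iterative alignment/delineation loop could be organized; align and delineate are hypothetical callables standing in for a boundary-aware multiple aligner and for the prediction method of Chapter 5, respectively:

    def refine_domains(sequences, align, delineate, max_iter=10):
        # Alternate between building a multiple alignment and delineating
        # domains until the definitions stop changing. `align` and
        # `delineate` are hypothetical callables: align(sequences, domains)
        # returns a multiple alignment, optionally guided by the current
        # boundaries, and delineate(alignment) returns predicted boundaries.
        domains = None
        for _ in range(max_iter):
            alignment = align(sequences, domains)
            new_domains = delineate(alignment)
            if new_domains == domains:  # converged
                break
            domains = new_domains
        return domains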


BIBLIOGRAPHY

[Baglivo et al., 1992] Baglivo, J., Olivier, D. and Pagano, M. (1992) Methods for exact goodness-of-fit tests. Journal of the American Statistical Association, 87 (418), 464-469.

[Bailey and Elkan, 1994] Bailey, T. L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol, 2, 28-36.

[Dumontier et al., 2005] Dumontier, M., Yao, R., Feldman, H. J. and Hogue, C. W. V. (2005) Armadillo: domain boundary prediction by amino acid composition. J Mol Biol, 350 (5), 1061-1073.

[Gewehr and Zimmer, 2006] Gewehr, J. E. and Zimmer, R. (2006) SSEP-Domain: protein domain prediction by alignment of secondary structure elements and profiles. Bioinformatics, 22 (2), 181-187.

[Jin et al., 2006] Jin, G., Nakhleh, L., Snir, S. and Tuller, T. (2006) Inferring Phylogenetic Networks by the Maximum Parsimony Criterion: A Case Study. Mol Biol Evol.

[Kim et al., 2005] Kim, D. E., Chivian, D., Malmström, L. and Baker, D. (2005) Automated prediction of domain boundaries in CASP6 targets using Ginzu and RosettaDOM. Proteins, 61 Suppl 7, 193-200.

[Kosiol et al., 2006] Kosiol, C., Bofkin, L. and Whelan, S. (2006) Phylogenetics by likelihood: evolutionary modeling as a tool for understanding the genome. J Biomed Inform, 39 (1), 51-61.

[Liu and Rost, 2004] Liu, J. and Rost, B. (2004) Sequence-based prediction of protein domains. Nucleic Acids Res, 32 (12), 3522-3530.

[Liu et al., 2001] Liu, X., Brutlag, D. L. and Liu, J. S. (2001) BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput, 127-138.

[Miyazaki et al., 2006] Miyazaki, S., Kuroda, Y. and Yokoyama, S. (2006) Identification of putative domain linkers by a neural network - application to a large sequence database. BMC Bioinformatics, 7, 323.

[Moses et al., 2004] Moses, A. M., Chiang, D. Y., Pollard, D. A., Iyer, V. N. and Eisen, M. B. (2004) MONKEY: identifying conserved transcription-factor binding sites in multiple alignments using a binding site-specific evolutionary model. Genome Biol, 5 (12), R98.

[Schwenk and Bengio, 2000] Schwenk, H. and Bengio, Y. (2000) Boosting neural networks. Neural Comput, 12 (8), 1869-1887.

[Sim et al., 2005] Sim, J., Kim, S. Y. and Lee, J. (2005) PPRODO: prediction of protein domain boundaries using neural networks. Proteins, 59 (3), 627-632.

[Tanaka et al., 2006] Tanaka, T., Yokoyama, S. and Kuroda, Y. (2006) Improvement of domain linker prediction by incorporating loop-length-dependent characteristics. Biopolymers, 84 (2), 161-168.

[Thijs et al., 2001] Thijs, G., Lescot, M., Marchal, K., Rombauts, S., Moor, B. D., Rouzé, P. and Moreau, Y. (2001) A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics, 17 (12), 1113-1122.