
DISCRIMINATIVE STRUCTURED MODELS FOR BIOLOGICAL

SEQUENCE ANALYSIS

A DISSERTATION

SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Chuong B. Do

September 2009


© Copyright by Chuong B. Do 2009

All Rights Reserved


I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

(Andrew Y. Ng) Principal Advisor

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

(Serafim Batzoglou)

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

(Arend Sidow)

Approved for the University Committee on Graduate Studies.


Abstract

Making predictions is a key element in many computational biology applications: given a set of input biological sequences, use an inference procedure to generate some corresponding predicted output. The prediction process involves defining an appropriate scoring model for comparing alternative output predictions, developing efficient inference algorithms for choosing high-scoring outputs, and choosing scoring models so that the generated predictions are biologically meaningful. Traditionally, research in predictive methods for biology has focused on effective inference algorithms for a variety of possible scoring models. In practice, however, the methods used to estimate the scoring parameters for these models often rely on an eclectic combination of techniques, ranging from ad hoc statistical analysis and physicochemical arguments to manual trial-and-error.

In this thesis, we consider the problem of scoring parameter estimation for three key problems in computational biology: protein sequence alignment, RNA secondary structure prediction, and RNA simultaneous folding and alignment. We formulate the model estimation task as a special class of supervised machine learning problems where the goal is to learn a mapping from a structured input space (e.g., amino acid or RNA sequences) to a structured output space (e.g., alignments or foldings). Under this framework, the problem of model estimation reduces to solving a convex optimization problem. Following this setup, we design structured probabilistic or max-margin models for each task. To allow our algorithms to scale efficiently to large-scale training sets, we develop new fast online and batch convex optimization algorithms specially tailored for learning structured models. We also develop an automated approach for designing custom regularization penalties to prevent overfitting in feature-rich scoring models.

The resulting software packages for alignment (CONTRAlign), secondary structure prediction (CONTRAfold), and simultaneous alignment and folding (RAF) each obtain state-of-the-art accuracy in their respective domains. In particular, our alignment algorithm, CONTRAlign, obtains substantially improved sensitivity for the difficult class of “twilight zone” alignments. Our RNA secondary structure prediction algorithm, CONTRAfold, achieves higher general accuracy than existing classical methods, demonstrating for the first time that a statistically estimated scoring model can outperform thermodynamic approaches. Finally, our RNA simultaneous folding and alignment program, RAF, achieves high accuracies while also taking advantage of new sparsity heuristics to achieve running times orders of magnitude faster than previous approaches.


To my parents, Luan and Lee Do.


Acknowledgments

When I arrived on the campus of Stanford University in the fall of 2000, I had few designs of pursuing a graduate degree, much less studying computer science. During my time at Stanford University, I have had the fortune of meeting so many wonderful individuals, people who had a hand in shaping my aspirations, ideas, and experiences.

First, I warmly thank my advisor, Andrew Ng, for his thoughtful guidance and support throughout the course of my graduate career. To anyone who has had the privilege of working with him, Andrew’s keen technical prowess, clarity of thought in both conversation and writing, and unflagging excitement for research are obvious. Even more impressive than these, however, are his fearlessness in tackling tough challenges head on, his uncompromising efficiency in obtaining results, and his caring dedication toward his students. All these lessons I will cherish and take with me in my post-graduate career.

Second, I also thank my co-advisor, Serafim Batzoglou, for his advice and support throughout these years. The summer I spent in Serafim’s group following my undergraduate sophomore year was largely responsible for convincing me to pursue an advanced degree in computer science. I am grateful to Serafim for providing me with so many opportunities to take responsibility in big projects, for fostering a fun and productive lab environment, and for always maintaining an infectious optimistic attitude in any endeavor.

I thank the members of my thesis committee (Andrew, Serafim, and Arend Sidow) as well as my defense committee members (Daphne Koller and Rob Tibshirani) for their insightful commentary and advice. Rob took on the task of chairing my defense committee, Daphne has always been an inspirational example in both teaching and research, and Arend has always been a great source of biological insight and candid commentary in all matters scientific.


As a Ph.D. student in a cross-disciplinary field, I found myself with two homes: one in the William Gates Computer Science Building as part of the Artificial Intelligence (AI) lab, and another in the James H. Clark Center as part of the Computational Biology labs. Luckily, I had the fortune of interacting with and learning from a number of energetic and amazing individuals in both places.

In the Gates Building, I had the privilege of sharing an office with Rajat Raina and Rion Snow, and I also had the pleasure of spending time with other students in the Ng group, the AI lab, and the computer science Ph.D. program in general: Pieter Abbeel, David Arthur, Vasco Chatalbashev, Adam Coates, Zico Kolter, Quoc Le, Su-In Lee, Honglak Lee, Morgan Quigley, Olga Russakovsky, Ashutosh Saxena, Yirong Shen, David Vickrey, Thuc Vu, and Haidong Wang. As a student coach of Stanford’s Association for Computing Machinery International Collegiate Programming Contest (ACM-ICPC) team, I had a great time working with Sonny Chan, Jerry Cain, Claire Stager, Miles Davis, and the various members of the Stanford programming teams over the years.

Similarly, my experience in the Clark Center would also not have been the same without the company of the extended Batzoglou lab: Sarah Aerni, George Asimenos, Leticia Britos, Mike Brudno, Tiffany Chen, Annie Chiang, Eugene Davydov, Omkar Deshpande, Bob Edgar, Jason Flannick, Eugene Fratkin, Chuan-Sheng Foo, Greg Goldgof, Samuel Gross, Clare Kasemset, Aswath Manoharan, Antal Novak, Jesse Rodriguez, Marc Schaub, Marina Sirota, Balaji Srinivasan, Andreas Sundquist, and Daniel Woods.

In particular, I would like to give special mention to a number of individuals who played a particularly important role in shaping my growth as a scientist, researcher, and individual. I had the pleasure of working under the tutelage of Mike Brudno as an undergraduate, from whom I inherited a passion for algorithmic thinking. My discussions with Bob Edgar instilled in me the recognition that good science is not just about having novel ideas but equally as much about disseminating these ideas to the scientific community. Much of the work in this thesis I would not have even pursued without the encouragement of Sam Gross, who taught me that technical skills are only a small part of being a good scientist; equally important is the courage to believe in one’s own ideas and the mental fortitude to see them through, even in the face of skeptics and detractors.


Finally, I would like to thank my family for their unconditional love and support throughout this entire process. In particular, I thank my parents, Luan and Lee, and my brother, Tim, whose daily sacrifices gave me the opportunity to pursue my dreams, and who never failed to gently remind me that there is life beyond school. And finally, to Karen Lee, my patient companion, my devoted confidant, and my ever-willing accomplice in life and love, your constant faith in me gave me the will to move forward; thank you for providing the inspiration and strength that made this thesis possible.


Contents

Abstract

Acknowledgments

1 Introduction
   1.1 Contributions
   1.2 Outline
   1.3 Previous work

2 Background
   2.1 Protein sequence alignment
      2.1.1 Problem definition
      2.1.2 Computational formulation
   2.2 RNA secondary structure prediction
      2.2.1 Problem definition
      2.2.2 Computational formulation
   2.3 RNA simultaneous alignment and folding
   2.4 Structured prediction

3 Protein sequence alignment
   3.1 Introduction
   3.2 Methods
      3.2.1 Pair-HMMs for sequence alignment
      3.2.2 From pair-HMMs to pair-CRFs
      3.2.3 Pairwise alignments with CONTRAlign
   3.3 Results
      3.3.1 Cross-validation methodology
      3.3.2 Comparison of model topologies and feature sets
      3.3.3 Comparison to modern sequence alignment tools
      3.3.4 Regularization and generalization performance
      3.3.5 Alignment accuracy in the “twilight zone”
   3.4 Discussion

4 RNA secondary structure prediction
   4.1 Introduction
   4.2 Methods
      4.2.1 Modeling secondary structure with SCFGs
      4.2.2 From SCFGs to CLLMs
      4.2.3 From energy-based models to CLLMs
      4.2.4 The CONTRAfold model
      4.2.5 Maximum expected accuracy parsing with sensitivity/specificity tradeoff
   4.3 Results
      4.3.1 Comparison to generative training
      4.3.2 Comparison to other methods
      4.3.3 Feature assessment
      4.3.4 Learned versus measured parameters
   4.4 Discussion

5 Hyperparameter learning
   5.1 Introduction
   5.2 Preliminaries
   5.3 Learning multiple hyperparameters
   5.4 The hyperparameter gradient
      5.4.1 Deriving the hyperparameter gradient
      5.4.2 Computing the hyperparameter gradient efficiently
   5.5 Experiments
   5.6 Discussion and related work

6 Simultaneous alignment and folding
   6.1 Introduction
   6.2 Methods
      6.2.1 The RAF scoring model
      6.2.2 Fast pairwise alignment and folding
      6.2.3 Extension to multiple alignment
      6.2.4 A max-margin framework
   6.3 Results
      6.3.1 Alignment and base-pairing constraints
      6.3.2 Evaluation metrics
      6.3.3 Comparison of accuracy
   6.4 Discussion

7 Proximal regularization
   7.1 Introduction
   7.2 Preliminaries
   7.3 Online proximal learning
      7.3.1 Proximal regret bound
      7.3.2 Choosing proximal parameters
      7.3.3 Application: Linear SVMs
      7.3.4 An optimistic strategy
   7.4 Batch proximal learning
   7.5 Experiments
      7.5.1 Online learning with binary classification
      7.5.2 Batch learning with RNA folding and web ranking
   7.6 Discussion

8 Conclusion

A Appendix for Chapter 4
   A.1 Preliminaries
   A.2 Basic feature set
      A.2.1 Hairpins
      A.2.2 Single-branched loops
      A.2.3 Helices
      A.2.4 Multi-branched loops
   A.3 The Viterbi algorithm
      A.3.1 Definitions
      A.3.2 Recurrences
   A.4 The inside algorithm
   A.5 The outside algorithm
   A.6 Posterior decoding
   A.7 Gradient

B Appendix for Chapter 6
   B.1 RAF features
   B.2 The RAF inference engine
      B.2.1 Recurrences
      B.2.2 Exploiting base-pairing sparsity
      B.2.3 Exploiting alignment sparsity
   B.3 Norm bound

C Appendix for Chapter 7
   C.1 Proof of Proposition 1
   C.2 Proof of Theorem 2
   C.3 Proof of Corollary 3
   C.4 Proof of Theorem 3
   C.5 Proof of Lemma 2
   C.6 Proof of Theorem 4
   C.7 Proof of Lemma 3
   C.8 Proof of Proposition 3

List of Tables

2.1 The twenty naturally occurring amino acids.
3.1 Comparison of CONTRAlign variants.
3.2 Comparison of modern alignment methods.
4.1 Comparison of generative and discriminative model structure prediction accuracy.
4.2 Comparison of MEA and Viterbi structure prediction accuracy.
4.3 Accuracies of leading secondary structure prediction methods.
4.4 Performance of CONTRAfold relative to leading secondary structure prediction methods.
4.5 Ablation analysis of CONTRAfold model.
5.1 Grouped hyperparameters learned using our algorithm for each of the two cross-validation folds.
6.1 Comparison of computational complexity of RNA simultaneous folding and alignment algorithms.
7.1 Convergence of Pegasos, Adaptive, and Proximal on nine binary classification tasks.

List of Figures

2.1 Alignment of human myoglobin and human hemoglobin β-subunit.
3.1 Traditional sequence alignment model.
3.2 CONTRAlign model variants.
3.3 Alignment accuracy curves.
4.1 Position label notation in CONTRAfold.
4.2 Scoring model for hairpin loops.
4.3 Scoring model for internal and bulge loops.
4.4 Scoring model for stacking pairs and helices.
4.5 Scoring model for multi-branch loops.
4.6 ROC plot comparing sensitivity and specificity for several RNA structure prediction methods.
4.7 Comparison of learned and experimentally measured stacking energies.
5.1 Pseudocode for gradient computation.
5.2 HMM simulation state diagram.
5.3 HMM performance when varying number of features.
5.4 HMM performance when varying number of training examples.
5.5 RNA secondary structure prediction.
6.1 Sparsity patterns in posterior probability matrices.
6.2 Trade-off between sparsity factor and proportion of reference base-pairings or aligned matches covered when varying the cutoffs ε_paired and ε_aligned.
6.3 Performance comparison on BRAliBASE II datasets.
6.4 Performance comparison on MASTR benchmarking sets.
7.1 Convergence of Pegasos, Adaptive, and Proximal for combined and covtype.
7.2 Test errors of Pegasos, Adaptive, and Proximal for combined and covtype during the course of optimization.
7.3 Comparison of a standard bundle method to a proximal bundle method for SVM structured learning with various choices of λ.
A.1 Position label notation in CONTRAfold.
A.2 List of all potentials used in the CONTRAfold model.
A.3 Hairpin loop diagram.
A.4 Single-branched internal loop diagram.
A.5 Helix diagram.
A.6 Multi-branched loop diagram.

Chapter 1

Introduction

Throughout the last decade, the defining characteristic of progress in molecular biology research has been a rapid increase in the scale of biological experiments. Today, a single microarray experiment provides expression measurements for thousands of genes subject to hundreds of conditions; similarly, high-throughput sequencing technologies allow quantitative analysis of environmental samples containing thousands of microbial organisms. With the rate of data collection far exceeding the capacity of humans to analyze these data manually, computational techniques have assumed an increasingly important role in modern biology. Computational systems are instrumental not only in processing and organizing experimental data (i.e., the study of bioinformatics) but also in performing analyses with the purpose of yielding novel biological insights (i.e., the study of computational biology).

Developing algorithmic methods for the analysis of biological data involves a number of significant challenges. First, one must identify the biological question to be answered and provide a mathematical formulation that makes clear the modeling assumptions being made and the overall goal of the computation. Next, one must devise appropriate computational methods, including data structures and algorithms, needed to perform the desired analysis. Finally, one must fit the model to real data so that the resulting inferences made using the learned model will be predictive of biological reality.

For many problems in computational biology, appropriate computational models have matured over time, and efficient algorithms for making inferences have been the topic of active research for decades. Parameter estimation methods for these tasks, however, traditionally draw on a diverse array of techniques applied in ad hoc or heuristic ways. These techniques, while problem-specific and requiring a significant amount of biological insight, often yield suboptimal accuracy in practice, or parameters that overfit training data.

Consider, for example, the following classic applications in computational biology:

• Protein sequence alignment. In the problem of protein sequence alignment, one is given a collection of amino acid sequences representing proteins, and the goal is to identify groups of amino acid residues from these sequences that are related either functionally or evolutionarily. Computational formulations of sequence alignment have traditionally relied on scoring models and algorithms which derive largely from the concept of “edit or Levenshtein distance” (Levenshtein, 1966) in computer science. Efficient algorithms for computing optimal sequence alignments in the pairwise case date back to the classic dynamic programming method of Needleman and Wunsch (1970). However, estimation of parameters for the scoring model used in the Needleman-Wunsch algorithm is often considered an “art,” requiring tedious manual adjustment to obtain good accuracy.

• RNA secondary structure prediction. In the problem of RNA secondary structure prediction, one is given a single RNA sequence, and the goal is to identify the pattern of base-pairings that will form when the RNA folds in vivo. Again, the traditional algorithms for RNA folding date back to the dynamic programming algorithm of Nussinov and Jacobson (1980) and its later variants based on thermodynamic models (Tinoco et al., 1971). For RNA folding algorithms, parameter estimation has traditionally relied on the ability of scientists to perform detailed physicochemical experiments in order to isolate the free energies of specific types of structural interactions. Because of the difficulty of performing these experiments in practice, some of the parameters in traditional thermodynamic models are actually manually tuned via trial-and-error (Mathews et al., 1999).

• RNA simultaneous alignment and folding. In the problem of RNA simultaneous alignment and folding, one is given a pair of RNA sequences which are assumed to share some common structural fold, and the goal is to predict their consensus folding (i.e., align the nucleotide bases from each sequence, and predict the base-pairings that are common between the secondary structures of the two sequences). Historically, most of the effort in simultaneous alignment and folding approaches for RNA has focused squarely on the development of efficient algorithms for inference, largely because of the practical intractability of straightforward dynamic programming algorithms for this task. Conversely, parameter estimation methods for simultaneous folding and alignment algorithms are rarely addressed in the literature, with most researchers going to great pains to ensure that their models have extremely few parameters so as to facilitate manual tuning.

In this thesis, we consider the problem of parameter estimation in each of these three key problems: protein sequence alignment, RNA secondary structure prediction, and RNA simultaneous folding and alignment. We provide a unifying framework that addresses the parameter estimation task in computational biology problems with principled machine learning-based optimization procedures. At a high level, the goal of our work is to establish a standard set of techniques applicable to a wide variety of problems arising in biological sequence analysis that replaces the guesswork usually associated with model fitting. At a low level, we also develop new machine learning algorithms with the goal of fine-tuning specific aspects of our models for maximum performance.

The use of machine learning techniques in computational biology is not new. Pair hidden Markov models (pair-HMMs), for instance, have been used for sequence alignment (Durbin et al., 1998), and stochastic context-free grammars (SCFGs) have long been popular methods for RNA folding (Eddy and Durbin, 1994; Knudsen and Hein, 1999). Previous applications of machine learning techniques to these problems, however, tended to produce models that failed to give significant improvements in accuracy compared to existing computational techniques. For example, in RNA folding, SCFG-based approaches that were proposed as an alternative to traditional thermodynamics-based models tended to give lower accuracy than their traditional counterparts (Dowell and Eddy, 2004). As a consequence, proponents of these early methods tended to emphasize that this reduced general accuracy was a small price to pay for the convenience of having a machine learning framework.


In this thesis, we instead place a high premium on the accuracy of the machine learning models we learn. Our discriminative parameter learning framework not only provides a convenient suite of statistical estimation tools but also yields performance competitive with, and in many cases exceeding, state-of-the-art accuracy. Specifically, in each case, we formulate the model estimation task as a special class of supervised machine learning problems where the goal is to learn a mapping from a structured input space (e.g., amino acid or RNA sequences) to a structured output space (e.g., alignments or foldings). Under this framework, the problem of model estimation reduces to solving a convex optimization problem.
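To make this framework concrete, one standard instantiation of the resulting convex program, for the conditional log-linear (pair-CRF/CLLM) models used in Chapters 3 and 4, is L2-regularized conditional maximum likelihood; the notation below is our own sketch, not a verbatim formulation from later chapters:

$$\min_{\mathbf{w}} \; -\sum_{i=1}^{m} \log P\big(y^{(i)} \mid x^{(i)}; \mathbf{w}\big) + \frac{C}{2}\,\lVert \mathbf{w} \rVert^2, \qquad P(y \mid x; \mathbf{w}) \propto \exp\big(\mathbf{w}^{\top}\mathbf{F}(x, y)\big),$$

where the $(x^{(i)}, y^{(i)})$ are training pairs of input sequences and reference outputs, $\mathbf{F}(x, y)$ is a vector of feature counts over a candidate output, and $C$ is a regularization hyperparameter; both terms are convex in $\mathbf{w}$.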

To solve the convex optimization problems associated with learning structured models, we start with off-the-shelf optimization tools, and rely on parallelized implementations running on large clusters in order to train our models. These types of approaches, while appropriate for medium-scale datasets, struggle for larger-scale problems, so in response, we then develop new optimization algorithms that are well-suited for efficient online or batch learning with large-scale datasets. In addition to the optimization challenges posed in learning structured models, the abundance of parameters in these models can lead to overfitting. To address this, we develop new cross-validation–based algorithms for automatically designing regularization penalties to prevent overfitting.

1.1 Contributions

This thesis makes a number of contributions covering topics in both machine learning and computational biology. On the algorithmic side, our machine learning contributions include the development of a number of techniques for learning structured models:

1. Fast algorithms for online and batch learning (Chapter 7): We describe new algorithms for faster training of supervised learning models. Our algorithms rely on the concept of proximal regularization: by adding curvature to existing objective functions in an adaptive manner, they improve the practical convergence rates of both online and batch optimization algorithms. (A sketch of the update appears after this list.)

2. Optimization-based methods for preventing model overfitting (Chapter 5): In many machine learning applications, regularization penalties provide a mechanism for capacity control. Typically, the strengths of these regularization penalties are in turn controlled by a collection of values known as hyperparameters. We develop algorithms for choosing these hyperparameters in a general class of discriminative probabilistic models, whose instances range from logistic regression models to discriminatively-trained Markov networks.
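As a sketch of the proximal idea promised in the first item (the notation is ours; the precise online and batch variants appear in Chapter 7), each iteration augments the current objective $f_t$ with a quadratic term centered at the previous iterate:

$$\theta_{t+1} = \operatorname*{arg\,min}_{\theta}\; f_t(\theta) + \frac{\lambda_t}{2}\,\lVert \theta - \theta_t \rVert_2^2.$$

Even when $f_t$ has little curvature, the added quadratic makes each subproblem strongly convex, which is the mechanism behind the improved practical convergence rates.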

On the applied side, our contributions include the formulation of appropriate structured estimation methods for a variety of problems in computational biology, and the application of some of the techniques described above to these models. Specifically, we focus on the application domains of protein sequence alignment, RNA secondary structure prediction, and RNA simultaneous alignment and folding:

3. Protein sequence alignment (Chapter 3): We describe a pair conditional random field (pair-CRF)-based approach to sequence alignment that replaces manual tweaking of traditional ad hoc scoring models with a simple optimization-based framework for parameter learning. We present a program, CONTRAlign, that leverages the power of the pair-CRF modeling approach to explore a variety of novel scoring features.

4. RNA secondary structure prediction (Chapter 4): Analogous to our protein alignment strategy, we describe a conditional log-linear model (CLLM)-based approach to single-sequence RNA secondary structure prediction that replaces the parameters in existing thermodynamic models of RNA structure with statistically estimated folding potentials. We present a program, CONTRAfold, that demonstrates for the first time that statistically learned RNA secondary structure prediction parameters can in fact outperform standard physics-based methods.

5. RNA simultaneous alignment and folding (Chapter 6): We consider an extension of our previous two efforts to the more difficult problem of simultaneous alignment and folding of two RNA sequences. In this work, we develop new sparsity-based algorithms for fast alignment and folding, and a novel max-margin framework for parameter learning. We present RAF, a tool for fast alignment and folding of RNA sequences that not only achieves state-of-the-art accuracy but also achieves running times orders of magnitude faster than previous approaches.

1.2 Outline

The remainder of this thesis is organized as follows. In Chapter 2, we provide a basic primer on the biological and machine learning topics addressed in this thesis; our overview covers the problems of protein sequence alignment, RNA folding, and RNA simultaneous alignment and folding, and gives a brief introduction to the goals of the structured prediction paradigm in machine learning. In Chapter 3, we describe the application of our approach to protein sequence alignment using pair conditional random fields (pair-CRFs), where we present the program CONTRAlign. In Chapter 4, we tackle the problem of single-sequence RNA secondary structure prediction using conditional log-linear models (CLLMs), where we present the program CONTRAfold. In Chapter 5, we revisit these two applications while describing a new gradient-based method for adjusting the regularization penalties of structured prediction models in order to prevent overfitting. In Chapter 6, we address the problem of RNA simultaneous alignment and folding using a max-margin approach, where we present the program RAF. In Chapter 7, we discuss novel online and batch algorithms for learning structured prediction models, which allow us to extend our methodologies to larger-scale datasets. Finally, in Chapter 8, we conclude with a discussion of the applications and limitations of the work, and some ideas for future directions.

1.3 Previous work

Many of the chapters in this thesis are based closely on previous publications that appeared in a number of conferences and journals. Below, we provide a reference for each of the relevant chapters:

• Chapter 3 is based on material from Do et al. (2006a), as presented at the Tenth Annual International Conference on Research in Computational Molecular Biology (RECOMB 2006).

• Chapter 4 is based on material from Do et al. (2006b), as presented at the Fourteenth Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2006).

• Chapter 5 is based on material from Do et al. (2007), as presented at the Twentieth Annual Conference on Neural Information Processing Systems (NIPS 2006).

• Chapter 6 is based on material from Do et al. (2008), as presented at the Sixteenth Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2008).

• Chapter 7 is based on material from Do et al. (2009), as presented at the Twenty-sixth Annual International Conference on Machine Learning (ICML 2009).

In all of these publications, the dissertation author was responsible for the bulk of the design, implementation, evaluation, and write-up of the described algorithms.


Chapter 2

Background

The material in this thesis draws on a diverse range of topics in both computational biology and machine learning. In this chapter, we begin by giving a brief overview of the three biological problems addressed in this work. For each problem, we give a concise summary of the underlying biological principles and goals, a description of the standard computational formulation, and a sketch of traditional methods for parameter estimation. After introducing each problem, we then turn our attention to the structured prediction paradigm in machine learning, where we give a brief overview of the goals and main ideas involved.

2.1 Protein sequence alignment

Proteins are biological macromolecules that play important roles in nearly every process in living cells. In the classical biological view, proteins are considered the workhorses of the cell, taking on a wide variety of tasks critical for life. Some proteins, such as alcohol dehydrogenase, act as catalysts for controlling the rate of chemical reactions involved in metabolism. Other proteins, such as keratin, play structural roles in organizing the cytoskeleton of a cell. Proteins, such as actin, can also act as molecular motors in driving cell motility.

Chemically, a protein (also known as a polypeptide) is a long linear chain of amino acids joined together by specific types of chemical bonds known as peptide linkages. Each amino acid unit has four main components: an amino group (NH2), a carboxyl group (COOH), an organic substituent (often abbreviated as R), and a central carbon atom to which the other three components are attached. There are twenty naturally occurring types of amino acids, distinguished by their attached organic substituents; for convenience, amino acid residues in a protein sequence are often referred to using their one-letter abbreviations, as shown in Table 2.1.

Amino acid       Abbreviation    Amino acid       Abbreviation
Alanine          A               Leucine          L
Arginine         R               Lysine           K
Asparagine       N               Methionine       M
Aspartic acid    D               Phenylalanine    F
Cysteine         C               Proline          P
Glutamic acid    E               Serine           S
Glutamine        Q               Threonine        T
Glycine          G               Tryptophan       W
Histidine        H               Tyrosine         Y
Isoleucine       I               Valine           V

Table 2.1: The twenty naturally occurring amino acids.

In a protein polypeptide, the amino and carboxyl groups, along with the central carbon, form the backbone of the amino acid chain. The chemical properties associated with each organic substituent in the polypeptide, however, dictate how the protein folds within a cell. On the basis of these organic groups, some amino acids may be considered to have bulky side chains, which impose strong steric constraints on protein structure, while other amino acids may have extremely hydrophilic (i.e., water-loving) or hydrophobic (i.e., water-fearing) side chains that force certain orientations of the amino acid with respect to the aqueous environment of the cell.

The ordering of amino acids in a polypeptide chain, known as the primary structure of a protein, is the principal determinant of the protein’s higher-order structural properties: the physicochemical properties and ordering of the various amino acid residues of a protein chain determine the local substructures (i.e., secondary structure) that form in aqueous solution, as well as the three-dimensional arrangement of these substructures with respect to each other (i.e., tertiary structure). For more detail, we refer the interested reader to Chapter 3 of Alberts et al. (2002).


2.1.1 Problem definition

Here, we focus on a specific computational problem associated with the analysis of protein sequences known as protein sequence alignment. In the most general formulation of sequence alignment, one is given a collection of two or more protein amino acid sequences which are known to be similar in some respect; for example, the proteins may be homologous (i.e., share a common evolutionary ancestor), structurally similar (i.e., fold into similar three-dimensional configurations), or functionally analogous (i.e., have similar responsibilities in the cell). A sequence alignment (see Figure 2.1) is a visual comparison of these proteins that highlights amino acid similarities by displaying the amino acid sequence for each protein on a single line, and inserting gap characters (-) so that all sequences have the same length. The resulting matrix of characters has rows corresponding to each original input sequence, and no column is allowed to contain only gap characters (Do and Katoh, 2008).
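These two defining properties (equal row lengths, and no column made up entirely of gaps) are easy to check mechanically. Below is a minimal sketch of our own, using hypothetical helper functions that are not part of any tool described in this thesis:

```python
def is_valid_global_alignment(row_a: str, row_b: str) -> bool:
    """Check the two defining properties of a pairwise global alignment:
    equal row lengths, and no column consisting only of gap characters."""
    if len(row_a) != len(row_b):
        return False
    return all(ca != '-' or cb != '-' for ca, cb in zip(row_a, row_b))

def ungap(row: str) -> str:
    """Removing the gaps from a row must recover the original sequence."""
    return row.replace('-', '')

assert is_valid_global_alignment("M-GLSDGEW", "MVHLTPEEK")
assert not is_valid_global_alignment("AC--G", "A--CG")  # column 3 is all gaps
```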

Most importantly, whenever an alignment column contains more than one non-gap character, all of the non-gap characters in that column are considered to be “equivalent” residues in the respective sequences. Here, equivalence may have multiple meanings depending on the context in which the protein alignment is created. For example, when dealing with homologous proteins, equivalent residues are those amino acids which are assumed to have derived from the same amino acid in some ancestral sequence; hence, these are often known as homologous residues. Other types of equivalence, relating to structurally or functionally similar proteins, may also be defined (Morrison, 2006).

The type of alignment described above is specifically known as a global alignment, in that every character of every sequence is present in the alignment, and the order in which amino acid residues appear in a row of the alignment is identical to the order in which they appear in the original unaligned protein primary structure. Other variants of protein alignment exist, however. In a local alignment, the characters in an alignment row need only correspond to a substring of the original protein; in alignment with rearrangements and repeats, the characters in a row may appear out of order, or may be repeated several times. In this thesis, we concern ourselves only with the simple case of global alignment. Furthermore, we focus on the case of global alignment where the set of sequences being compared contains only two sequences, known as pairwise global alignment; our methods extend, with some modifications, to the multiple sequence case.

PROBCONS version 1.12 multiple sequence alignment

sp|P02144|MYG_HUMAN   M-GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKAS
sp|P68871|HBB_HUMAN   MVHLTPEEKSAVTALWGKVNVDEV--GGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGN
                      * *: * . * :****:.* * *.* **: :* * . *:.* .*.: * : ..

sp|P02144|MYG_HUMAN   EDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSK
sp|P68871|HBB_HUMAN   PKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHH
                      .:* ** .** *:.. * : .: :. : .*:: *. * :: : :.::.: :: ** :

sp|P02144|MYG_HUMAN   HPGDFGADAQGAMNKALELFRKDMASNYKELGFQG
sp|P68871|HBB_HUMAN   FGKEFTPPVQAAYQKVVAGVANALAHKYH------
                      . :* . .*.* :*.: . : :* :*:

Figure 2.1: Alignment of human myoglobin and human hemoglobin β-subunit. The columns of the alignment have been split across three rows in order to fit on the page. The annotation symbols (*, :, and .) are optional characters used to indicate the extent of similarity between the paired amino acids in the corresponding column of the alignment. The sequences were obtained from http://www.uniprot.org/ and the alignment performed using our PROBCONS server at http://probcons.stanford.edu/.

Before addressing the computational specifics of protein sequence alignment, we first consider some motivating reasons why a biologist might care to obtain a protein sequence alignment in the first place. Sequence alignments arise in a number of situations:

1. A common goal in many biological studies is to study the structure of a protein obtained from a newly sequenced organism. Structure determination methods, such as X-ray crystallography and NMR spectroscopy, are difficult and costly undertakings. Furthermore, accurate de novo structure prediction is an extremely difficult task. In this situation, having a detailed list of correspondences between sequence positions in the novel protein and sequence positions in a homologous protein whose structure has previously been characterized would provide strong clues as to the structure of the novel protein (Marti-Renom et al., 2000).

2. Taxonomy is the practice of classifying organisms. Before the era of modern biology, taxonomists traditionally relied on morphological similarities in order to group organisms into various classes. Today, taxonomists rely heavily on the practice of cladistics: studying the genetic code of organisms in order to construct hierarchical diagrams (i.e., phylogenies) characterizing the evolutionary relationships among a collection of organisms. To accomplish this requires methods for determining equivalent positions in sequences from different organisms (Morrison, 2006).

3. A population geneticist is often concerned with the rise and fall in abundance of particular protein variants within a single population. Studying these variants can often give rise to a detailed understanding of the specific elements of protein structure that contribute to improved evolutionary fitness. To accomplish this again requires methods for characterizing the similarities and differences between various protein variants (Stone and Sidow, 2005).

In all of these applications, a common thread is the need for a specific technique to visualize the relationships between “equivalent residues” among the various sequences. The notion of equivalence in sequence alignment is generally context-dependent. As mentioned earlier, in structural biology, equivalence may simply correspond to analogous positions belonging to similar protein structural motifs, whereas in evolutionary biology, equivalence may require equivalent amino acids to have shared evolutionary ancestry. Regardless of how equivalent residues are defined, a protein sequence alignment accomplishes the task of providing a bird’s-eye view of the underlying structural or evolutionary constraints characterizing a protein family in a concise, visually intuitive format.

2.1.2 Computational formulation

The first step in designing a computational algorithm for sequence alignment is to devise an appropriate scoring scheme for prioritizing between the various possible pairwise alignments of a pair of protein sequences. One of the earliest and still most popular schemes for alignment scoring is known as the “affine-gap scoring” model. In this model, a gap in a pairwise alignment is a maximal sequence of consecutive “dash” characters (-) in one of the two sequences. For example, the alignment in Figure 2.1 has three gaps: a gap of length one in the first sequence, an internal gap of length two in the second sequence, and a terminal gap of length six in the second sequence.

In the affine-gap scoring model, the score of an alignment is a sum of the “substitution scores” $w_{\text{subst}}(a_i, b_j)$ for each pair of aligned amino acid residues $a_i$ and $b_j$ from the two input sequences $a$ and $b$, plus “gap penalties” $w_{\text{gap open}} + (\ell - 1)\, w_{\text{gap extend}}$ for each gap of length $\ell$. The number of parameters in the model is fairly limited: the substitution matrix $w_{\text{subst}}$ is typically a symmetric $20 \times 20$ table with 210 free parameters, and the two parameters $w_{\text{gap open}}$ and $w_{\text{gap extend}}$ defining the affine gap penalty account for two additional free parameters. Given a specific choice of these parameters, computing an optimal global sequence alignment of two sequences can be performed efficiently using dynamic programming (Gotoh, 1982).
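To make the scoring rule concrete, the following is a minimal sketch of our own that scores a given pairwise alignment under the affine-gap model; the substitution and gap values are invented placeholders (in this maximization setting, gap parameters would typically be negative), and this is an evaluation routine, not the Gotoh optimization algorithm itself:

```python
def affine_gap_score(row_a, row_b, w_subst, w_open, w_extend):
    """Score an alignment: substitution scores for aligned residue pairs,
    plus w_open + (L - 1) * w_extend for each gap of length L."""
    score, in_gap_a, in_gap_b = 0.0, False, False
    for ca, cb in zip(row_a, row_b):
        if ca == '-':                       # gap in sequence a
            score += w_extend if in_gap_a else w_open
            in_gap_a, in_gap_b = True, False
        elif cb == '-':                     # gap in sequence b
            score += w_extend if in_gap_b else w_open
            in_gap_a, in_gap_b = False, True
        else:                               # aligned pair (a_i, b_j)
            score += w_subst.get((ca, cb), 0.0)
            in_gap_a = in_gap_b = False
    return score

# Toy example: two aligned pairs scored +2 each, one gap of length 2.
print(affine_gap_score("AA--", "AACC", {("A", "A"): 2.0}, -3.0, -1.0))  # 0.0
```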

In this thesis, we focus not on algorithms for computing optimal sequence alignments but rather address the issue of parameter estimation for sequence alignment models. Traditional parameter estimation schemes for sequence alignment models work by estimating substitution scores and gap penalty parameters separately:

1. Estimation schemes for substitution scores usually rely on some form of bootstrapping procedure in which an initial, heuristically chosen scoring matrix is used to identify groups of proteins with well-conserved regions. The conserved domains considered to be sufficiently reliable are then used to estimate a scoring matrix. Many different schemes for defining the values of this scoring matrix have been proposed in the past.

• In the BLOSUM62 amino acid substitution matrix (Henikoff and Henikoff, 1992), the substitution score for a pair of amino acids $c$ and $c'$ is given as

$$w_{\text{subst}}(c, c') = \log\left(\frac{p_{cc'}}{q_c\, q_{c'}}\right) \qquad (2.1)$$

where $p_{cc'}$ is the (normalized) frequency of $(c, c')$ amino acid equivalences in the BLOCKS 5.0 database among pairs of sequences clustered with a 62 percent identity threshold, and $q_c$ and $q_{c'}$ represent the (normalized) marginal frequencies of $c$ and $c'$ residues in the same database.

• In the well-known PAM250 (Dayhoff et al., 1978) and GONNET250 (Gonnet et al., 1992) matrices, the substitution score for a pair of amino acids $c$ and $c'$ is given as

$$w_{\text{subst}}(c, c') = \log\left(\frac{q_c \cdot \tau^{(t)}_{c \to c'}}{q_c\, q_{c'}}\right) \qquad (2.2)$$

where $q_c$ and $q_{c'}$ are defined as before, and $\tau^{(t)}_{c \to c'}$ is the conditional probability of a mutated residue $c$ being replaced by $c'$ over $t$ units of evolutionary time.

In both cases, the formula for scoring matrix entries is taken to be the logarithm of a ratio between the probability associated with a pair of characters descending from a common ancestor and the probability of the pair of characters arising independently according to a null model; as a result, these types of matrices are known as log-odds scoring matrices (a numerical sketch appears after this list). A number of other substitution matrix models have also been proposed (Muller and Vingron, 2000; Whelan and Goldman, 2001; Prlic et al., 2000); these matrices typically differ either in the collection of the initial database of conserved domains, or in the precise log-odds scoring formulation used.

2. The study of gap parameter choice has been far more sparse in the sequence alignment literature. The log-odds theoretical justifications for substitution models generally do not apply to gap scoring. For simple affine gap scoring models, the number of free parameters is only two, so exhaustive manual optimization is the most popular option in practice (Reese and Pearson, 2002; Vingron and Waterman, 1994).

Manual optimization, however, raises a number of important problems. First, manual optimization is feasible only when the number of free parameters is relatively small. Thus, hand-tuning faces severe challenges when examining more complicated scoring models than the simplest affine gap model. Second, overly aggressive manual tuning can lead to overfitting to existing alignment benchmark sets (Heringa, 2002; Edgar, 2004b). While this is less of a problem for two-parameter gap models, many of the most successful protein alignment approaches to date rely on far more sophisticated gap scoring methods which have a higher potential for overfitting.

Manual optimization, however, raises a number of important problems. First, manual optimization is feasible only when the number of free parameters is relatively small. Thus, hand-tuning faces severe challenges for scoring models more complicated than the simplest affine gap model. Second, overly aggressive manual tuning can lead to overfitting to existing alignment benchmark sets (Heringa, 2002; Edgar, 2004b). While this is less of a problem for two-parameter gap models, many of the most successful protein alignment approaches to date rely on far more sophisticated gap scoring methods, which have a higher potential for overfitting.
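As a worked illustration of the log-odds construction in (2.1), the sketch below derives substitution scores from raw pair counts; the three-letter alphabet and the count values are toy assumptions rather than BLOCKS-derived statistics, and the symmetric treatment of heterogeneous pairs is a simplification.

    import math
    from collections import defaultdict

    def log_odds_matrix(pair_counts):
        """Compute log-odds scores w(c, c') = log(p_cc' / (q_c * q_c')) from raw
        counts of aligned residue pairs, treating pairs symmetrically."""
        total = sum(pair_counts.values())
        p = {pair: n / total for pair, n in pair_counts.items()}   # joint frequencies
        q = defaultdict(float)                                     # marginal frequencies
        for (c, d), f in p.items():
            q[c] += f / 2
            q[d] += f / 2
        return {(c, d): math.log(p[(c, d)] / (q[c] * q[d])) for (c, d) in p}

    # Toy counts over a three-letter alphabet (hypothetical numbers):
    counts = {("A", "A"): 40, ("A", "L"): 10, ("L", "L"): 30,
              ("L", "V"): 15, ("V", "V"): 25, ("A", "V"): 5}
    scores = log_odds_matrix(counts)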

Recently, the use of pair hidden Markov models (pair-HMMs) as an alternative scoring model for sequence alignments has grown in popularity (Durbin et al., 1998). In a


pair-HMM, the probabilistic generation of an alignment for two sequences a and b is modeled as a stochastic process over a Markov chain whose states correspond to labelings of alignment columns as either match/mismatch, gap in sequence a, or gap in sequence b. Pair-HMMs allow a number of powerful parameter learning algorithms based on maximum likelihood and expectation-maximization. For affine-gap–like scoring models, pair-HMMs have been shown to be quite effective in practice; prior to the work presented in this thesis, for instance, we developed a protein sequence alignment program PROBCONS based on pair-HMMs which, to date, is still regarded as one of the most accurate prediction methods available (Do et al., 2005). However, the limited flexibility of pair-HMMs due to their independence assumptions often prevents the incorporation of higher-order features without careful model engineering.

2.2 RNA secondary structure prediction

Nucleic acids form another class of biological macromolecules found in living cells. Nu-

cleic acid molecules, like proteins, are long polymers. Instead of having amino acids as

their basic repeating unit, however, nucleic acids are built of repeating units called nu-

cleotides. Each nucleotide consists of three main components: a chemical base, one or

more phosphate groups, and a central sugar to which the base and phosphate group(s)

are attached. In nature, there are two main classes of nucleic acids, namely deoxyribonu-

cleic acid (DNA) and ribonucleic acid (RNA), distinguished primarily by the type of sugar

used in each nucleotide.

DNA is a nucleic acid molecule whose primary role in the cell is the long-term stor-

age of hereditary information. Structurally, DNA is a double-stranded molecule, each of

whose strands consists of nucleotides containing deoxyribose sugar. Four types of bases

are commonly found in DNA nucleotides: adenine (A), cytosine (C), guanine (G), and

thymine (T). Deoxyribose sugar and its attached phosphate group(s) form the backbone of

each DNA strand, and the two strands are held together by hydrogen bonding between the

bases of one strand with the bases of the other strand. Due to structural constraints, adenine

can only base-pair with thymine, and cytosine can only base-pair with guanine, a property

known as complementarity. As a consequence of complementary base-pairing, given the


sequence of nucleotides in one of the two strands of a DNA molecule, the sequence of

nucleotides in the other strand is completely determined.

Like DNA, RNA is also a type of nucleic acid typically involved in the storage of hereditary information, though in recent years, a number of other functional roles for RNAs in

living cells have been identified. Traditionally, RNAs have often been viewed in molecu-

lar biology as simply being intermediate messengers according to Crick (1958)’s “central

dogma of molecular biology” that genetic information contained in DNA is transcribed into

messenger RNA (mRNA) which, in turn, is translated into proteins. The discovery of the

central role played by RNA in the ribosome (Ban et al., 2000), along with other recent dis-

coveries of the widespread roles of RNA in regulatory processes (Mattick, 2004), however,

have led to the growing realization that RNAs are interesting in their own right.

Structurally, RNA nucleotides are similar to DNA nucleotides with the differences be-

ing that ribose sugar is used instead of deoxyribose sugar, and the bases of RNA are adenine

(A), cytosine (C), guanine (G), and uracil (U) (i.e., thymine is replaced by uracil). Un-

like DNA, RNA is generally found as a single-stranded molecule in vivo. Complementary

base-pairing still forms between nucleotides within a single RNA strand. Here, however,

base-pair complementarity exists between cytosine and guanine, adenine and uracil, and

also to a lesser extent between guanine and uracil; these three types of base-pairings (CG,

AU, and GU) are known as the “canonical” complementary base-pairings.

For RNAs, secondary structure refers to the pattern of complementary base-pairing that arises within an RNA molecule in the cell; these base-pairings form the scaffold on which higher-order three-dimensional folding can occur. As it is not generally easy to say a priori which specific nucleotides in an RNA molecule will base-pair, the space of possible secondary structure configurations of the nucleotides in an RNA molecule is extremely large. For more detail, we refer the interested reader to Chapter 6 of Alberts et al. (2002).

2.2.1 Problem definition

In the RNA secondary structure prediction problem, the goal is to predict the pattern of

base-pairing that will result when a specific RNA sequence is folded in the cell. Under-

standing RNA secondary structure has a variety of useful applications in bioinformatics:


1. Like proteins, RNAs have their intracellular functions largely determined by their

three-dimensional structure in vivo. Unlike proteins, however, the potential of com-

plementary base-pairing between an RNA molecule and either other RNA molecules

or DNA molecules gives rise to an even richer set of possible structure-mediated in-

teractions. Knowledge of the intramolecular base-pairings that form within an RNA

molecule can help determine which remaining subsequences are accessible for bind-

ing to other nucleic acid molecules (Wexler et al., 2007).

2. Mutations in RNA molecules also play an important role in evolutionary studies. In

many cases, however, the rate of nucleotide evolution in RNA sequences is extremely

high. In these cases, sequence alignment of RNA molecules can be very difficult as

two homologous sequences may have very little similarity with which to establish

appropriate correspondences. However, RNA structure is often conserved when se-

quence is not, suggesting that knowledge of RNA secondary structure can aid in

understanding the evolution of a fast-evolving RNA species (Gardner and Giegerich,

2004).

3. In the human genome, exonic segments of DNA, which are transcribed to messenger RNA and subsequently translated into protein, constitute approximately 2% of the total genetic material in a cell. Recently, however, a number of studies including the ENCODE project (The ENCODE Project Consortium, 2007) have demonstrated that despite the low proportion of coding sequence in the genome, a very high percentage of the genome is in fact transcribed from DNA to RNA. While the high abundance of RNA transcripts in the cell could be primarily just "transcriptional noise" (i.e., transcription of DNA segments to RNA with no particular function), some of these transcripts probably do play functional roles in the cell (Mattick, 2004). Identifying and verifying which of these transcripts are functional cannot be done using computation alone, but knowledge of RNA secondary structure can provide important clues towards suggesting likely candidates.


2.2.2 Computational formulation

In the RNA secondary structure prediction problem, one is given an input RNA sequence x and asked to predict a list y of the complementary base-pairings that will form in vivo. The list of base-pairings can be thought of as a set of ordered pairs (i, j) (where i < j) indicating the presence of hydrogen bonds between nucleotides x_i and x_j of the input RNA sequence. Each nucleotide of the input RNA sequence may participate in at most one base-pairing.

In order to understand how secondary structure prediction works, we consider a number

of key problem elements:

1. The space of candidate structures Y(x). Most algorithms for RNA secondary structure prediction focus on the case of "pseudoknot-free" structures: these are structures for which all base-pairings have a particular type of "nesting" structure such that for any two base-pairs (i, j) and (k, l), either their ranges are disjoint (i.e., the interval (i, j) does not overlap with the interval (k, l)) or one base-pairing is completely nested inside the other (e.g., k < i < j < l). While not all RNAs in nature have pseudoknot-free secondary structures, pseudoknot interactions are generally infrequent. Furthermore, even in RNA molecules containing pseudoknots, the pseudoknot interactions typically still only comprise a small fraction of the total base-pairings in the molecule (Gardner and Giegerich, 2004).

The main reasons for the pseudoknot-free restriction, however, are practical. While some algorithms exist for RNA folding with pseudoknots, these algorithms typically must either use extremely simplified scoring models (and hence obtain low accuracy) or are extremely computationally inefficient. For example, Pknots (Rivas and Eddy, 1999), an RNA secondary structure prediction program that deals with a limited class of pseudoknots, requires O(L^6) time and O(L^4) space for sequences of length L; adding further restrictions on the space of pseudoknots can reduce this to O(L^4) time and O(L^2) space (Reeder and Giegerich, 2004). In contrast, folding algorithms devised for pseudoknot-free structures are relatively efficient, with running times of


O(L^3) quite common. Although some heuristic-based methods for secondary structure prediction with pseudoknots can yield reasonable complexity, poor understanding of the factors underlying pseudoknot energetics has resulted in limited accuracy for these methods. In this thesis, we choose to focus on pseudoknot-free folding, and hence we will use Y(x) to denote the set of pseudoknot-free secondary structures y of x.

2. The Boltzmann ensemble. Given an RNA sequence x, the simplest computational view is to assume that there exists some single unique secondary structure configuration y that is the "correct" structure for x in a typical cell. In nature, however, this idea of a straightforward one-to-one mapping from sequence to structure does not hold.

A more realistic view is given by the concept of the Boltzmann distribution from statistical mechanics. In this view, each RNA secondary structure y ∈ Y(x) can be associated with a quantity \Delta E(x, y) known as the "free energy change" that represents the total energy difference associated with the formation of all bonds and interactions in y. Given a large population of RNA molecules with sequence x, the distribution over candidate structures y ∈ Y(x) in the population will be given by a temperature-dependent probability distribution known as the Boltzmann distribution:

P(y \mid x) \propto \exp \left( \frac{-\Delta E(x, y)}{RT} \right).     (2.3)

Here, T is the absolute temperature of the RNA expressed in Kelvins, and R is the universal gas constant. In words, the Boltzmann distribution provides a quantitative characterization of the relative importance of different secondary structures that an input sequence x may adopt. Secondary structures with lower energy (i.e., more negative \Delta E(x, y)) have a higher relative frequency than structures with higher energy.

Given the large size of the space of candidate structures Y(x), the Boltzmann distribution is never represented explicitly in practice, though some algorithms do exist for reasoning with the Boltzmann distribution. For example, McCaskill (1990) showed that for certain energy models, one can efficiently compute the partition function (i.e.,


the normalization coefficient for (2.3)) and certain related quantities, such as the frequency with which any two particular nucleotides i and j base-pair in the Boltzmann ensemble. Similarly, Ding and Lawrence (2003) proposed an efficient procedure for taking samples from the Boltzmann ensemble.

3. Free energy minimization. In many bioinformatics tasks, having a single secondary structure y declared as the most representative secondary structure for an input sequence x can be computationally more convenient than dealing with the entire Boltzmann ensemble of structures. The most common approach for secondary structure prediction, thus, relies on the structure in Y(x) that is theoretically the most frequent in the Boltzmann ensemble, i.e., the minimum free energy (MFE) structure.

The first computationally efficient algorithms for free energy minimization (Nussinov and Jacobson, 1980) were based on an extremely simple energetic model in which the free energy of an RNA secondary structure was assumed to decrease linearly with the number of hydrogen-bonding interactions present in the secondary structure. The various types of canonical base-pairings differ in the number of hydrogen bonds they imply: a CG pairing forms three hydrogen bonds, an AU pairing forms two hydrogen bonds, and a GU pairing forms a single hydrogen bond. Under this simplified scoring model, the minimum energy structure is the structure forming the maximum number of hydrogen bonds (see the sketch following this list).

To date, the most accurate algorithms for single-sequence RNA secondary struc-

ture prediction have largely followed this same approach, but replacing the hydrogen

bond counting rule with more sophisticated energy models. These improved energy

models rely on the decomposition of an RNA secondary structure into local sec-

ondary structure elements known as “loops”. For a pseudoknot-free RNA molecule,

there are six possible types of loops: stacking base-pair loops, hairpin loops, internal

loops, bulge loops, multi-branch loops, and external loops. In modern thermody-

namic RNA secondary structure models, the energy of an RNA secondary structure

is considered to be a summation of the local energies for each loop in the structure.

In turn, the energy of each loop depends on a number of features describing proper-

ties of the loop (e.g., nucleotide composition, loop length, loop type) (Tinoco et al.,


1971; Mathews et al., 1999).
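To ground the hydrogen-bond counting model described under free energy minimization above, the following is a minimal sketch of a Nussinov-style dynamic program that maximizes the weighted number of base-pairs in O(L^3) time; the per-pair weights (mimicking hydrogen bond counts) and the minimum hairpin loop length are illustrative assumptions.

    def nussinov_max_pairing(x, min_loop=3):
        """Maximum-weight pseudoknot-free folding by Nussinov-style DP.

        Pair weights mimic hydrogen bond counts: CG = 3, AU = 2, GU = 1.
        best[i][j] holds the optimal score for the subsequence x[i..j]; any
        pair (k, j) must enclose at least min_loop unpaired positions.
        """
        weight = {("C", "G"): 3, ("G", "C"): 3, ("A", "U"): 2,
                  ("U", "A"): 2, ("G", "U"): 1, ("U", "G"): 1}
        L = len(x)
        best = [[0] * L for _ in range(L)]
        for span in range(min_loop + 1, L):           # increasing subsequence lengths
            for i in range(L - span):
                j = i + span
                cand = best[i][j - 1]                 # position j left unpaired
                for k in range(i, j - min_loop):      # j paired with some k
                    if (x[k], x[j]) in weight:
                        left = best[i][k - 1] if k > i else 0
                        cand = max(cand, left + weight[(x[k], x[j])] + best[k + 1][j - 1])
                best[i][j] = cand
        return best[0][L - 1] if L else 0

    # Toy usage: score of the best nested folding of a short sequence.
    print(nussinov_max_pairing("GGGAAAUCC"))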

In this thesis, we focus on the problem of estimation of parameters in energetic models

of RNA structure. Traditionally, parameter estimation methods rely on a specialized pro-

cedure known as an optical melting experiment, in which a carefully prepared denatured

RNA oligomer is allowed to equilibrate in solution with its folded form. By measuring the

absorption spectrum of the mixture at different temperatures, one can determine the pro-

portion of both the denatured and native forms in the solution at these temperatures. The

resulting concentrations, then, can be used to infer the total free energy difference between

the native and denatured oligomers. Finally, by pooling total free energy data from several

optical melting experiments using different oligomers, one can then determine the energy

changes associated with specific structural features of RNAs using a linear regression anal-

ysis (Turner et al., 1988).

Thermodynamic parameters estimated via optical melting experiments have the advan-

tage that their physical interpretation is unambiguous: parameters directly represent free

energies for specific interactions within RNA molecules. However, optical melting curve

analysis relies on the use of experiments with small RNA molecules where the various sec-

ondary structures that may appear in equilibrium are known and simple to analyze. In many cases, this is insufficient; multi-branch loop stabilities for longer RNAs, for example, are

not readily estimated in optical melting tests due to the difficulty of preparing oligomers

suitable for measuring the desired parameters. Consequently, these types of parameters are

typically hand-optimized by considering the effect of different parameter choices on struc-

ture predictions of RNAs with known structure (Mathews et al., 1999). As with protein folding, this process of manual fitting is quite often an open invitation for overfitting.

2.3 RNA simultaneous alignment and folding

One of the primary drawbacks of the RNA secondary structure prediction problem, as

described in the previous section, is that even the best methods for folding RNAs from in-

dividual sequences can at best achieve limited accuracy. Current state-of-the-art techniques

predict around 70-80% of the true base-pairings that occur in an RNA structure on average.

Among the most promising directions in obtaining higher accuracy predictions has been


the idea of multiple sequence RNA secondary structure prediction. In this approach, one

is given a collection of homologous RNA sequences, which are assumed to have similar

secondary structures. One then attempts to find both a correspondence between sequence

positions in each input sequence (i.e., an alignment) and a “consensus” secondary struc-

ture shared among all of the sequences in the set. Here, the consensus secondary structure

corresponds to the subset of base-pairs that appear in the secondary structures of every

input sequence. For these types of approaches, the key intuition is that having multiple

homologs increases the amount of information available when performing the secondary

structure prediction: while multiple structures may look equally plausible when consid-

ering only a single sequence, far fewer foldings will look suitable when searching for a

secondary structure that works well for multiple sequences at the same time.

When performed manually using large collections of sequences, multiple sequence

folding is often known as “covariation analysis” and has yielded a number of great suc-

cesses in the past. For example, covariation analysis methods were able to predict 97-98%

of the base-pairs in 16S and 23S rRNAs (Gutell et al., 2002). Covariation analysis, how-

ever, requires thousands of homologous sequences for reliable identification of covarying

positions, and hence, more fine-grained computational methods must be used when only

an intermediate number of sequences are available. Specifically, Gardner and Giegerich

(2004) described three main classes of computational methods for prediction of secondary

structures from homologous sequences.

1. Consensus folding of aligned sequences. In this plan, a standard sequence align-

ment algorithm is applied to the set of input RNA sequences in order to identify

the correspondences between nucleotides in each RNA. Once these correspondences

have been defined, then one can apply an RNA folding algorithm to the aligned

sequences; for this, one can use a variant of existing RNA folding algorithms gen-

eralized to use sequence profiles rather than individual sequences. Generally, this

class of methods has the advantage of being relatively simple and straightforward to

implement, and can work well when the sequences to be aligned have relatively high

sequence identity. When sequence homology is low, however, the initial alignment

may be extremely unreliable, leading to downstream complications for the folding

procedure.


2. Structural alignment of folded RNAs. In this plan, a single sequence RNA secondary structure prediction algorithm is run on the input RNA sequences in order to come up with a prediction of their individual secondary structures. These structures are then aligned using a program that takes into account the predicted structures as well as the sequence. These types of methods are useful when secondary structures are easy to predict anyway, but they do not take advantage of multiple sequence information at the structure prediction stage. However, provided that the structures are predicted properly, the resulting alignment will likely be reliable as it takes into account RNA structure.

3. Simultaneous alignment and folding. In this plan, neither an alignment nor individual sequence structures are precomputed. Rather, the problems of folding each of the individual sequences and aligning them together are addressed simultaneously in a single joint optimization. This final class of methods combines the benefits of both of the previous approaches in that structural constraints are taken into account when performing the alignment, and simultaneously, multiple sequence information is used to make informed decisions regarding structure. As a result of tackling these problems together, however, simultaneous alignment and folding algorithms tend to be extremely slow, requiring substantial computational resources.

In this thesis, we focus specifically on the last class of algorithms: simultaneous alignment and folding. Unlike for the problems of sequence alignment and RNA secondary structure prediction, there are no standard, agreed-upon computational models for RNA simultaneous alignment and folding. However, the existing computational models can generally be categorized into a few groups:

1. Combination of affine-gap model and thermodynamic model. Methods of this

type attempt to define an objective function for scoring alignments and folds that

combines the standard affine gap model used in sequence alignment with the ther-

modynamic model used in single sequence RNA secondary structure prediction pro-

grams. In some sense, this could be considered the cleanest approach to defining

a scoring model for simultaneous alignment and folding in that it makes use of

the standard models for each individual problem. A direct implementation of this


strategy, however, leads to dynamic programming algorithms which require O(L^6) time for sequences of length O(L) (Sankoff, 1985). As a consequence, simplifications to the scoring model are often required, and existing programs differ in the types of simplifications used. Specific examples of programs from this class include FOLDALIGN (Havgaard et al., 2005) and DYNALIGN (Mathews and Turner, 2002).

2. Posterior probability scoring. Methods of this class overcome some of the limitations imposed by simplified thermodynamic models by instead defining scoring

terms based on posterior probabilities derived from probabilistic sequence alignment

models and probabilistic single sequence folding models. The prototypical example

of this type is the algorithm pmcomp of Hofacker et al. (2004) for alignment of base-

pairing probability matrices. Here, the scoring model is often derived in a heuristic

way, but the scoring terms have the advantage of incorporating at some level the

sophistication of the full thermodynamic scoring scheme without committing to spe-

cific secondary structures for each individual sequence. FOLDALIGN-M (Torarins-

son et al., 2007), LocaRNA (Will et al., 2007), and Murlet (Kiryu et al., 2007) all

rely on this strategy.

3. Conserved stem scoring. A third class of methods which simultaneously computes alignments and foldings is the class of methods based on identification of conserved stems. These methods differ significantly from the preceding two approaches in that they do not explicitly optimize an objective function over all possible alignments and foldings. Rather, these methods begin by using heuristics to identify stems, which are long conserved regions that correspond to consecutive base pairs in the RNA secondary structure. Then, they provide a scoring function for choosing an optimal set of stems to form the consensus RNA secondary structure. Examples of this class include CARNAC (Perriquet et al., 2003) and SCARNA (Tabei et al., 2006).

For all of these models for simultaneous alignment and folding, there exists remarkably

little theory regarding proper parameter choice. In practice, most developers choose a few

default settings based on simple trial-and-error, using some known sequences as a guide.

As always, the number of free parameters in these models is consequently forced to be

fairly small in order to allow manual parameter exploration.


2.4 Structured prediction

For each of the three computational biology tasks considered in this thesis, the task we wish to accomplish follows the same basic pattern: given some input object, we wish to predict some output object. Moreover, in each of our tasks, the space of inputs and outputs has a complex combinatorial structure, thus giving rise to a specialized type of machine learning problem known as "structured prediction." In this section, we provide a brief overview of some of the common notation we will use throughout this thesis for describing structured prediction models. In later chapters, we discuss specific strategies for performing structured learning, as well as describe our own novel techniques for convex optimization and preventing overfitting in structured models.

In a structured prediction problem, our goal is to learn a mapping from an input space X to an output space Y. Here, the input space X denotes the domain of possible objects that we might wish to label. These objects may be vectors in an n-dimensional Euclidean space R^n, boolean vectors of 0/1 features belonging to \{0, 1\}^n, or complex combinatorial objects. Similarly, the output space Y denotes the set of possible labels for objects in X. In some cases, the output space may depend on the specific input x ∈ X given; in these cases, we may write Y(x) to denote the output space for the given x.

For example, in protein sequence alignment, the objects from the input space X are pairs of amino acid sequences. Given a pair x ∈ X of amino acid sequences, the output space Y(x) consists of candidate alignments of those amino acid sequences. Similarly, in RNA secondary structure prediction, objects from the input space X are individual unfolded RNA sequences, and the output space Y(x) for a sequence x ∈ X is the set of possible candidate foldings of that sequence. Finally, in RNA simultaneous alignment and folding, the input space consists of pairs of RNA sequences, and the output space Y(x) contains objects which are each candidate alignments and folds of sequence pairs x ∈ X.

Given an input space X and an output space Y, we next assume the underlying concept we would like to learn can be described as a probability distribution D over X × Y, i.e., pairs of input and output objects. This distribution D encodes the statistical relationship between inputs and outputs. For example, in sequence alignment, the distribution D might describe the joint distribution over pairs of sequences x that a practitioner might care to


align in practice, and their desired alignments y.

During the learning process, the structured prediction learning algorithm will not have direct access to the distribution D itself. Instead, the learner will have access to a set T := \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\} of training examples, each independently sampled from D, known as a training set. Here, each training example (x^{(i)}, y^{(i)}) is a pair consisting of a possible input x^{(i)} to the structured prediction algorithm and its desired output y^{(i)}.

Two possible strategies for learning how to predict outputs y from inputs x include:

• Direct prediction approach: In this approach, one constructs a set H known as the hypothesis class, whose elements h ∈ H are each fixed mappings from X to Y. The learning problem then consists of identifying a hypothesis h ∈ H which is expected to give low error (which may be defined in a problem-specific manner) on new examples drawn from D, using the performance of h on the training set T as a proxy. Many supervised learning algorithms, such as max-margin estimation (discussed in Chapter 6), can be considered direct prediction methods in the sense that they attempt to directly find low-error hypotheses for predicting outputs from inputs.

• Discriminative probabilistic modeling approach: In contrast to the direct prediction approach, another strategy is to focus on the problem of estimating a statistical model relating inputs and outputs. Discriminative probabilistic models, in particular, focus on modeling the distribution P(y | x) over outputs y ∈ Y(x) given an input x ∈ X. Once an appropriate probabilistic model has been learned, specialized inference algorithms can be applied to these models in order to make predictions of y given x. A defining characteristic of probabilistic modeling approaches is their ability to quantify the uncertainty associated with predictions.

In some cases, a discriminative probabilistic modeling approach will have the advantage of

allowing us to use inference algorithms that take advantage of the probabilistic interpreta-

tion in order to achieve higher accuracy in practical applications. In other cases, using a

direct prediction approach will have the advantage of permitting more efficient inference al-

gorithms that are not available under the discriminative probabilistic modeling framework.

In this thesis, we will see examples of both of these strategies.
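As a minimal illustration of this notation, the sketch below fixes a generic interface for hypotheses h : X → Y and evaluates empirical loss on a training set T as the proxy for expected loss under D; the types and the position-wise loss here are illustrative assumptions only.

    from typing import Callable, Iterable, Tuple, TypeVar

    X = TypeVar("X")   # input objects (e.g., pairs of sequences)
    Y = TypeVar("Y")   # output objects (e.g., alignments or foldings)

    def empirical_loss(h: Callable[[X], Y],
                       train: Iterable[Tuple[X, Y]],
                       loss: Callable[[Y, Y], float]) -> float:
        """Average loss of hypothesis h over a training set sampled from D,
        used as a proxy for its expected loss on new examples."""
        examples = list(train)
        return sum(loss(h(x), y) for x, y in examples) / len(examples)

    # Toy usage: inputs are strings, outputs are per-position labels.
    train = [("ACGU", "LLRR"), ("GGC", "LLR")]
    h = lambda x: "L" * len(x)                           # trivial hypothesis
    hamming = lambda yhat, y: sum(a != b for a, b in zip(yhat, y)) / len(y)
    print(empirical_loss(h, train, hamming))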


Chapter 3

Discriminative training for protein

sequence alignment

In this chapter, we begin by considering the problem of protein sequence alignment using

discriminative probabilistic models. Specifically, we present CONTRAlign, an extensi-

ble and fully automatic framework for parameter learning and protein pairwise sequence

alignment using pair conditional random fields. When learning a substitution matrix and

gap penalties from as few as 20 example alignments, CONTRAlign achieves alignment accuracies competitive with available modern tools. As confirmed by cross-validation, CONTRAlign effectively leverages weak biological signals in sequence alignment: using CONTRAlign, we find that hydropathy-based features result in improvements of 5-6% in aligner accuracy for sequences with less than 20% identity, a signal that state-of-the-art

hand-tuned aligners are unable to exploit effectively. Furthermore, when known secondary

structure and solvent accessibility are available, such external information is naturally in-

corporated as additional features within the CONTRAlign framework, yielding additional

improvements of up to 15-16% in alignment accuracy for low-identity sequences.


3.1 Introduction

In comparative structural biology studies, analyzing or predicting protein three-dimensional

structure often begins with identifying patterns of amino acid substitution via protein se-

quence alignment. While the evolutionary information obtained from alignments can pro-

vide insights into protein structure, constructing accurate alignments may be difficult when

proteins share significant structural similarity but little sequence similarity. Indeed, for

modern alignment tools, alignment quality drops rapidly when the sequences compared

have lower than 25% identity, the "twilight zone" of protein alignment (Rost, 1999).

In recent years, most alignment methods that have claimed improvements in alignment

accuracy have done so not by proposing substantially new algorithms for alignment but

rather by incorporating additional sources of information. For instance, when structures of

some sequences are available, the 3DCoffee program (O’Sullivan et al., 2004) uses pair-

wise alignments from existing threading-based (FUGUE (Shi et al., 2001)) and structural

(SAP (Taylor and Orengo, 1989) and LSQman (Kabsch, 1978)) alignment tools to guide

sequence alignment construction. When homologous sequences are available and compu-

tational expense is of less concern, the PRALINEPSI program (Simossis et al., 2005) uses

PSI-BLAST–derived (Altschul et al., 1997) sequence profiles to augment the amount of

evolutionary information available to the aligner. The SPEM program (Zhou and Zhou,

2005) takes the additional step of heuristically incorporating PSIPRED (Jones, 1999) pre-

dictions of protein secondary structure, a strategy also adopted in the latest version of

PRALINEPSI (Simossis and Heringa, 2005).

As these programs demonstrate, incorporating additional information can often yield

considerable benefits to alignment quality. However, choosing parameters for more com-

plex models can be difficult. In traditional dynamic-programming–based alignment pro-

grams, log-odds–based substitution matrices are estimated from large external databases

of aligned protein blocks (Henikoff and Henikoff, 1992), and gap parameters are typi-

cally “hand chosen” to maximize performance on benchmark tests (Vingron and Waterman,

1994). When dealing with more expressive models, however, the high-dimensionality of

the parameter space hinders such manual procedures. From the perspective of numeri-

cal optimization, the non-convexity of aligner performance as a function of parameters

Page 47: DISCRIMINATIVE STRUCTURED MODELS FOR BIOLOGICAL …ai.stanford.edu/~chuongdo/papers/thesis_2_sided.pdf · DISCRIMINATIVE STRUCTURED MODELS FOR BIOLOGICAL ... Vasco Chatalbashev, Adam

3.1. INTRODUCTION 29

makes hand-tuning difficult for alignment algorithms that rely on complicated ad hoc scor-

ing schemes.

Furthermore, optimizing benchmark performance often leads to overfitting, a situation

in which the selected parameters are nearly optimal for training benchmark alignments but

work poorly on new test data. To combat overfitting, many machine learning studies make

use of holdout cross-validation, a technique in which an algorithm is trained and tested

on independent data sets in order to estimate the ability of the method to generalize to

new situations (Kohavi, 1995). Properly conducted alignment cross-validation studies are

extremely rare in the literature. In the past, a typical defense for benchmark tuning was that

aligners with few adjustable parameters are less susceptible to overfitting (Raghava et al.,

2003); such reasoning, however, is less applicable to the complicated procedures of some

modern aligners.

The reality of the dangers of overfitting in sequence alignment is not simply theoretical; in practice, their effects can be seen even with state-of-the-art methods. Edgar (2004b) provided a comparison of several modern protein alignment programs on two different large alignment benchmark sets. Included in this comparison was CLUSTALW (Thompson et al., 1994), arguably the most popular protein sequence alignment program in use in the biological community. In these tests, CLUSTALW achieved a high rank on the BAliBASE alignment benchmark set (Thompson et al., 1999) (which was also developed by the authors of CLUSTALW), but performed poorly compared to nearly all other methods on the PREFAB benchmark set (Edgar, 2004c), leading the author to suggest that CLUSTALW, "which incorporates several heuristics and hence a relatively large number of parameters, may be over-tuned to BAliBASE" (Edgar, 2004b).

In this chapter, we present CONTRAlign, an extensible and fully automatic framework for parameter selection and protein pairwise sequence alignment based on a probabilistic model known as a pair conditional random field (pair-CRF) (Lafferty et al., 2001; Sha and Pereira, 2003). In the CONTRAlign methodology, the user first defines an appropriate model topology for pairwise alignment. Unlike for ad hoc algorithms in which model complexity (and hence risk of overfitting) corresponds roughly with the number of free parameters in the model, the effective complexity of a CONTRAlign pair-CRF–based model


Figure 3.1: Traditional sequence alignment model. (a) A simple three-state HMM for sequence alignment with match state M and insert states Ix and Iy. (b) An example sequence alignment, a, of sequences x = GFAG and y = GYG, shown as GFAG over GY-G.

is controlled by a set of regularization parameters, allowing the user to adjust the trade-

off between model expressivity and the risk of overfitting. Given a set of gold standard

partially labeled alignments, CONTRAlign uses gradient-based optimization and holdout

cross validation to automatically determine regularization constants and a set of alignment

parameters with good expected performance for future alignment problems.

We show that even under stringent cross-validation conditions, CONTRAlign can learn

both substitution and gap parameters that generalize well to previously unseen sequences

using as few as 20 training alignments. Augmenting the aligner with sequence-based and

external features is seamless in the CONTRAlign framework, yielding large accuracy im-

provements over modern tools for “twilight zone” sequence sets.

3.2 Methods

In this section, we first review the standard three-state pair hidden Markov model (pair-

HMM) formulation of the sequence alignment problem. We also describe the generaliza-

tion of the standard pair-HMM to a pair conditional random field (pair-CRF), the use of

regularization for trading off between the risk of overfitting and expressivity in a pair-CRF,

and a standard optimization procedure for learning pair-CRF parameters from data. We

then discuss a variety of model topologies and features possible within the CONTRAlign

pair-CRF framework.


3.2.1 Pair-HMMs for sequence alignment

A hidden Markov model (HMM) is a type of probabilistic state machine commonly used

in a variety of computational biology applications. As defined in Rabiner (1989), HMMs

consist of five main components: (1) a set of states, (2) a set of possible emissions associ-

ated with each state, (3) a transition probability table specifying the probability of transi-

tioning from any one state to any other state, (4) an emission probability table for each state

specifying the probability of each possible emission from that state, and (5) an initial state

probability distribution. Given all of the above, the HMM describes a probabilistic model

in which an initial state is selected from the initial state distribution, subsequent states are

chosen stochastically based on the state-to-state transition probabilities, and emissions are

generated stochastically upon entry into each state. The main point of an HMM is that

in situations where the sequence of emissions is observed and the underlying sequence of

states is hidden, knowledge of the probabilistic model allows one to reconstruct the most

likely sequence of hidden states corresponding to the observed emissions.

A pair hidden Markov model (pair-HMM) is a natural extension of hidden Markov mod-

els useful for modeling situations in which more than one emission sequence is observed at

any given state (Durbin et al., 1998). Here, we focus on a special type of pair-HMM used

in pairwise sequence alignment problems. In this pair-HMM, two sequences of observed emissions are generated. Unlike a regular HMM where a single emission is generated upon

entry to each state, in this pair-HMM, each state may either generate emissions for both

of the observation sequences simultaneously, or may generate an emission for one of the

two observation sequences while not producing any emissions for the other observation

sequence.

More specifically, consider the state diagram shown in Figure 3.1 (a). In the standard pairwise sequence alignment model, an alignment corresponds to a sequence of independent events describing a path through the state diagram. First, an initial state s is chosen from \{M, I_x, I_y\} with probability \pi_s. Then, the alignment process alternates between emitting a pair of aligned residues (c, d) upon entry into some state s with probability \delta_s^{(c,d)} (or a single unaligned residue c with probability \delta_s^{(c,-)} or \delta_s^{(-,c)}) and transitioning from some state s to another state t with probability \tau_{s \to t}.


Since each event is independent, the probability of the alignment decomposes as a product of several terms. For instance, the joint probability of generating an alignment a and sequences x and y shown in Figure 3.1 (b) is

P(a, x, y) = \pi_M \cdot \delta_M^{(G,G)} \cdot \tau_{M \to M} \cdot \delta_M^{(F,Y)} \cdot \tau_{M \to I_x} \cdot \delta_{I_x}^{(A,-)} \cdot \tau_{I_x \to M} \cdot \delta_M^{(G,G)}.     (3.1)

Alternatively, we may rewrite (3.1) as P(a, x, y; w) = \exp(w^T f(a, x, y)), where w is a parameter vector and f(a, x, y) is a vector of "feature counts" indicating the number of times each parameter appears in the product on the right-hand side. More explicitly, if w = [\log \pi_M, \log \delta_M^{(G,G)}, \log \tau_{M \to M}, \cdots]^T, then the corresponding feature count vector is given by

f(a, x, y) = \begin{bmatrix} \text{\# of times alignment starts in state } M \\ \text{\# of times alignment generates } (G,G) \text{ in state } M \\ \text{\# of times alignment follows } M \to M \text{ transition} \\ \vdots \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 1 \\ \vdots \end{bmatrix}.     (3.2)
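As a sanity check on (3.1) and (3.2), the following sketch tabulates feature counts from an alignment path and evaluates the joint log-probability as the dot product w^T f(a, x, y); the state/emission encoding and the numerical values in w are illustrative assumptions.

    from collections import Counter
    import math

    def feature_counts(path):
        """Tabulate feature counts f(a, x, y) from an alignment path, given as a
        list of (state, emission) pairs as in Figure 3.1 (b)."""
        f = Counter()
        f[("start", path[0][0])] += 1                   # initial state feature
        for (s, _), (t, _) in zip(path, path[1:]):
            f[("trans", s, t)] += 1                     # transition features
        for s, emit in path:
            f[("emit", s, emit)] += 1                   # emission features
        return f

    def joint_log_prob(path, w):
        """log P(a, x, y; w) = w^T f(a, x, y) when w holds log-probabilities."""
        return sum(w[feat] * n for feat, n in feature_counts(path).items())

    # Toy usage mirroring Figure 3.1 (b); the numbers in w are made up.
    path = [("M", ("G", "G")), ("M", ("F", "Y")), ("Ix", ("A", "-")), ("M", ("G", "G"))]
    w = Counter({("start", "M"): math.log(0.5), ("trans", "M", "M"): math.log(0.6),
                 ("trans", "M", "Ix"): math.log(0.2), ("trans", "Ix", "M"): math.log(0.5),
                 ("emit", "M", ("G", "G")): math.log(0.05),
                 ("emit", "M", ("F", "Y")): math.log(0.01),
                 ("emit", "Ix", ("A", "-")): math.log(0.07)})
    print(joint_log_prob(path, w))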

Given two sequences x and y, the Viterbi algorithm computes an alignment a that maximizes P(a | x, y; w) in O(|x| · |y|) time. For the model shown in Figure 3.1, the Viterbi algorithm is equivalent to the Needleman-Wunsch algorithm (Altschul, 1991). In this chapter, we use an alternative parsing algorithm for finding alignments with the maximum expected number of correct matches; for details, see (Durbin et al., 1998; Holmes and Durbin, 1998; Do et al., 2005).

Given a collection of aligned training examples D = \{(a^{(i)}, x^{(i)}, y^{(i)})\}_{i=1}^m, the standard parameter estimation procedure (known as generative training in the machine learning literature (Ng and Jordan, 2002)) is to maximize the joint log-likelihood \ell(w : D) := \sum_{i=1}^m \log P(a^{(i)}, x^{(i)}, y^{(i)}; w) of the data and alignments, subject to constraints ensuring that the original parameters (\pi_M, \delta_M^{(G,G)}, etc.) are nonnegative and normalize. When training with fully-specified alignments, the optimization problem not only is convex but also has a closed-form solution.

In some benchmark alignment databases, such as BAliBASE (Thompson et al., 1999)

and PREFAB (Edgar, 2004c), reference alignments are partially ambiguous: certain columns


are marked as reliable (known as core blocks) while the alignment of other positions may be left unspecified. In these cases, the training set D = \{(a^{(i)}, x^{(i)}, y^{(i)})\}_{i=1}^m thus consists of partial alignments a^{(i)}. Letting A^{(i)} denote the set of alignments consistent with the known reliable columns of a^{(i)}, the joint log-likelihood becomes \ell(w : D) := \sum_{i=1}^m \log \sum_{a \in A^{(i)}} P(a, x^{(i)}, y^{(i)}; w). Despite the nonconvexity of the new optimization problem, most numerical optimization approaches, such as EM or gradient ascent, work well in practice (Durbin et al., 1998).¹

3.2.2 From pair-HMMs to pair-CRFs

In the pair-HMM formalism, the constraints on the parameters w to represent initial, transition, or emission log probabilities allowed us to interpret a pair-HMM as defining the quantity P(a, x, y; w), the probability of stochastically generating an alignment. Unlike pair-HMMs, pair-CRFs do not define this joint probability but instead directly model the conditional probability,

P(a \mid x, y; w) = \frac{P(a, x, y; w)}{\sum_{a' \in A} P(a', x, y; w)} = \frac{\exp(w^T f(a, x, y))}{\sum_{a' \in A} \exp(w^T f(a', x, y))},     (3.3)

where A denotes the set of all possible alignments of x and y. As before, the parameter vector w completely parameterizes the pair-CRF, but this time, we impose no constraints on the entries of w. Here, a parameter entry w_i does not correspond to the log probability of an event (as in a pair-HMM) but rather is a real-valued feature weight that either raises or lowers the "probability mass" of a relative to other alignments in A. Similar models have been proposed for string edit distance in natural language processing applications (McCallum et al., 2005; Bilenko and Mooney, 2005).
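To make (3.3) concrete, the sketch below computes the conditional probability of one alignment by normalizing exp(w^T f) over an explicitly enumerated candidate set; in practice, the sum over A is computed by dynamic programming rather than enumeration, and the candidate feature vectors here are toy assumptions.

    import math

    def crf_conditional_prob(f_a, candidates, w):
        """P(a | x, y; w) = exp(w.f(a)) / sum_a' exp(w.f(a')) over a finite
        candidate set, using log-sum-exp for numerical stability."""
        def score(f):
            return sum(w.get(k, 0.0) * v for k, v in f.items())
        scores = [score(f) for f in candidates]
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        return math.exp(score(f_a) - log_z)

    # Toy usage: three candidate alignments described by two features each.
    w = {"match": 1.2, "gap": -0.7}                   # unconstrained real weights
    candidates = [{"match": 3, "gap": 0}, {"match": 2, "gap": 2}, {"match": 1, "gap": 4}]
    print(crf_conditional_prob(candidates[0], candidates, w))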

Clearly, pair-CRFs are at least as expressive as their pair-HMM counterparts, as any suitable parameter vector w for an alignment pair-HMM is a valid parameter vector for its corresponding alignment pair-CRF. Furthermore, while pair-CRFs assume a particular factorization of the conditional probability distribution P(a | x, y; w), they make far weaker independence assumptions regarding feature counts f(a, x, y). Thus, these models

¹ In practice, the only step needed to ensure good convergence was to break symmetries in the model by initializing parameters to small random values.


are amenable to using complex feature sets that may be difficult to incorporate within a

generative pair-HMM.

Training a pair-CRF involves maximizing the conditional log-likelihood of the data (known as discriminative or conditional training (Ng and Jordan, 2002)). Unlike generative training, discriminative training directly optimizes predictive ability while ignoring P(x, y), the model used to generate the input sequences. When a pair-CRF places undue importance on unreliable features (i.e., the magnitude of some parameter w_j is large), overfitting may occur. To prevent this, we place a Gaussian prior, P(w) \propto \exp(-\sum_j C_j w_j^2), on the parameters w. Thus, we maximize \ell(w : D) := \sum_{i=1}^m \log P(a^{(i)} \mid x^{(i)}, y^{(i)}; w) + \log P(w), or equivalently,

\sum_{i=1}^m \left( w^T f(a^{(i)}, x^{(i)}, y^{(i)}) - \log \sum_{a' \in A} \exp(w^T f(a', x^{(i)}, y^{(i)})) \right) - \sum_j C_j w_j^2.     (3.4)

The final term in (3.4) encourages parameters to be “small” unless increased size yields

a sufficient increase in likelihood. This technique, known as regularization, leads to im-

proved generalization both in theory and in practice (Vapnik, 1998).

Parameter learning for pair-CRFs using a fixed set of regularization parameters C = \{C_j\} is straightforward. The objective function in (3.4) is convex for fully-specified alignments and hence a global maximum of the regularized likelihood can be found using any efficient gradient-based optimization algorithm (such as conjugate gradient, or L-BFGS (Nocedal and Wright, 1999)). The gradient \nabla_w \ell(w : D) is

\sum_{i=1}^m \left( f(a^{(i)}, x^{(i)}, y^{(i)}) - E_{a \sim P(A \mid x^{(i)}, y^{(i)})} f(a, x^{(i)}, y^{(i)}) \right) - 2 C \circ w,     (3.5)

where C \circ w denotes the component-wise product of the vectors C and w. Disregarding regularization, we see that the partial derivative of the log-likelihood with respect to each parameter w_j is zero precisely when the observed and expected counts for the corresponding feature f_j (taken with respect to the distribution over unobserved alignments) match. For fully-specified alignments a^{(i)}, the former term in the parentheses can be directly tabulated from the alignment a^{(i)}, and the latter term can be computed using


the forward-backward algorithm. The partially-specified alignment case follows similarly (Durbin et al., 1998).

Figure 3.2: Model variants. (a) CONTRAlignLOCAL topology with N/C-terminal flanking inserters. (b) CONTRAlignDOUBLE-AFFINE topology with two insert state pairs.
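As a minimal sketch of how (3.4) and (3.5) drive learning, the following performs one gradient-ascent step from precomputed observed and expected feature counts; a real implementation would obtain the expected counts from the forward-backward algorithm and use L-BFGS rather than a fixed step size.

    def gradient_step(w, observed, expected, C, lr=0.1):
        """One gradient-ascent step on the regularized conditional log-likelihood.

        observed[j] and expected[j] are the empirical and model-expected counts
        of feature j summed over training examples; C[j] is its regularization
        constant. Following (3.5): grad_j = obs_j - exp_j - 2 * C_j * w_j.
        """
        return {j: w.get(j, 0.0) + lr * (observed.get(j, 0.0)
                                         - expected.get(j, 0.0)
                                         - 2.0 * C.get(j, 0.0) * w.get(j, 0.0))
                for j in set(w) | set(observed) | set(expected)}

    # Toy usage with made-up counts for two features:
    w = {"match": 0.0, "gap": 0.0}
    observed = {"match": 5.0, "gap": 1.0}
    expected = {"match": 3.5, "gap": 2.0}
    C = {"match": 1.0, "gap": 1.0}
    w = gradient_step(w, observed, expected, C)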

3.2.3 Pairwise alignments with CONTRAlign

In the previous subsections, we described the standard pair-HMM model for sequence

alignment and its natural extension to pair-CRFs. In this subsection, we present CON-

TRAlign, a feature-rich alignment framework that leverages the power of pair-CRFs to

support large non-independent feature sets while controlling model complexity via regular-

ization.

Choice of model topology.

As a baseline, we used the standard three-state pair-HMM model (CONTRAlignBASIC)

shown in Figure 3.1 (a). We experimented with a variety of other model topologies as

well, including:

• CONTRAlignLOCAL: a model with flanking N-terminal and C-terminal insert states

to allow for local homology detection (see Figure 3.2 (a)), and

• CONTRAlignDOUBLE-AFFINE, a model with an extra pair of gap states in order to model

both long and short insertions (see Figure 3.2 (b)).


Hydropathy-based gap context features.

The CLUSTALW protein multiple alignment program incorporates a large number of heuris-

tics designed to improve performance on the BAliBASE benchmark reference (Thompson

et al., 1994). One heuristic applicable to pairwise alignment is the reduction of gap penal-

ties in runs of 5 or more hydrophilic residues. Typically, the core regions of globular

proteins, where insertions and deletions are less likely, consist of hydrophobic residues.

Reducing gap penalties in hydrophilic regions encourages the aligner to place gaps in re-

gions less likely to be part of the hydrophobic core; similar heuristics are incorporated in

the MUSCLE (Edgar, 2004c) alignment program as well.

In CONTRAlign, we tested a variant of this idea (CONTRAlignHYDROPATHY) by incorporating hydropathy-based context features for insertion scoring. Specifically, for each insertion open, insertion continue, or insertion close event in sequence x, we defined the number of hydrophilic residues in a window of length 6 in sequence y to be the hydrophilic count context of that event (and vice versa for insertions in sequence y). We added a total of fourteen features to the model, seven indicating whether an insertion open or close occurred with a hydrophilicity context of 0, 1, . . . , or 6, and similarly for insertion continues.
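To illustrate the hydrophilic count context, here is a minimal sketch computing the context value for a single insertion event; the window length of 6 follows the text, while the hydrophilic residue set and the exact placement of the window are illustrative assumptions.

    # Assumed hydrophilic residue set for illustration; CLUSTALW-style heuristics
    # use a similar list, but the exact choice here is a placeholder.
    HYDROPHILIC = set("DEGKNQPRS")

    def hydrophilic_count(seq, pos, window=6):
        """Number of hydrophilic residues in a window of the given length in the
        opposite sequence around an insertion event at position pos; the value
        (0..window) indexes one context feature per event type."""
        start = max(0, pos - window // 2)
        return sum(1 for c in seq[start:start + window] if c in HYDROPHILIC)

    # Toy usage: context for an insertion in x occurring opposite position 4 of y.
    y = "MKDESALNNG"
    print(hydrophilic_count(y, 4))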

Incorporating external information.

To test the ability of CONTRAlign to incorporate external information, we also experimented with giving CONTRAlign data on secondary structure (CONTRAlignDSSP) and solvent accessibility (CONTRAlignACCESSIBILITY) of the sequences being aligned, as extracted from the PDBFinderII database (Krieger et al., 2004). In particular, DSSP annotations of sequences from PDBFinderII were converted to a three-letter code using the grouping employed in the EVA automatic structure prediction benchmark server, \{G, H, I\}, \{E, B\}, \{T, S, C\} (Eyrich et al., 2001). Similarly, annotations of positional amino acid solvent accessibilities were converted from the PDBFinderII 0-9 scale using a coarser grouping of the 0-9 accessibility levels. To assess the value of using predicted external tracks of information, we also tested variants using PSIPRED single (CONTRAlignPSIPRED-SINGLE) and multiple (CONTRAlignPSIPRED-MULTI) sequence secondary structure predictions.

For each annotation track, we added emission features to the match and insertion states


of the basic model that would allow them to simultaneously emit both sequence and an-

notation. A similar method based on “two-track HMMs” was previously used to improve

the quality of fold recognition via predicted local structure (Karchin et al., 2003). In that

work, the authors constructed an HMM that simultaneously emitted two observation sig-

nals and relied on the assumed independence of the two character emission tracks during

parameter learning. To compensate for the violated independence assumption, the authors

added heuristic weights to each emission; thus, the "probability" of a two-track emission was given by P(o_1 \mid s)^{w_1} P(o_2 \mid s)^{w_2}, where the weights w_1 and w_2 were selected manually.

In contrast, such correction factors are not needed in the pair-CRF model presented here,

as pair-CRF learning makes no assumptions regarding the independence of the emission

features of each state. Thus, pair-CRFs provide a consistent framework for incorporat-

ing multiple sources of evidence without the need for artificial compensation as present in

multi-track generalizations of HMMs.

3.3 Results

In the protein sequence alignment literature, benchmark databases of reference alignments

have emerged as the standard metric for evaluating aligner performance. First, the aligner-

to-be-tested performs alignments for all sequence sets in the database. Then, accuracy is

measured with respect to known reliable columns of a hand-curated reference alignment.

While benchmark tests have been an invaluable asset to the development of alignment

algorithms, statistics in the literature often misrepresent the significance of accuracy dif-

ferences between aligners. Some reference databases, such as BAliBASE and PREFAB,

contain multiple copies of a single sequence in several different alignments. Ignoring the

non-independence of these test cases artificially lowers p-values when using rank tests to

compare the performance of two aligners. Even more dangerous is the common practice

of “tuning” parameters to improve performance on individual benchmark datasets. Due to

the absence of (or improper use of) cross-validation in most studies in the literature, good benchmark results may not indicate good alignment accuracy for novel proteins.


With this in mind, we designed a series of carefully controlled cross-validation exper-

iments to assess the contribution of the different model topologies/features toward CON-

TRAlign alignment accuracy, and the ability of the learned alignment model to generalize

across different benchmark reference databases.

3.3.1 Cross-validation methodology

We extracted alignments from four standard benchmarking databases:

1. BAliBASE 3.0 (Thompson et al., 2005), a collection of 218 manually refined refer-

ence multiple alignments based on 3D structural superpositions;

2. SABmark 1.65 (Walle et al., 2005), a collection of 236 very low to low identity

(“Twilight Zone”) and 462 low to intermediate identity (“Superfamilies”) sets of all-

pairs pairwise consensus structural alignments derived from the SCOP (Murzin et al.,

1995) classification;

3. PREFAB 4.0 (beta) (Edgar, 2004c), a collection of 1932 pairwise structural align-

ments supplemented by PSI-BLAST homologs from the NCBI nonredundant protein

sequence database (D. et al., 2003); and

4. HOMSTRAD (September 1, 2005 release), a curated database of 1032 structure-

based multiple alignments for homologous families (Mizuguchi et al., 1998).

We projected the BAliBASE and HOMSTRAD reference multiple alignments into all-pairs

pairwise structural alignments. Then, for each multiple sequence set from BAliBASE,

HOMSTRAD, and SABmark, we computed percent identity for all pairwise alignments

and retained the alignment with median identity.

To construct independent training and testing sets for cross-validation, we relied on the

CATH protein structure classification hierarchy (Orengo et al., 1997); a similar protocol

was followed in benchmarking the PSIPRED protein secondary structure prediction pro-

gram. Specifically, we considered a pair of alignmentsA andB independent if no two

proteinsx ∈ A andy ∈ B share the same CATH classification at the “homology” level.

Using this criterion, we used a greedy procedure to select alignments for training and test-

ing; at each step in the alignment selection process, we selected an alignment, which was

Page 57: DISCRIMINATIVE STRUCTURED MODELS FOR BIOLOGICAL …ai.stanford.edu/~chuongdo/papers/thesis_2_sided.pdf · DISCRIMINATIVE STRUCTURED MODELS FOR BIOLOGICAL ... Vasco Chatalbashev, Adam

3.3. RESULTS 39

independent of all alignments previously selected, from the database with the fewest rep-

resentatives. The resulting selected pairwise alignmentsconsisted of 38 alignments from

BAliBASE, 123 from SABmark, 139 from PREFAB, and 187 from HOMSTRAD.
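To make the selection protocol concrete, the following sketch (in Python) implements the greedy procedure just described. The names databases and independent are illustrative assumptions of this sketch, not the actual experimental scripts; independent(a, b) stands in for the CATH homology-level comparison.

    def greedy_select(databases, independent):
        # `databases` maps a database name to its list of candidate pairwise
        # alignments; `independent(a, b)` returns True when no protein in a
        # shares a CATH classification with a protein in b at the homology level.
        selected, counts = [], {db: 0 for db in databases}
        while True:
            added = False
            # Prefer the database with the fewest representatives so far.
            for db in sorted(databases, key=lambda d: counts[d]):
                for aln in databases[db]:
                    if all(independent(aln, b) for _, b in selected):
                        selected.append((db, aln))
                        databases[db].remove(aln)
                        counts[db] += 1
                        added = True
                        break
                if added:
                    break
            if not added:
                return [aln for _, aln in selected]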

For parameter learning in CONTRAlign, we considered all matched positions (in core blocks where applicable) to be labeled and treated gapped or unannotated regions as missing data. To select regularization constants in a manner strictly independent of the testing set, we used a staged holdout cross-validation procedure on the training data only. Specifically, for a given training collection D, we randomly chose 20% of the alignments for a holdout set and performed training only on the remaining 80%. We manually divided model features into a small number of regularization groups (usually two or three) and constrained the regularization constants for features in each group to be the same. Starting from a model with only transition features, we introduced new features, one group at a time. In each iteration, we used a golden section search and standard L-BFGS optimization to optimize holdout set conditional log-likelihood over possible settings of the regularization parameter for the newly introduced group. Once all features were introduced, we retrained the model on all of the training data using the chosen regularization constants.
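The staged procedure can be sketched as follows. Here fit and holdout_loglik are hypothetical callbacks standing in for L-BFGS training and holdout evaluation, and the search interval for each regularization constant (10^-3 to 10^3, on a log scale) is an illustrative assumption rather than the setting used in our experiments.

    import math

    def golden_section_max(f, lo, hi, tol=0.05):
        # Generic golden section search for the maximum of a unimodal f on [lo, hi].
        phi = (math.sqrt(5.0) - 1.0) / 2.0
        a, b = lo, hi
        while b - a > tol:
            c, d = b - phi * (b - a), a + phi * (b - a)
            if f(c) >= f(d):
                b = d
            else:
                a = c
        return (a + b) / 2.0

    def staged_regularization(groups, train, holdout, fit, holdout_loglik):
        # Introduce feature groups one at a time; for each new group, tune its
        # shared regularization constant to maximize holdout conditional
        # log-likelihood, keeping previously chosen constants fixed.
        reg, active = {}, []
        for g in groups:
            active.append(g)
            def objective(log10_c, g=g):
                trial = dict(reg, **{g: 10.0 ** log10_c})
                model = fit(train, active, trial)   # hypothetical L-BFGS training
                return holdout_loglik(model, holdout)
            reg[g] = 10.0 ** golden_section_max(objective, -3.0, 3.0)
        return reg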

We measured alignment accuracy using the Q score (Edgar, 2004c), the proportion of true alignment character matches correctly predicted. For pairwise alignments, the Q score is equivalent to both the sum-of-pairs (SP) and total column (TC) scores commonly used for measuring multiple alignment accuracy (Thompson et al., 1999).
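As an illustration, the Q score for a pairwise alignment can be computed as below; representing an alignment as the set of (i, j) index pairs of matched residues is an assumption of this sketch.

    def q_score(predicted, reference):
        # Fraction of residue pairs matched in the reference alignment that
        # are also matched in the predicted alignment.
        return len(predicted & reference) / len(reference) if reference else 0.0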

3.3.2 Comparison of model topologies and feature sets

In our first set of cross-validation experiments, we selected each of the reference databases

in turn as the testing set, and used alignments pooled from the other three databases as the

training set.2 Table 3.1 compares the various models described in Section 3.2.3 as eval-

uated on each of the four databases. As shown in the table, changes in model topology

(also possible in pair-HMM aligners) give small improvements in overall accuracy. As ex-

pected, the major improvements come with the incorporation of features based on external

2 For most reference databases, with the notable exception of SABmark 1.65, alignment accuracies are roughly consistent. This difference is likely explained by the substantially higher proportion of low-identity alignments in SABmark, though we did not conduct a careful investigation of this phenomenon.


CONTRAlign variant     BAliBASE (38)   SABmark (123)   PREFAB (139)   HOMSTRAD (187)   Overall (487)   p-value
BASIC                  78.93           42.04           74.40          82.61            69.73           n/a
LOCAL                  79.10           42.06           74.46          83.34            70.05           7.8 × 10⁻²
DOUBLE-AFFINE          78.85           44.50           75.40          84.02            71.17           0.00040
HYDROPATHY             82.07           45.61           76.75          84.78            72.38           1.5 × 10⁻⁹
ACCESSIBILITY          80.80           52.09           79.47          86.84            75.49           3.1 × 10⁻²⁷
PSIPRED-SINGLE         77.97           44.94           74.97          82.40            70.47           2.9 × 10⁻¹
PSIPRED-MULTI          83.13           51.91           79.25          85.35            74.99           2.3 × 10⁻²¹
DSSP                   83.01           57.50           81.89          86.88            77.73           1.2 × 10⁻³³
COMBINED               88.46           61.85           83.66          88.68            80.45           1.2 × 10⁻⁴⁴

Table 3.1: Comparison of CONTRAlign variants. We counted the number of times each variant outperformed or was outperformed by the basic model, and assigned p-values using a simple yet robust statistical sign test to check for deviations from a symmetric distribution in which either aligner is equally likely to do better. Accuracy improvements relative to the basic model are significant in every case with the exceptions of the local and PSIPRED single sequence prediction models.

information, such as DSSP secondary structure or solvent accessibility annotations.

Interestingly, accounting for some sequence features present in the input sequence alone

(in particular, hydropathy) gives a larger increase in performance than any change in model

topology. We return to this observation in Section 3.3.3. Also, in contrast to the massive

performance gains when using real DSSP secondary structure annotations, our numbers

suggest that predicted PSIPRED single sequence secondary structures are not informative

for alignment. PSIPRED multiple sequence predictions, however, are substantially more

accurate and give strong improvements in aligner performance.
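The p-values in Tables 3.1 and 3.2 come from the two-sided sign test described in the caption of Table 3.1; a minimal sketch of such a test (not the exact script used to produce the tables) is:

    from math import comb

    def sign_test_p(wins, losses):
        # Under the null hypothesis that either aligner is equally likely to do
        # better on any given test case, the number of wins is Binomial(n, 1/2).
        # Ties are discarded before calling this function.
        n = wins + losses
        k = min(wins, losses)
        tail = sum(comb(n, i) for i in range(k + 1)) / 2.0 ** n
        return min(1.0, 2.0 * tail)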

Based on these observations, we constructed the CONTRAlign-COMBINED model, which incorporated the four most informative components: double-affine insertion scoring, hydropathy, DSSP secondary structure, and solvent accessibility. To do this, we built an alignment model incorporating the latter two types of features as separate "tracks" of informa-

tion. A variety of other encodings are possible that allow for more explicit dependencies

between secondary structure and solvent accessibility, but we did not explore this further.

For the model described, resulting alignments are on average 10% more accurate than those

using the basic model alone.


Method                      BAliBASE (38)   SABmark (123)   PREFAB (139)   HOMSTRAD (187)   Overall (487)   p-value
MAFFT (G-INS-i)             74.56           41.25           71.37          80.53            67.53           9.8 × 10⁻²²
MAFFT (L-INS-i)             78.08           39.58           71.95          82.01            68.12           7.1 × 10⁻¹⁷
T-Coffee                    74.73           42.84           72.99          82.40            69.12           1.2 × 10⁻¹¹
CLUSTALW                    79.43           41.36           73.29          81.62            68.90           1.5 × 10⁻⁵
CLUSTALW (-nohgap)          79.65           40.92           73.51          81.35            68.77           6.2 × 10⁻⁷
MUSCLE                      77.42           41.72           72.67          82.63            69.05           2.1 × 10⁻¹³
MUSCLE (-hydrofactor 0.0)   74.78           37.78           69.19          77.83            65.01           7.1 × 10⁻³²
CONTRAlign (Bali, no reg)   92.57           39.33           68.77          80.45            67.68           5.7 × 10⁻¹⁴
CONTRAlign (Bali, reg)      84.75           39.08           73.45          82.21            69.01           1.2 × 10⁻⁷
CONTRAlign (All, reg)       82.42           47.39           76.74          85.22            73.03           0.00021
PROBCONS (Bali)             78.62           42.53           73.75          83.64            70.04           4.8 × 10⁻⁸
PROBCONS (cv)               78.48           43.31           71.78          81.36            68.79           9.7 × 10⁻¹¹
CONTRAlign-HYDROPATHY       82.07           45.61           76.75          84.78            72.38           n/a

Table 3.2: Comparison of modern alignment methods. p-values indicate the significance of the performance difference between each method and CONTRAlign-HYDROPATHY based on a sign test, as in Table 3.1.

3.3.3 Comparison to modern sequence alignment tools

Next, we compared the CONTRAlign-HYDROPATHY model to a variety of modern sequence alignment methods, including MAFFT 5.732 (both L-INS-i and G-INS-i) (Katoh et al., 2002, 2005), CLUSTALW 1.83 (Thompson et al., 1994), MUSCLE 3.6 (Edgar, 2004c), T-Coffee 2.66 (Notredame et al., 2000), and PROBCONS 1.10 (Do et al., 2005).3 In these

experiments, we used the existing multiple alignment toolsto compute pairwise alignments

from the cross-validation setup.

Obtaining a proper cross-validated estimate of an aligner’s performance requires tuning

the program to multiple training collections, unbiased by testing set performance. For

most modern alignment programs, avoiding testing set bias is difficult since parameters are

typically tuned by hand. Methods with automatic training procedures, like PROBCONS,

permit cross-validation to some extent, with the caveat that the program by default uses

BLOSUM62-based amino acid frequencies estimated from data overlapping all testing sets.

In Table 3.2, the overall accuracies of most modern hand-tuned methods fall within a

one percent range (68-69%). The PROBCONS (Bali) method, which uses an automatic

unsupervised learning algorithm to infer parameters from all 141 BAliBASE 2 alignments,

3 The Align-m program, which was developed by the creator of the SABmark reference set, could not be tested on pairwise alignments since the current version (2.3) requires at least three input sequences for an alignment.


outperforms most other methods on the BAliBASE dataset except CLUSTALW, which is

based on a much more complex model with many internal parameters adjusted to maximize

performance on BAliBASE (Heringa, 2002). As previously suggested (Heringa, 2002;

Edgar, 2004b), CLUSTALW's lower relative performance on other databases suggests that

it may indeed be overfit to its training set.

To demonstrate the dangers of such overfitting, we trained CONTRAlign on the small

set of 38 BAliBASE sequences, with and without regularization. In this situation, omit-

ting regularization leads to tremendous overfitting to BAliBASE, with regularization giv-

ing a significant improvement in accuracy. Regularization, however, is not a substitute for

proper cross-validation; when overfitting to all four databases, CONTRAlign yields clearly

over-optimistic numbers compared to the properly cross-validated test. Similarly, cross-

validated PROBCONS (despite using BLOSUM62 amino acid frequencies and thus hav-

ing an easier learning task than CONTRAlign) performs worse than the non-cross-validated

model as expected, confirming that absence of cross-validation can give significantly unre-

alistic estimates of aligner performance.

As shown, cross-validated CONTRAlign (i.e., CONTRAlign-HYDROPATHY) beats current

state-of-the-art methods by 3-4% despite (1) estimating all model parameters, including the

emission matrix, and (2) following a rigorous cross-validated training procedure. Based on

the comparison of the hydropathy and basic models in Table 3.1, it is clear that these accu-

racy gains result directly from the use of hydropathy-based gap scoring. Perhaps most strik-

ing, however, is that a variety of existing methods, including CLUSTALW and MUSCLE,

already incorporate hydropathy-based modifications in their alignment scoring, yet do not

manage to achieve above 70% accuracy on our benchmarks. Disabling these modifications

in the respective programs gives no substantial change in performance for CLUSTALW

and greatly reduces MUSCLE accuracy.4 Our result confirms that hydropathy is indeed an

important signal for protein sequence alignment and that properly accounting for this can

yield significantly higher alignment accuracy than the current state-of-the-art.

4 Performing a sign test to compare performance when hydropathy scoring is either enabled or disabled yields p-values of 0.56 and 6.28 × 10⁻³¹ for CLUSTALW and MUSCLE, respectively.


Figure 3.3: Alignment accuracy curves. (a) Accuracy (y-axis, 0.25 to 0.75) as a function of training set size (x-axis, 2 to 64 alignments). The three curves give performance when using no, simple, and staged regularization. All data points are averages over 10 random training/test splits. (b) Accuracy (y-axis, 0.0 to 0.5) in the "twilight zone," for the 0-10% and 10-20% percent identity ranges. For each conservation range, the uncolored bars give accuracies for MAFFT (L-INS-i), T-Coffee, CLUSTALW, MUSCLE, and PROBCONS (Bali) in that order, and the colored bar indicates the accuracy for CONTRAlign.

3.3.4 Regularization and generalization performance

To understand the effects of regularization at low training set sizes, we reserved a set of 200 randomly chosen pairwise alignments pooled from all four reference databases to use as a testing set. We then experimented with learning parameters for the CONTRAlign-HYDROPATHY topology using varying training set sizes. For staged regularization, we considered a variant of the basic model in which we introduced amino acid emission features corresponding to the six-character reduced amino acid alphabet, {A, G, P, S, T}, {C}, {D, E, N, Q}, {F, W, Y}, {H, K, R}, {I, L, M, V}, in addition to the regular twenty-letter amino acid emissions (Edgar,

2004a). In the first regularization stage, the program learns a coarse-grained substitution

matrix, followed by finer-grained refinements in the second stage.

The results in Figure 3.3 (a) demonstrate that with intelligent use of regularization, good

accuracy can be achieved with only 20 example alignments, far fewer than the number

of blocks used to estimate traditional alignment substitution matrices such as BLOSUM

(Henikoff and Henikoff, 1992); nevertheless, the simpler regularization scheme was still

quite effective compared to having no regularization at all.

For specific classes of alignments, such as sequences with long insertions or composi-

tional biases, a robust training procedure allows one to tailor the alignment algorithm to the

data; when, in addition, training data is sparse, regularization deters overfitting and enables


further customization of alignment parameters. Furthermore, as the amount of available

training data grows, accuracy will continue to increase as well.

3.3.5 Alignment accuracy in the “twilight zone”

To understand the situations in which CONTRAlign-HYDROPATHY was most effective, we stratified the 487 sequences of our dataset into several percent identity ranges and measured the accuracy of all methods for each range. For alignments with at least 20% identity, all methods obtained similar accuracies, ranging from 87.2% to 88.7%. In the 0-10% and 10-20% identity ranges, however, CONTRAlign accuracy was substantially higher than that of other methods; here, CONTRAlign achieved cross-validated accuracies of 32.2% and

52.8% compared to non-cross-validated accuracy ranges of 25.7-26.8% and 43.0-46.5%

for all other methods (see Figure 3.3 (b)). Incorporating external sequence features such

as in the combined model of Section 3.3.2 yields accuracies of 48.0% and 68.5% (not

shown in figure), indicating that external sequence information can significantly increase

the reliability of alignments when available.

3.4 Discussion

Construction of a modern high-performance sequence alignment program involves under-

standing the variety of biological features available when performing alignment, building

a model of interactions demonstrating how those features may be combined in an aligner,

and careful cross-validation experiments to ensure good generalization performance of the

aligner on future data. In this chapter, we presented CONTRAlign, a pair conditional ran-

dom field for learning alignment parameters effectively even when small amounts of train-

ing data are available. Using regularization and holdout cross-validation, our algorithm au-

tomatically learns parameters with good generalization performance. Public domain source

code for CONTRAlign, datasets used in experiments from this chapter, and a web server

for submitting sequences are available online at http://contra.stanford.edu/contralign.


Since CONTRAlign specifies a conditional probability distribution over pairwise align-

ments, the PROBCONS methodology provides one straightforward extension of CON-

TRAlign to multiple alignment. The main limitation of the CONTRAlign framework,

however, is training time: L-BFGS gradient-based optimization is expensive, especially

in the context of the holdout cross validation procedure used. Typical training runs for the

experiments in this chapter (including holdout cross-validation to find regularization con-

stants) took approximately an hour on a 40-node Pentium IV cluster. For much larger scale

training sets, the computational expense of gradient-based training in this manner becomes

prohibitive; in Chapter 7, we propose novel techniques for convex optimization that are

applicable to the models described here, but that are also much more efficient in the regime

of large datasets. Perceptron learning (Collins, 2002), a recent technique for discrimina-

tively training structured probabilistic models, may provide another scalable alternative to

gradient-based optimization.

The primary advantage of CONTRAlign is its ability to free aligner developers to focus

on the biology of sequence alignment—modelling and feature selection—while transpar-

ently taking care of details such as parameter learning and generalization performance. The

models described in this chapter were only the first steps toward a better understanding of

the sequence alignment problem. Combining new CONTRAlign topologies and features

with known successful variants should result in even higher performance. A systematic

exploration of such possibilities remains to be done.


Chapter 4

RNA secondary structure prediction

without physics-based models

For several decades, free energy minimization methods have been the dominant strategy for single sequence RNA secondary structure prediction. More recently, stochastic

context-free grammars (SCFGs) have emerged as an alternative probabilistic methodology

for modeling RNA structure. Unlike physics-based methods, which rely on thousands of

experimentally-measured thermodynamic parameters, SCFGs use fully-automated statis-

tical learning algorithms to derive model parameters. Despite this advantage, however,

probabilistic methods have not replaced free energy minimization methods as the tool of

choice for secondary structure prediction, as the accuracies of the best current SCFGs have

yet to match those of the best physics-based models.

In this chapter, we present CONTRAfold, a novel secondary structure prediction method

based on conditional log-linear models (CLLMs), a flexible class of probabilistic models

which generalize upon SCFGs by using discriminative training and feature-rich scoring

models. In a series of cross-validation experiments, we show that grammar-based sec-

ondary structure prediction methods formulated as CLLMs consistently outperform their

SCFG analogs. Furthermore, CONTRAfold, a CLLM incorporating most of the features

found in typical thermodynamic models, achieves the highest single sequence prediction

accuracies to date, outperforming currently available probabilistic and physics-based tech-

niques. Our result thus closes the gap between probabilistic and thermodynamic models,


demonstrating that statistical learning procedures provide an effective alternative to empir-

ical measurement of thermodynamic parameters for RNA secondary structure prediction.

4.1 Introduction

In many RNA-related studies—ranging from noncoding RNA detection (Moulton, 2005)

to folding dynamics simulations (Wolfinger et al., 2004) to hybridization stability assess-

ment for microarray oligo probe selection (Rouillard et al., 2003)—knowing the secondary

structure of an RNA sequence reveals important constraints governing the molecule’s phys-

ical properties and function. To date, experimental assays for base-pairing in RNA se-

quences constitute the most reliable method for secondary structure determination (Furtig

et al., 2003); however, their difficulty and expense are often prohibitive, especially for

high-throughput applications. For this reason, computational prediction provides an attrac-

tive alternative to empirical discovery of RNA secondary structure (Gardner and Giegerich,

2004).

Traditionally, the most successful techniques for single sequence computational sec-

ondary structure prediction have relied on physics models of RNA structure. Methods

belonging to this category identify candidate structures for an RNA sequence by free en-

ergy minimization (Tinoco et al., 1971) through dynamic programming (e.g., Mfold (Zuker,

2003) and ViennaRNA (Hofacker et al., 1994)) or alternative optimization schemes (e.g.,

RDfolder (Ying et al., 2004)).

Parameters used in energy-based methods typically come from empirical studies of

RNA structural energetics. For example, parameters for nearest neighbor interactions in

stacking base pairs are derived from melting curves of synthesized oligonucleotides (Turner

et al., 1988). In some cases, however, the difficulty of experimental measurements places

severe restrictions on what parameters are measurable, and hence, the scoring models used. For instance, most secondary structure programs ignore the sequence dependence of hairpin, bulge, internal, and multi-branch loop energies due to the inability to quantify these

effects experimentally. Similarly, the energy of multi-branch loops in modern secondary

structure prediction programs relies on ad hoc scoring rules due to the lack of experimental

techniques for assessing the free energy contribution of multi-branch loops (Mathews et al., 1999).

Recently, stochastic context-free grammars (SCFGs) have emerged as an alternative

probabilistic methodology for modeling RNA structure (Durbin et al., 1998; Knudsen and

Hein, 1999, 2003). These models specify formal grammar rules that induce a joint prob-

ability distribution over possible RNA structures and sequences. In particular, the param-

eters of SCFG models specify probability distributions over possible transformations that may be applied to a nonterminal, and thus are subject to the standard mathematical constraints of probability distributions (i.e., parameters may not be negative, and certain sets of parameters must sum to one). Though these parameters do not have direct physical interpretations, they are easily learned from collections of RNA sequences annotated with

known secondary structures, without the need for external laboratory experiments (Dowell

and Eddy, 2004).

While fairly simple SCFGs achieve respectable prediction accuracies, attempts in re-

cent years to improve their performance using more sophisticated models have thus far

yielded only modest gains. As a result, a significant performance separation still remains

between the best physics-based methods and the best SCFGs (Dowell and Eddy, 2004).

Consequently, one might assume that such a gap is the inevitable price to be paid for using

easily learnable probabilistic models that do not provide an adequate representation of the

physics underlying RNA structural stability. We assert that this is not the case.

In this chapter, we present CONTRAfold, a new secondary structure prediction tool

based on a flexible probabilistic model, called a conditional log-linear model (CLLM).

CLLMs generalize upon SCFGs in the sense that any SCFG has an equivalent represen-

tation as an appropriately parameterized CLLM. Like SCFGs, CLLMs enjoy the ease of

computationally-driven parameter learning. Unlike vanilla SCFGs, however, CLLMs also

have the generality to represent complex scoring schemes, such as those used in modern

energy-based secondary structure predictors such as Mfold. CONTRAfold, a CLLM based

on a simplified Mfold-like scoring scheme, not only achieves the highest single sequence prediction accuracies to date but also provides users with a new mechanism for controlling

sensitivity and specificity of the prediction algorithm.


4.2 Methods

In this section, we motivate the use of CLLMs for RNA secondary structure prediction

by showing how they arise as a natural extension of SCFGs. We then describe the CON-

TRAfold secondary structure model, which extends and simplifies traditional energy-based

scoring schemes while retaining the parameter learning ease of common probabilistic meth-

ods. Finally, we describe a maximum expected accuracy decoding algorithm for secondary

structure prediction which allows the user to adjust the desired sensitivity/specificity of the

returned predictions via a single parameter γ.

4.2.1 Modeling secondary structure with SCFGs

In the RNA secondary structure prediction problem, we are given an input sequence x, and our goal is to predict the best structure y. For all of the algorithms described here, we restrict ourselves to the case of pseudoknot-free structures, i.e., structures that have no "crossing" base-pair interactions.1 Here, our focus is on probabilistic parsing techniques, for which we must define a way to calculate the conditional probability P(y|x) of the structure y given the sequence x.

Representation

Stochastic context-free grammars provide one possible way to compactly represent a joint

probability distribution over RNA sequences and their secondary structures. Below, we

provide a condensed review of context-free grammars and their stochastic extension. We

then provide an illustrative example showing the connection between stochastic context-free

grammars and RNA secondary structure prediction.

Context-free grammars (CFGs) arise in the study of formal languages as a mechanism

for generating the set of strings belonging to a language. In a CFG, the process of generating a string begins with a sequence consisting only of a single start symbol. In each step of

the generation process, the sequence is modified by replacing one of its symbols according

to a set of “transformation rules” (or production rules) describing the allowed substitutions

1 More formally, an RNA secondary structure y is pseudoknot-free provided that there exist no two base-pairings $(x_i, x_j)$ and $(x_k, x_l)$ such that $i < k < j < l$.


that may be made. Each transformation rule has the form $A \to B_1 B_2 \ldots B_k$, indicating that symbol A may be replaced by the sequence $B_1 B_2 \ldots B_k$. Symbols that appear on the left

side of some transformation rule are known as nonterminal symbols; the remaining sym-

bols, which never appear on the left side of any transformation rule, are known as terminal

symbols. The iterative replacement process ends whenever no valid substitutions remain,

i.e., the sequence consists of only terminal symbols. The history of substitutions needed

in order to generate this final sequence of terminal symbols starting from some original

nonterminal start symbol is often referred to as a “parse” (or derivation).

Stochastic context-free grammars (SCFGs) are an extension of context-free grammars

in which each transformation rule is associated with a probability of usage; for any given

nonterminal symbol, the sum of the probabilities for all transformation rules that could be

applied to that nonterminal symbol must be equal to one. Once these rule probabilities have been defined, an SCFG can then be said to represent a probability distribution over possible parses, where the probability of a parse is simply modeled as the product of the

probabilities associated with each of the transformation rules used.

To apply SCFGs specifically to the problem of RNA secondary structure prediction,

one must first define a set of transformation rules for the underlying CFG, and a probability

distribution over the transformation rules applicable to each nonterminal for the SCFG.

Then, one must establish a correspondence between the symbol sequences generated by

the SCFG and RNA secondary structures that might arise in practice. This last step can be

done in a variety of ways, but we provide a simple example below.

Consider the following simple SCFG for a restricted class of RNA secondary structures:

1. Transformation rules.

   S → aSu        S → aS
   S → uSa        S → cS
   S → cSg        S → gS
   S → gSc        S → uS
   S → gSu        S → ε
   S → uSg


In this CFG, the only nonterminal symbol is S. There are a total of 11 different types of transformation rules. The symbols a, c, g, and u all represent different types of terminal characters. The character ε is not a symbol, but rather should be interpreted as an "empty string" containing no symbols.

2. Rule probabilities. The probability of transforming a nonterminal S into aSu is $p_{S \to aSu}$, and similarly for the other transformation rules. The sum of all of these probabilities must be equal to one.

3. Mapping from parses to structures. The secondary structure y corresponding to a parse σ contains a base pairing between two letters if and only if the two letters were generated in the same step of the derivation/parse for σ.

For notational convenience, the secondary structure of a sequence can be represented in nested parentheses format, in which pairs of matching parentheses represent base pairings in the sequence. For a sequence x = agucu with secondary structure y = ((.)), the nucleotide a in the first position pairs with u in the last position, and g pairs with c. For the set of transformation rules above, there is only one possible sequence of rule applications that transforms an initial starting nonterminal S into the string agucu such that a and u are generated simultaneously, and g and c are generated simultaneously. This unique parse σ, corresponding to the secondary structure y, is

    S → aSu → agScu → aguScu → agucu.    (4.1)

The SCFG models the joint probability of generating the parse σ and the sequence x as

    $P(x, \sigma) = p_{S \to aSu} \cdot p_{S \to gSc} \cdot p_{S \to uS} \cdot p_{S \to \epsilon}$.    (4.2)

It follows that

    $P(y \mid x) = \sum_{\sigma \in y} P(\sigma \mid x) = \frac{\sum_{\sigma \in y} P(x, \sigma)}{\sum_{\sigma' \in \Omega(x)} P(x, \sigma')}$,    (4.3)

where Ω(x) is the space of all possible parses of x. Here, we regard y as a "set" of parses


σ sharing the same secondary structure. Note that in ambiguous grammars, the mapping

from parses to secondary structures may be many-to-one.
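To make equations 4.1 and 4.2 concrete, the following lines of Python compute the joint probability of the unique parse of agucu under the toy grammar; the rule probabilities used here are invented for illustration only.

    # Made-up rule probabilities for the toy grammar above.
    p = {'S->aSu': 0.10, 'S->gSc': 0.10, 'S->uS': 0.05, 'S->eps': 0.20}

    # The unique parse of equation 4.1: S -> aSu -> agScu -> aguScu -> agucu.
    derivation = ['S->aSu', 'S->gSc', 'S->uS', 'S->eps']

    joint = 1.0
    for rule in derivation:
        joint *= p[rule]   # equation 4.2: product of the rule probabilities used
    print(joint)           # 0.10 * 0.10 * 0.05 * 0.20 = 1.0e-04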

Parameter estimation

One of the chief advantages of SCFGs as a language for describing RNA secondary struc-

ture is the existence of well-understood algorithms for parameter estimation. Given a set $\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ of m pairs of RNA sequences $x^{(i)}$ with experimentally-validated secondary structures $y^{(i)}$, the training task involves finding the set of parameters $\theta = \{p_1, \ldots, p_n\}$ (i.e., the probabilities for each of the n transformation rules) that maximize some specified objective function.

In the popular maximum likelihood approach, θ is chosen to maximize the joint likelihood of the training sequences and their structures,

    $\ell_{\mathrm{ML}}(\theta : \mathcal{D}) = \prod_{i=1}^{m} P(x^{(i)}, y^{(i)}; \theta)$,    (4.4)

subject to the constraints that all parameters must be nonnegative, and certain groups of parameters must sum to one. For unambiguous grammars, the solution $\theta_{\mathrm{ML}}$ to this constrained

optimization problem exists in closed form. Consequently, the maximum likelihood tech-

nique is by far the most commonly used method for SCFG parameter estimation in practice.

4.2.2 From SCFGs to CLLMs

Like SCFGs, conditional log-linear models (CLLMs) are probabilistic models which have the goal of defining the conditional probability of an RNA secondary structure y given a sequence x. Here, we motivate the CLLM framework by comparison to SCFGs.

Representation

To understand how CLLMs generalize upon the representation of conditional probabili-

ties for SCFGs, we first consider a feature-based representation of SCFGs that highlights

several important assumptions made when modeling with SCFGs. Removing these as-

sumptions leads directly to the CLLM framework.


For a particular parse σ of a sequence x, let $\mathbf{F}(x, \sigma) \in \mathbb{R}^n$ be an n-dimensional feature vector (where n is the number of rules in the grammar) whose ith dimension, $F_i(x, \sigma)$, indicates the number of times the ith transformation rule is used in parse σ. Furthermore, let $p_i$ denote the probability for the ith transformation rule. We rewrite the joint likelihood of the sequence x and its parse σ in log-linear form as

    $P(x, \sigma) = \prod_{i=1}^{n} p_i^{F_i(x,\sigma)} = \exp\left(\ln \prod_{i=1}^{n} p_i^{F_i(x,\sigma)}\right) = \exp\left(\sum_{i=1}^{n} F_i(x, \sigma) \ln p_i\right) = \exp(\mathbf{w}^T \mathbf{F}(x, \sigma))$,    (4.5)

where $w_i = \ln p_i$. Substituting this form into equation 4.3,

    $P(y \mid x) = \frac{\sum_{\sigma \in y} \exp(\mathbf{w}^T \mathbf{F}(x, \sigma))}{\sum_{\sigma' \in \Omega(x)} \exp(\mathbf{w}^T \mathbf{F}(x, \sigma'))}$.    (4.6)

In this alternate form, we see that SCFGs are actually log-linear models with the re-

strictions that

1. the parameters $w_1, \ldots, w_n$ correspond to log probabilities and hence obey a number of constraints (e.g., all parameters must be negative), and

2. the features $F_1(x, \sigma), \ldots, F_n(x, \sigma)$ derive directly from the grammar; thus the types of features are restricted by the complexity of the grammar.

In both cases, the imposed restriction is unnecessary if we simply wish to ensure that the

conditional probability in equation 4.6 is well-defined. Removing these restrictions, thus,

is the basis for the CLLM framework. More generally, CLLMs are probabilistic models

defined by equation 4.6, in the case that the parametersw1, . . . , wn may take on any real

values, and the feature vectors are similarly unrestricted. Note that conditional random

fields (CRFs) (Lafferty et al., 2001) are a specialized class of CLLMs whose probability

distributions are defined in terms of graphical models.


Figure 4.1: Positions in a sequence of length L = 10 (the example sequence AGAGACUUCU). Here, let $x_i$ denote the ith nucleotide of x. For ease of notation, we say that there are L + 1 positions corresponding to x—one position at each of the two ends of x, and L − 1 positions between consecutive nucleotides of x. We assign indices ranging from 0 to L for each position.

Parameter estimation

By definition, CLLMs parameterize the conditional probability P(y|x) as a log-linear function of the model's features $\mathbf{F}(x, \sigma)$, but they provide no manner for calculating P(x, y). As a side effect, straight maximum likelihood techniques, which optimize this joint probability, do not apply to CLLMs.

Instead, CLLM training relies on the conditional maximum likelihood principle, in which one finds the parameters $\mathbf{w}_{\mathrm{CML}} \in \mathbb{R}^n$ that maximize the conditional likelihood of the structures given the sequences,

    $\ell_{\mathrm{CML}}(\mathbf{w} : \mathcal{D}) = \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}; \mathbf{w})$.    (4.7)

In practice, we avoid overfitting by placing a zero-mean Gaussian regularization prior on

the parameters, and selecting the variance of the prior using holdout cross-validation on

training data only (see Results). Arguably, for prediction problems, conditional likelihood

(or discriminative) training is more natural than joint likelihood (or generative) training as

it focuses on finding parameters that give good predictive performance without attempting

to model the distribution over input sequences x.

The mechanics of performing the probabilistic inference tasks required in the opti-

mization of equation 4.7 follow closely the traditional inside and outside algorithms for

SCFGs (Durbin et al., 1998).
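For illustration, the regularized training objective (the negative logarithm of equation 4.7 plus the Gaussian prior term) can be sketched as below. The log_partition oracle, which would be implemented with the inside algorithm just mentioned, and the example format are both hypothetical.

    import numpy as np

    def neg_penalized_cond_loglik(w, examples, log_partition, sigma2):
        # log P(y|x; w) = log Z(x, y) - log Z(x), where Z(x, y) sums
        # exp(w . F(x, sigma)) over parses consistent with y, and Z(x)
        # sums over all parses of x (equation 4.6).
        nll = 0.0
        for x, y in examples:
            nll -= log_partition(w, x, restrict=y) - log_partition(w, x, restrict=None)
        # Zero-mean Gaussian prior with variance sigma2 on each parameter.
        return nll + np.dot(w, w) / (2.0 * sigma2)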


4.2.3 From energy-based models to CLLMs

Converting an SCFG to a CLLM by removing restrictions on the parameter vector w and

training via conditional likelihood allows SCFGs to obtain many of the benefits of the dis-

criminative learning approach. Straightforward conversions of this sort are routine in the

machine learning literature and have recently been applied to RNA secondary structure

alignment (Sato and Sakakibara, 2005). Such conversions, however, do not take full ad-

vantage of the expressivity of CLLMs. In particular, the ability of CLLMs to use generic

feature representations means that in some cases, CLLMs can conveniently represent mod-

els which do not have compact parameterizations as SCFGs.

For example, the QRNA algorithm (Rivas and Eddy, 2000) attempts to capture the

salient properties of standard thermodynamic models for RNA secondary structure, such as loop lengths and base-stacking, via an SCFG. This conversion, however, is only approximate. In particular, the usual energy rules (Turner et al., 1988; Mathews et al., 1999) contain terminal mismatch terms describing the interaction between closing base pairs of helices and nucleotides in the adjacent loop. These interactions are ignored in QRNA, and more generally, are difficult to incorporate in SCFG models without considerably increasing grammar complexity. As the authors themselves note, QRNA underperforms compared

ing grammar complexity. As the authors themselves note, QRNAunderperforms compared

to standard folders, highlighting the difficulty of building SCFGs on par with energy-based

methods (Rivas and Eddy, 2000).

Contrastingly, the complex scoring terms of thermodynamic models transfer to CLLMs

with no difficulties. In the standard model, the energy of a folding σ decomposes as the

sum of energies for hairpin, interior, bulge, stacking pair, and multi-branch loops:

    $\Delta E_\sigma = \sum_{i,j} \Delta E_{\mathrm{hairpin}(i,j)} \cdot I_{\{\mathrm{hairpin}(i,j) \in \sigma\}}$    (4.8)
    $\qquad + \sum_{i,j,\ell} \Delta E_{\mathrm{helix}(i,j,\ell)} \cdot I_{\{\mathrm{helix}(i,j,\ell) \in \sigma\}}$    (4.9)
    $\qquad + \sum_{i,j,i',j'} \Delta E_{\mathrm{single\ branch}(i,j,i',j')} \cdot I_{\{\mathrm{single\ branch}(i,j,i',j') \in \sigma\}}$    (4.10)
    $\qquad + \sum_{i,j,i_1,j_1,\ldots,i_k,j_k} \Delta E_{\mathrm{multi\text{-}branch}(i,j,i_1,j_1,\ldots,i_k,j_k)} \cdot I_{\{\mathrm{multi\text{-}branch}(i,j,i_1,j_1,\ldots,i_k,j_k) \in \sigma\}}$    (4.11)


In turn, the energy of each type of loop further decomposes as the sum of interaction energies over individual features of the sequence x and its parse σ. For example, the score for a hairpin loop spanning positions i to j of a sequence x includes energy terms for

a. terminal mismatch stacking energies as a function of the closing base pair $(x_i, x_{j+1})$ and the first unpaired nucleotides inside the loop, $x_{i+1}$ and $x_j$,

b. energies for loops of length up to 30 (i.e., a hairpin spanning positions i to j has length j − i),

c. energies for loops of length greater than 30 of the form $C_1 + C_2 \cdot \ln(j - i)$,

d. bonuses for loops containing specific nucleotides,

e. bonuses for loops closed by a gu base pair immediately preceded by two g's on the 5' side, and

f. bonuses for loops containing only c's.

Mathematically, this can be expressed as

    $\Delta E_{\mathrm{hairpin}(i,j)} = \Delta E_{\mathrm{hairpin\ length}}[j-i] \cdot I_{\{0 \le j-i \le 30\}}$
    $\qquad + \Delta E_{\mathrm{loop\ multiplier}} \cdot (I_{\{j-i>30\}} \cdot \ln(j-i)) + \Delta E_{\mathrm{loop\ base}} \cdot I_{\{j-i>30\}}$
    $\qquad + \sum_{a,b,c,d} \Delta E_{\mathrm{mismatch}}((a,b),c,d) \cdot I_{\{(x_i,x_{j+1})=(a,b),\ x_{i+1}=c,\ x_j=d\}} + \cdots$    (4.12)

To construct a CLLM, it suffices to replace the free energy terms with the corresponding


parameters,

    $w_{\mathrm{hairpin}(i,j)} = w_{\mathrm{hairpin\ length}}[j-i] \cdot I_{\{0 \le j-i \le 30\}}$
    $\qquad + w_{\mathrm{loop\ multiplier}} \cdot (I_{\{j-i>30\}} \cdot \ln(j-i)) + w_{\mathrm{loop\ base}} \cdot I_{\{j-i>30\}}$
    $\qquad + \sum_{a,b,c,d} w_{\mathrm{mismatch}}((a,b),c,d) \cdot I_{\{(x_i,x_{j+1})=(a,b),\ x_{i+1}=c,\ x_j=d\}} + \cdots$    (4.13)

This final result can be written more compactly in vector notation as

    $\mathbf{w}^T \mathbf{F}(x, \sigma) = \begin{pmatrix} w_{\mathrm{hairpin\ length}}[0] \\ w_{\mathrm{hairpin\ length}}[1] \\ w_{\mathrm{hairpin\ length}}[2] \\ \vdots \end{pmatrix}^T \begin{pmatrix} \#\text{ of hairpins in } \sigma \text{ of length } 0 \\ \#\text{ of hairpins in } \sigma \text{ of length } 1 \\ \#\text{ of hairpins in } \sigma \text{ of length } 2 \\ \vdots \end{pmatrix}$.    (4.14)

Finally, using the model assumptions of CLLMs, we have

    $P(\sigma \mid x) = \frac{\exp(\mathbf{w}^T \mathbf{F}(x, \sigma))}{\sum_{\sigma' \in \Omega(x)} \exp(\mathbf{w}^T \mathbf{F}(x, \sigma'))}$.    (4.15)

In short, the conversion process involves expressing the total energy of a parse σ as a linear function of counts for joint features $F_i(x, \sigma)$ of the sequence x and the parse σ. Once this is done, substituting into equation 4.6 gives a probabilistic model whose Viterbi parse is the minimum energy parse. Thus, in the CLLM equivalent of the standard thermodynamic scoring, the parameters $w_1, \ldots, w_n$ replace the interaction energy contributions for various secondary elements, and the features $F_1(x, \sigma), \ldots, F_n(x, \sigma)$ count the number of times a particular interaction term appears in the parse σ.
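Concretely, the conversion reduces scoring to a dot product between weights and feature counts, as in this sketch (the feature names are invented for the example and are not CONTRAfold's actual feature identifiers):

    def cllm_score(weights, counts):
        # w^T F(x, sigma), as in equation 4.14.
        return sum(weights.get(f, 0.0) * c for f, c in counts.items())

    # A parse with one hairpin of length 4 and two cg/gc stacks might yield:
    counts = {('hairpin_length', 4): 1, ('stack', 'cg', 'gc'): 2}
    weights = {('hairpin_length', 4): -1.3, ('stack', 'cg', 'gc'): 2.1}
    score = cllm_score(weights, counts)   # higher score corresponds to lower energy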

4.2.4 The CONTRAfold model

The CONTRAfold program implements a CLLM for RNA secondary structure prediction,

following the general strategy for model construction outlined in the previous section. The


specific features in CONTRAfold include:

1. base pairs,

2. helix closing base pairs,

3. hairpin lengths,

4. helix lengths,

5. bulge loop lengths,

6. internal loop lengths,

7. internal loop asymmetry,

8. full two-dimensional table of internal loop scores,

9. helix base pair stacking interactions,

10. terminal mismatch interactions,

11. single (dangling) base stacking,

12. affine multi-branch loop scoring, and

13. free bases.

These features appear in the scoring decomposition for the various types of secondary

structural elements that may appear in an RNA. In particular, we have

1. Hairpin loops (see Figure 4.2)

    $w_{\mathrm{hairpin}}(i,j) = w_{\mathrm{terminal\ mismatch}}((x_i, x_{j+1}), x_{i+1}, x_j) + \begin{cases} w_{\mathrm{hairpin\ length}}[j-i] & \text{if } 0 \le j-i \le 30 \\ w_{\mathrm{loop\ base}} + w_{\mathrm{loop\ multiplier}} \cdot \ln(j-i) & \text{if } j-i > 30. \end{cases}$

Hairpin loop scoring in CONTRAfold accounts for terminal mismatch interactions

between the hairpin closing base pair and the adjacent nucleotides in the hairpin loop


5’

3’

.A.G.A| | |.U.C.U

.

.

A

G

.

.

A

U

.

.

G

A.

nucleotide i

nucleotide j + 1

position i

position j

Figure 4.2: Correspondence between energy-based model scores and CLLM potentials forhairpin loops in CONTRAfold. The nucleotides comprising the hairpin loop are shown inred. Green dotted lines indicate the groups of nucleotides involved in the terminal mismatchinteractions considered by CONTRAfold.

(see left diagram), and hairpin lengths. Following energy-based models, we set the

score for hairpins of length beyond 30 to be affine in the logarithm of the hairpin

size.

2. Internal and bulge loops (see Figure 4.3)

    $w_{\mathrm{single}}(i,j,i',j') = f_{\mathrm{single\ length}}(i'-i,\ j-j') + w_{\mathrm{terminal\ mismatch}}((x_i, x_{j+1}), x_{i+1}, x_j) + w_{\mathrm{terminal\ mismatch}}((x_{j'}, x_{i'+1}), x_{j'+1}, x_{i'})$

Internal and bulge loops (also known as single branch loops) have exactly two adjacent base pairs. CONTRAfold scores internal and bulge loops by accounting for loop

lengths and terminal mismatches (see left diagram). Loop lengths are scored accord-

ing to equation 4.16. Terminal mismatches are scored in the same way as terminal

mismatches in hairpin loops.

3. Stacking pairs and helices (see Figure 4.4)


5’

3’

.A.G.A| | |.U.C.U

.

.

A.

G

A.

.

| | |G

U

.

.

G

C

.

.

U

A

.

.

.

.

.

.

nucleotide i

nucleotide j + 1

position i

position j

position i′

position j′

Figure 4.3: Correspondence between energy-based model scores and CLLM potentials forinternal and bulge loops in CONTRAfold. The nucleotides comprising the internal or bulgeloop are shown in red. Green dotted lines indicate the groupsof nucleotides involved in theterminal mismatch interactions considered by CONTRAfold.

5’

3’

.A.G.A.C.U.G

.U.C.U.G.A.C| | | | | |

.

.

A

G

.

.

.

.

.

.

position i

position j

nucleotide i + 1

nucleotide j

nucleotide i + ℓ

nucleotide j − ℓ + 1

Figure 4.4: Correspondence between energy-based model scores and CLLM potentialsfor stacking pairs and helices in CONTRAfold. The nucleotidescomprising the helix areshown in red. Green dotted lines indicate the groups of nucleotides involved in the stackinginteractions considered by CONTRAfold.


5’

3’

.A.G.A| | |.U.C.U

.A.G | C.

.U.A . U.

C. ...A.G.A|||...U.C.U

. .A | U. .C | G. .. .. .

position j

position i

position j2

position i1

positions j1, i2

Figure 4.5: Correspondence between energy-based model scores and CLLM potentials formulti-branch loops in CONTRAfold. The nucleotides comprising the multi-branch loopare shown in red. Green dotted lines indicate the groups of nucleotides involved in thesingle base stacking interactions considered by CONTRAfold.

    $w_{\mathrm{helix}}(i,j,\ell) = \sum_{k=1}^{\ell} w_{\mathrm{base\ pair}}((x_{i+k}, x_{j+1-k})) + \sum_{k=1}^{\ell-1} w_{\mathrm{stacking}}((x_{i+k}, x_{j+1-k}), (x_{i+k+1}, x_{j-k})) + w_{\mathrm{helix\ closing}}(x_{i+1}, x_j) + w_{\mathrm{helix\ closing}}(x_{j-\ell+1}, x_{i+\ell})$

CONTRAfold scores helices (regions of adjacent stacking base pairs) by including terms for the contribution of closing base pairs as well as each stacking and base pair interaction. Unlike the standard thermodynamic model, which allows only canonical Watson-Crick or wobble gu pairs, CONTRAfold allows for all possible nucleotide

combinations.

4. Multi-branch loops (see Figure 4.5)

    $w_{\mathrm{multi}}(i,j,i_1,j_1,\ldots,i_m,j_m) = w_{\mathrm{multi\ base}} + w_{\mathrm{multi\ paired}} \cdot (m+1) + w_{\mathrm{multi\ unpaired}} \cdot \ell$
    $\qquad + w_{\mathrm{stacking\ left}}((x_i, x_{j+1}), x_{i+1}) + w_{\mathrm{stacking\ right}}((x_i, x_{j+1}), x_j)$
    $\qquad + \sum_{k=1}^{m} w_{\mathrm{stacking\ left}}((x_{j_k}, x_{i_k+1}), x_{j_k+1}) + \sum_{k=1}^{m} w_{\mathrm{stacking\ right}}((x_{j_k}, x_{i_k+1}), x_{i_k})$

Multi-branch loops have at least three adjacent base pairs. Scoring of multi-branch


loops in CONTRAfold accounts for the single base stacking interactions and the

length dependence of multi-branch loops. For computational efficiency (Mathews

et al., 1999), multi-branch loop length scores are affine in the number of unpaired

nucleotides (ℓ) and the number of adjacent base pairs (m + 1).

To a large extent, the features used in the CONTRAfold model closely mirror the fea-

tures employed in traditional thermodynamic models of RNA secondary structure. We

point out a few key differences:

1. CONTRAfold makes use of generic feature sets without incorporating “special cases”

typical of thermodynamic scoring models. For instance, CONTRAfold

- omits the bonus free energies for special case hairpin loops (specifically items

(d) through (f) from the list in Section 4.2.3).

- does not contain a table exhaustively enumerating all possible 1×1, 1×2, 2×2, and 2×3 internal loops.

While such features may be useful, they are more likely to lead to overfitting due to the large number of parameters that must be trained.2 Incorporation of a small number of specially selected interactions which are known to be particularly important a priori is more feasible.

2. Internal and bulge loop lengths are scored separately as a function of the lengths $\ell_1$ and $\ell_2$ of each side of the loop (transcribed as a code sketch after this list):

    $f_{\mathrm{single\ length}}(\ell_1, \ell_2) = \begin{cases} w_{\mathrm{bulge\ length}}[\ell_1 + \ell_2] & \text{if } \ell_1 \ell_2 = 0 \\ w_{\mathrm{internal\ length}}[\ell_1 + \ell_2] & \text{otherwise} \end{cases} \; + \; w_{\mathrm{internal\ asymmetry}}[\,|\ell_1 - \ell_2|\,] + w_{\mathrm{internal\ correction}}[\ell_1][\ell_2]$.    (4.16)

2 This may be considered an advantage of physics-based methods; a hybrid approach which combines machine learning with physics-based prior knowledge may help alleviate the burden on the learning algorithm.


In most thermodynamic models, only bulge and internal loop length score tables

exist, whereas internal asymmetry is scored according to the Ninio equations (Pa-

panicolaou et al., 1984). Here, CONTRAfold instead learns an explicit scoring table

$w_{\mathrm{internal\ asymmetry}}[\cdot]$ for internal loop asymmetry in addition to a two-dimensional correction matrix $w_{\mathrm{internal\ correction}}[\cdot][\cdot]$ for representing dependencies not captured by total

loop length and asymmetry alone.

3. Unlike typical energy minimization schemes, the energy of a helix consists not only

of stacking interactions but also direct base pair interactions. Also, all combina-

tions of nucleotide pairs are allowed, unlike the standard nearest neighbor model in

which only canonical Watson-Crick or wobble gu pairs are permitted. Finally, CONTRAfold introduces new scoring terms for helix lengths (via an explicit scoring table for helices of length up to 5 and affine afterwards), which are not part of the standard

nearest neighbor model.

4. Since little is currently known about the energetics of free bases (bases which do

not belong to any other loop in the secondary structure), they are typically ignored

by energy-based folders. Here, CONTRAfold introduces two scoring parameters:

$w_{\mathrm{outer\ unpaired}}$ for scoring each free base, and $w_{\mathrm{outer\ paired}}$ for scoring each base pair adjacent to a free base.

5. For simplicity, CONTRAfold scores terminal mismatches for hairpins, bulges, and

internal loops using the same parameters. CONTRAfold also does not account for

coaxial stacking dependencies when scoring multi-branch loops. Like the special

case hairpin loops mentioned earlier, making more specific scoring models by differ-

entiating between these terminal mismatches may improve prediction accuracy.
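As promised above, here is a direct transcription of equation 4.16 into Python; the scoring tables are assumed to be stored in a dict w whose key names are hypothetical.

    def f_single_length(l1, l2, w):
        # l1 and l2 are the numbers of unpaired nucleotides on each side of the loop.
        if l1 * l2 == 0:                         # one side empty: bulge loop
            total = w['bulge_length'][l1 + l2]
        else:                                    # both sides nonempty: internal loop
            total = w['internal_length'][l1 + l2]
        total += w['internal_asymmetry'][abs(l1 - l2)]
        total += w['internal_correction'][l1][l2]
        return total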

More details of the exact CONTRAfold model, including the dynamic programming

recurrences used to implement the algorithm, are presented in Appendix A.


4.2.5 Maximum expected accuracy parsing with sensitivity/specificity

tradeoff

Most physics-based approaches to secondary structure prediction use dynamic program-

ming to recover the structure with minimum free energy (Zuker, 2003; Hofacker et al.,

1994). For probabilistic methods, the Viterbi algorithm (alternatively known as the CYK

algorithm (Durbin et al., 1998) for SCFGs) fulfills this function by finding the most likely

parse,3

    $\sigma_{\mathrm{viterbi}} = \arg\max_{\sigma \in \Omega(x)} P(\sigma \mid x; \mathbf{w})$.    (4.17)

Here, we describe an alternative scheme that, for a given setting of a sensitivity/specificity tradeoff parameter γ, identifies the structure with maximum expected accuracy.

In particular, for a candidate structure $\hat{y}$ with true structure y, let $\mathrm{accuracy}_\gamma(y, \hat{y})$ denote the number of correctly unpaired positions in $\hat{y}$ (with respect to y) plus γ times the number of correctly paired positions in $\hat{y}$. Then, we wish to find

    $y_{\mathrm{mea}} = \arg\max_{\hat{y}} E_{y|x}[\mathrm{accuracy}_\gamma(y, \hat{y})]$,    (4.18)

where the expectation is taken with respect to the conditional distribution over structures of

the sequence x. An apparent difficulty is the cost of computing this expectation; however,

it turns out that this expectation can be decomposed very naturally in terms of quantities

that are easily computed using the inside-outside algorithm. To see why this is the case,

3 For unambiguous grammars, the most likely parse is also the most likely secondary structure; however, this is not the case for ambiguous grammars (Dowell and Eddy, 2004; Reeder et al., 2005).


observe that

    $E_{y|x}[\mathrm{accuracy}_\gamma(y, \hat{y})]$
    $\quad = E_{y|x}\Big[\, 2\gamma \sum_{1 \le i < j \le L} I_{\{(i,j)\ \mathrm{paired\ in}\ \hat{y}\ \wedge\ (i,j)\ \mathrm{paired\ in}\ y\}} + \sum_{1 \le i \le L} I_{\{i\ \mathrm{unpaired\ in}\ \hat{y}\ \wedge\ i\ \mathrm{unpaired\ in}\ y\}} \Big]$    (4.19)
    $\quad = 2\gamma \sum_{1 \le i < j \le L} E_{y|x}\big[I_{\{(i,j)\ \mathrm{paired\ in}\ \hat{y}\ \wedge\ (i,j)\ \mathrm{paired\ in}\ y\}}\big] + \sum_{1 \le i \le L} E_{y|x}\big[I_{\{i\ \mathrm{unpaired\ in}\ \hat{y}\ \wedge\ i\ \mathrm{unpaired\ in}\ y\}}\big]$    (4.20)
    $\quad = 2\gamma \sum_{(i,j)\ \mathrm{paired\ in}\ \hat{y}} E_{y|x}\big[I_{\{(i,j)\ \mathrm{paired\ in}\ y\}}\big] + \sum_{i\ \mathrm{unpaired\ in}\ \hat{y}} E_{y|x}\big[I_{\{i\ \mathrm{unpaired\ in}\ y\}}\big]$    (4.21)
    $\quad = 2\gamma \sum_{(i,j)\ \mathrm{paired\ in}\ \hat{y}} P((i,j)\ \mathrm{paired\ in}\ y \mid x) + \sum_{i\ \mathrm{unpaired\ in}\ \hat{y}} P(i\ \mathrm{unpaired\ in}\ y \mid x)$.    (4.22)

Here, the main idea is that after writing the accuracy function as a sum of indicator vari-

ables, we may use the linearity-of-expectations argument shown above in order to reduce

the quantity to be computed into the summation of probabilities. The terms of the form

$P((i,j)\ \mathrm{paired\ in}\ y \mid x)$ follow directly from a variant of the standard inside/outside algorithm (McCaskill, 1990). The other terms can be computed as

    $P(i\ \mathrm{unpaired\ in}\ y \mid x) = 1 - \sum_{1 \le j \le L:\ j \ne i} P((i,j)\ \mathrm{paired\ in}\ y \mid x)$.    (4.23)

Given the above decomposition, we can now specify an efficient algorithm for identifying a parse with high expected accuracy. For any $0 \le i \le j \le L$, consider the following recurrence:

    $M_{i,j} = \max \begin{cases} 0 & \text{if } i = j \\ P(i+1\ \mathrm{unpaired\ in}\ y \mid x) + M_{i+1,j} & \text{if } i < j \\ P(j\ \mathrm{unpaired\ in}\ y \mid x) + M_{i,j-1} & \text{if } i < j \\ \gamma \cdot 2P((i+1,j)\ \mathrm{paired\ in}\ y \mid x) + M_{i+1,j-1} & \text{if } i+2 \le j \\ M_{i,k} + M_{k,j} & \text{if } i < k < j. \end{cases}$    (4.24)


A simple inductive argument shows that M_{0,L} = max_{ŷ} E_{y|x}[ accuracy_γ(ŷ, y) ], the expected accuracy achieved by the MEA structure. Including the traceback for recovering the optimal structure, the parsing algorithm based on the dynamic programming recurrence above takes O(L³) time and O(L²) space.
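To make the recurrence concrete, the following is a minimal Python sketch of MEA parsing with traceback. It assumes, hypothetically, that the pairing posteriors P((i, j) paired in y | x) have already been computed (e.g., by a McCaskill-style inside/outside pass) and are supplied as an (L+1) × (L+1) matrix with nonzero entries only for 1 ≤ i < j ≤ L; it illustrates recurrence (4.24) and is not CONTRAfold's actual implementation.

import numpy as np

def mea_parse(p_paired, gamma=1.0):
    """Maximum expected accuracy parsing via recurrence (4.24).

    p_paired[i][j] = P((i, j) paired in y | x) for 1 <= i < j <= L;
    all other entries are assumed zero. Returns the optimal expected
    accuracy M[0][L] and the set of predicted base pairs."""
    L = p_paired.shape[0] - 1
    # Unpaired probabilities from equation (4.23).
    p_unpaired = 1.0 - (p_paired.sum(axis=0) + p_paired.sum(axis=1))
    M = np.zeros((L + 1, L + 1))
    back = {}
    for span in range(1, L + 1):
        for i in range(L + 1 - span):
            j = i + span
            # Leave position i+1 or position j unpaired.
            best, arg = p_unpaired[i + 1] + M[i + 1, j], ('L',)
            if p_unpaired[j] + M[i, j - 1] > best:
                best, arg = p_unpaired[j] + M[i, j - 1], ('R',)
            # Pair positions (i+1, j), rewarded by 2 * gamma * posterior.
            if i + 2 <= j:
                cand = 2.0 * gamma * p_paired[i + 1, j] + M[i + 1, j - 1]
                if cand > best:
                    best, arg = cand, ('P',)
            # Bifurcation into two independent subproblems.
            for k in range(i + 1, j):
                if M[i, k] + M[k, j] > best:
                    best, arg = M[i, k] + M[k, j], ('B', k)
            M[i, j], back[(i, j)] = best, arg
    pairs, stack = set(), [(0, L)]
    while stack:  # traceback to recover the optimal structure
        i, j = stack.pop()
        if i == j:
            continue
        op = back[(i, j)]
        if op[0] == 'L':
            stack.append((i + 1, j))
        elif op[0] == 'R':
            stack.append((i, j - 1))
        elif op[0] == 'P':
            pairs.add((i + 1, j))
            stack.append((i + 1, j - 1))
        else:
            stack.extend([(i, op[1]), (op[1], j)])
    return M[0, L], pairs

Raising γ above 1 rewards confident base pairs more heavily, which is exactly the sensitivity/specificity knob discussed next.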

Note that in the above algorithm, γ controls the balance between the sensitivity and specificity of the returned structure: higher values of γ encourage the parser to predict more base pairings, whereas lower values of γ restrict the parser to predicting only base pairs in which the algorithm is extremely confident. When γ = 1, the algorithm maximizes the expected number of correct positions and is identical to the parsing technique used in Pfold (Knudsen and Hein, 2003). As shown in the Results section, by allowing γ to vary, we may adjust the sensitivity and specificity of the parsing algorithm as desired.

4.3 Results

To assess the suitability of CLLMs as models for RNA secondary structure, we performed a series of cross-validation experiments using known consensus secondary structures of noncoding RNA families taken from the Rfam database (Griffiths-Jones et al., 2003, 2005). Specifically, version 7.0 of Rfam contains seed multiple alignments for 503 noncoding RNA families, with consensus secondary structures for each alignment either taken from a previously published study in the literature or predicted using automated covariance-based methods.

To establish "gold-standard" data for training and testing, we first removed all seed alignments with only predicted secondary structures, retaining the 151 families with secondary structures from the literature. For each of these families, we then projected the consensus family structure onto every sequence in the alignment, and retained only the sequence/structure pair with the lowest combined proportion of missing nucleotides and non-au/cg/gu base pairs. The end result was a set of 151 independent examples, each taken from a different RNA family.


Grammar   Generative   Discriminative   Difference
G1        0.0392       0.2713           +0.2321
G2        0.3640       0.5797           +0.2157
G3        0.4190       0.4159           −0.0031
G4        0.1361       0.1350           −0.0011
G5        0.0026       0.0031           +0.0005
G6        0.5446       0.5600           +0.0154
G7        0.5456       0.5582           +0.0126
G8        0.5464       0.5515           +0.0051
G6s       0.5501       0.5642           +0.0141

Table 4.1: Comparison of generative and discriminative model structure prediction accuracy. Each number in the table represents the area under the ROC curve of an MEA-based parser using the indicated model. As seen above, the discriminative model outperforms its generative counterpart for 7 of the 9 grammars.

           Generative                    Discriminative
           Viterbi        MEA            Viterbi        MEA
Grammar    Sens (Spec)    Sens (Spec)    Sens (Spec)    Sens (Spec)
G1         0.41 (0.27)    0.18 (0.11)    0.40 (0.28)    0.48 (0.33)
G2         0.53 (0.36)    0.53 (0.36)    0.63 (0.48)    0.67 (0.64)
G3         0.46 (0.48)    0.56 (0.51)    0.45 (0.46)    0.54 (0.53)
G4         0.21 (0.17)    0.33 (0.23)    0.21 (0.17)    0.34 (0.23)
G5         0.03 (0.04)    0.06 (0.04)    0.02 (0.03)    0.06 (0.04)
G6         0.60 (0.61)    0.62 (0.63)    0.61 (0.62)    0.62 (0.67)
G6s        0.60 (0.62)    0.62 (0.64)    0.62 (0.63)    0.65 (0.65)
G7         0.58 (0.63)    0.63 (0.63)    0.58 (0.62)    0.63 (0.67)
G8         0.58 (0.60)    0.63 (0.62)    0.58 (0.61)    0.65 (0.62)

Table 4.2: Comparison of MEA and Viterbi structure prediction accuracy. In each case, γ was adjusted for MEA parsing to allow a direct comparison with Viterbi, and the dominant parsing method is shown in bold. Finally, note that the results for MEA reflect only a single choice of γ rather than the entire ROC curve, so one should refer to Table 4.1 for a more reliable comparison of generative and discriminative MEA accuracy.


4.3.1 Comparison to generative training

In our first experiment, we took nine different grammar-based models (G1-G8, G6s) from a recent study by Dowell and Eddy (2004) on the performance of simple SCFGs for RNA secondary structure prediction. For each grammar, we took the original SCFG and constructed an equivalent CLLM. We then applied a two-fold cross-validation procedure to compare the performance of SCFG (generative) and CLLM (discriminative) parameter learning.

In particular, we partitioned the 151 selected sequence/structure pairs randomly into two approximately equal-sized "folds." For any given setting of the MEA trade-off parameter γ, we used parameters trained on sequences from one fold⁴ to perform predictions for all sequences from the other fold. For each tested example, we computed sensitivity and specificity (PPV),⁵ defined as

    sensitivity = (number of correct base pairings) / (number of true base pairings)    (4.25)
    specificity = (number of correct base pairings) / (number of predicted base pairings).    (4.26)
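As an illustration, both quantities can be computed directly from sets of true and predicted base pairs; the helper below is hypothetical and not part of the evaluation scripts used in the thesis.

def pair_accuracy(true_pairs, pred_pairs):
    """Sensitivity and specificity (PPV) of predicted base pairs,
    per equations (4.25) and (4.26). Both arguments are sets of
    (i, j) tuples with i < j."""
    correct = len(true_pairs & pred_pairs)
    sensitivity = correct / len(true_pairs) if true_pairs else 0.0
    specificity = correct / len(pred_pairs) if pred_pairs else 0.0
    return sensitivity, specificity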

By repeating this cross-validation procedure for values of γ ∈ {2^k : −5 ≤ k ≤ 10}, we obtained a receiver operating characteristic (ROC) curve for each grammar. We report the estimated area under each curve (see Table 4.1). In 7 out of 9 grammars, the CLLM outperforms its SCFG counterpart.

Using a similar cross-validation protocol, we also found that MEA parsing outperforms the Viterbi algorithm on average for both the generatively and discriminatively learned models. In particular, when an algorithm A achieves better sensitivity and specificity than algorithm B, we say that A dominates B. For 7 out of 9 generatively-trained grammars and 9 out of 9 discriminatively-trained grammars, we found a γ for which the MEA parsing algorithm dominates the Viterbi algorithm (see Table 4.2).

⁴ To determine smoothing parameters (for SCFGs) or regularization constants (for CLLMs), we used conditional log-likelihood on a holdout set taken from the training data as an estimate of the generalization ability of the learned model, and found the optimal setting of the desired parameter using a golden section search (Press et al., 1992).

⁵ We considered only au, cg, and gu base pairs since many of the energy-based folders cannot predict other types of base pairings as a consequence of the nearest neighbor model.


4.3.2 Comparison to other methods

Next, we compared the performance of CONTRAfold with a number of leading probabilistic and free energy minimization methods. In particular, we benchmarked Mfold v3.2 (Zuker, 2003), ViennaRNA v1.6 (Hofacker et al., 1994), PKNOTS v1.05 (Rivas and Eddy, 1999),⁶ Pfold v3.2 (Knudsen and Hein, 2003), and ILM (Ruan et al., 2004), using default parameters for each program.⁷ Whenever a program returned multiple possible structures (e.g., Mfold), we scored only the structure with minimum predicted free energy.

Unlike the other programs in our comparison, CONTRAfold's use of the maximum expected accuracy algorithm for parsing allows it to optimize for either higher sensitivity or higher specificity via the constant γ. In Figure 4.6, we varied the choice of γ for the parsing algorithm so as to allow CONTRAfold to achieve many different trade-offs between sensitivity and specificity; some of these trade-offs allow for unambiguous comparisons between CONTRAfold and existing methods.

As shown in Tables 4.3 and 4.4, CONTRAfold outperforms existing probabilistic and energy-based structure prediction methods, without relying on the thousands of experimentally measured parameters common among free energy minimization techniques. For γ = 6 in particular, CONTRAfold achieves statistically significant improvements of over 4% in sensitivity and 6% in specificity relative to the best current method, Mfold. This demonstrates not only the quality of the underlying model but also the effectiveness of the parsing mechanism for providing a sensitivity/specificity trade-off.

4.3.3 Feature assessment

To understand the importance of various features to the CONTRAfold model, we performed an ablation analysis in which we removed various sets of features from the model and assessed the change in total ROC area for the MEA parser. As seen in Table 4.5, the

⁶ Because of the large size of some of the sequences in our dataset, we disabled pseudoknot prediction for PKNOTS.

⁷ Note that while all tools listed support single sequence RNA secondary structure prediction, not all were designed specifically for single sequence prediction. Pfold, for instance, was developed in the context of multiple sequence structure prediction; similarly, ILM and PKNOTS were developed for prediction of RNA structures with pseudoknots, and so might fare better on sequences where pseudoknot interactions play a more important role.


[Figure 4.6 (ROC plot): sensitivity (y-axis, roughly 0.3 to 0.7) vs. specificity (x-axis, roughly 0.3 to 0.8) for Mfold, ViennaRNA, Pfold, ILM, PKNOTS, and CONTRAfold; the CONTRAfold curve is marked at γ = 0.75 and γ = 6.]

Figure 4.6: ROC plot comparing sensitivity and specificity for several RNA structure prediction methods. CONTRAfold performance was measured at several different settings of the γ parameter, which controls the tradeoff between the sensitivity and specificity of the prediction algorithm. As shown, CONTRAfold achieves the highest sensitivity at each level of specificity.

Method                   Sensitivity   Specificity   Time (s)
CONTRAfold (γ = 6)       0.7377        0.6686        224
Mfold                    0.6943        0.6063        62
ViennaRNA                0.6877        0.5922        8
PKNOTS                   0.6030        0.5269        460
ILM                      0.5330        0.4098        22
CONTRAfold (γ = 0.75)    0.5540        0.7920        224
Pfold                    0.4906        0.7535        22

Table 4.3: Accuracies of leading secondary structure prediction methods.


             Sensitivity                 Specificity
Method       +     −     p-value         +     −     p-value
Mfold        34    69    0.00081         51    77    0.0271
ViennaRNA    30    72    4.9 × 10⁻⁵      44    82    0.00098
PKNOTS       17    94    5.5 × 10⁻¹³     26    104   1.5 × 10⁻¹¹
ILM          20    101   3.6 × 10⁻¹³     12    126   6.8 × 10⁻²²
Pfold        38    72    0.0017          41    64    0.0318

Table 4.4: Performance of CONTRAfold relative to leading secondary structure prediction methods. Mfold, ViennaRNA, PKNOTS, and ILM were compared to CONTRAfold (γ = 6); Pfold was compared to CONTRAfold (γ = 0.75). The numbers in the +/− columns indicate the number of times the method achieved higher (+) or lower (−) sensitivity/specificity than CONTRAfold. p-values were calculated using the sign test.

performance of CONTRAfold degrades as features are removed from the model.

Interestingly, even the weakest model from Table 4.5, which includes only features for hairpin, bulge, internal, and multi-branch loops (without accounting for internal loop asymmetry), helix closing base pairs, and helix base pairs, achieves a respectable ROC area of 0.6003. In fact, this crippled version of CONTRAfold, which does not even account for helix stacking interactions, manages to obtain sensitivity and specificity values of 0.7006 and 0.6193, respectively, an accuracy statistically indistinguishable from that of Mfold.

4.3.4 Learned versus measured parameters

In many respects, the general techniques employed by CLLMs are reminiscent of many previously described algorithms. For instance, the inside-outside algorithms inspired by SCFGs bear close relation to McCaskill (1990)'s procedure for computing base-pairing probabilities via the partition function. Indeed, one may be tempted to draw direct analogies between the parameters of energy-based models and the parameters learned by the CLLM (appropriately scaled by −RT, the negated product of the universal gas constant and absolute temperature).

As shown in Figure 4.7, in some cases one can find a good correlation between parameters learned by CONTRAfold and those measured experimentally. Differences between learned parameters and measured values, however, are not necessarily diagnostic of errors in the laboratory measurements. Roughly speaking, the parameters learned by CLLMs


Variant                                   ROC area   Decrease
CONTRAfold                                0.6433     n/a
(without single base stacking)            0.6416     0.0017
(without helix lengths)                   0.6370     0.0063
(without terminal mismatch penalties)     0.6362     0.0071
(without full internal loop table)        0.6336     0.0097
(without helix stacking)                  0.6276     0.0157
(without outer)                           0.6271     0.0162
(without internal loop asymmetry)         0.6134     0.0299
(without all of the above)                0.6003     0.0430

Table 4.5: Ablation analysis of the CONTRAfold model. A large decrease in ROC area suggests that the corresponding removed features play an important role in RNA secondary structure. However, the reverse is not true: small decreases in accuracy (such as seen for single base stacking) may simply mean that CONTRAfold was less effective in leveraging that feature for prediction.

(a) Learned

5′ → 3′
 a X
 u Y
3′ ← 5′

       Y:   a       c       g       u
X = a     0.48    0.38    0.34   −1.24
X = c     0.27    0.33   −1.74    0.34
X = g     0.34   −1.63    0.27   −0.74
X = u    −1.26    0.32   −0.89    0.32

(b) Experimental

5′ → 3′
 a X
 u Y
3′ ← 5′

       Y:   a       c       g       u
X = a      .       .       .     −0.90
X = c      .       .     −2.20     .
X = g      .     −2.10     .     −0.60
X = u    −1.10     .     −1.40     .

Figure 4.7: Comparison of learned and experimentally measured stacking energies. (a) A portion of the helix stacking parameters learned by CONTRAfold, scaled by −RT at T = 310.15 K = 37 °C. (b) A portion of the helix stacking energies from the Turner 3.0 energy rules (Mathews et al., 1999), as taken from the Mfold package (Zuker, 2003).


reflect the degree of enrichment of their corresponding features in training set secondary structures. Therefore, parameters which do not appear often in training set structures will have smaller parameter values, regardless of their actual energetic contribution to real RNA structures. Additionally, Gaussian prior regularization (see footnote to Section 4.2.2) reduces the magnitude of less confident parameters to prevent overfitting. Finally, CLLM learning compensates for dependencies between parameters so as to maximize the overall conditional likelihood of the training set; thus, the value learned for one parameter will depend greatly on the other parameters in the model.

4.4 Discussion

In this chapter, we presented CONTRAfold, a new RNA secondary structure prediction method based on conditional log-linear models (CLLMs). Like previous structure prediction methods based on probabilistic models, CONTRAfold relies on statistical learning techniques to optimize model parameters according to a training set. Unlike its predecessors, however, CONTRAfold uses a discriminative training objective and flexible feature representations in order to achieve accuracies exceeding those of the current best physics-based structure predictors.

As a modeling framework for RNA secondary structure prediction, CLLMs provide many advantages over physics-based models and previous probabilistic approaches, ranging from ease of parameter estimation to the ability to incorporate arbitrary features. It is only natural, then, to suspect that these advantages will carry over to related problems as well. For instance, most current methods for multiple sequence RNA secondary structure prediction either take a purely probabilistic approach or attempt to combine physics-based scoring with covariation information in an ad hoc way. In contrast, the CLLM methodology provides a principled framework for combining the rich feature sets of physics-based methods with the predictive power of sequence covariation.

To date, SCFGs and their extensions provide the foundation for many standard computational techniques for RNA analysis, ranging from modeling of specific RNA families to noncoding RNA detection to RNA structural alignment. In each of these cases, CLLMs


may provide principled alternatives to SCFGs which can take advantage of complex features of the input data when making predictions. Extending the CLLM methodology to these cases provides an exciting avenue for future research.


Chapter 5

Efficient multiple hyperparameter learning for log-linear models

In problems where input features have varying amounts of noise, using distinct regularization hyperparameters for different features provides an effective means of managing model complexity. While regularizers for neural networks and support vector machines often rely on multiple hyperparameters, regularizers for structured prediction models (used in tasks such as sequence labeling or parsing) typically rely on only a single shared hyperparameter for all features. In this chapter, we consider the problem of choosing regularization hyperparameters for log-linear models, a class of structured prediction probabilistic models which includes conditional random fields (CRFs). Using an implicit differentiation trick, we derive an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters. In both simulations and the real-world task of computational RNA secondary structure prediction, we demonstrate that using multiple hyperparameters associated with different groups of features can provide a significant boost in accuracy compared to using only a single regularization hyperparameter. Specifically, when feature components vary in their strength of correlation with the correct output labels, using separate regularization hyperparameters permits the model to rely on those features which are more reliable while being more cautious with respect to those features that are more noisy.


5.1 Introduction

In many supervised learning methods, overfitting is controlled through the use of regularization penalties for limiting model complexity. The effectiveness of penalty-based regularization for a given learning task depends not only on the type of regularization penalty used (e.g., L1 vs. L2) (Ng, 2004) but also (and perhaps even more importantly) on the choice of hyperparameters governing the regularization penalty (e.g., the hyperparameter λ in an isotropic Gaussian parameter prior, λ‖w‖²).

When only a single hyperparameter must be tuned, cross-validation provides a simple yet reliable procedure for hyperparameter selection. For example, the regularization hyperparameter C in a support vector machine (SVM) is usually tuned by training the SVM with several different values of C, and selecting the one that achieves the best performance on a holdout set. In many situations, using multiple hyperparameters gives the distinct advantage of allowing models with features of varying strength; for instance, in a natural language processing (NLP) task, features based on word bigrams are typically noisier than those based on individual word occurrences, and hence should be "more regularized" to prevent overfitting. Unfortunately, for sophisticated models with multiple hyperparameters (MacKay and Takeuchi, 1998), the naïve grid search strategy of directly trying out possible combinations of hyperparameter settings quickly grows infeasible as the number of hyperparameters becomes large.

Scalable strategies for cross-validation-based hyperparameter learning that rely on computing the gradient of cross-validation loss with respect to the desired hyperparameters arose first in the neural network modeling community (Larsen et al., 1996a,b; Andersen et al., 1997; Goutte and Larsen, 1998). More recently, similar cross-validation optimization techniques have been proposed for other supervised learning models (Bengio, 2000), including support vector machines (Chapelle et al., 2002; Glasmachers and Igel, 2005; Keerthi et al., 2007), Gaussian processes (Sundararajan and Keerthi, 2001; Seeger, 2007), and related kernel learning methods (Kobayashi and Nakano, 2004; Kobayashi et al., 2005; Zhang and Lee, 2007). Here, we consider the problem of hyperparameter learning for a specialized class of structured classification models known as conditional log-linear models (CLLMs), a generalization of conditional random fields (CRFs) (Lafferty et al., 2001).


Whereas standard binary classification involves mapping an object x ∈ X to some binary output y ∈ Y (where Y = {±1}), the input space X and output space Y in a structured classification task generally contain complex combinatorial objects (such as sequences, trees, or matchings). Designing hyperparameter learning algorithms for structured classification models thus yields a number of unique computational challenges not normally encountered in the flat classification setting. In this chapter, we derive a gradient-based approach for optimizing the hyperparameters of a CLLM using the loss incurred on a holdout set. We describe the required algorithms specific to CLLMs which make the needed computations tractable. Finally, we demonstrate on both simulations and a real-world computational biology task that our hyperparameter learning method can give gains over learning flat unstructured regularization priors.

5.2 Preliminaries

Conditional log-linear models (CLLMs) are a probabilistic framework for sequence labeling or parsing problems, where X is an exponentially large space of possible input sequences and Y is an exponentially large space of candidate label sequences or parse trees. Let F : X × Y → R^n be a fixed vector-valued mapping from input-output pairs to an n-dimensional feature space. CLLMs model the conditional probability of y given x as

    P(y | x; w) = exp(wᵀF(x, y)) / Z(x),   where Z(x) = ∑_{y′ ∈ Y} exp(wᵀF(x, y′)).

Given a training set T = {(x^(i), y^(i))}_{i=1}^m of i.i.d. labeled input-output pairs drawn from some unknown fixed distribution D over X × Y, the parameter learning problem is typically posed as a statistical model estimation procedure, known as maximum a posteriori (MAP) estimation (or equivalently, regularized logloss minimization). More specifically, one wishes to find the parameters w that maximize the logarithm of the product of the probability of the dataset under the model (i.e., the data likelihood) and a prior probability over the parameters. In this chapter, we focus on the specific case where the data likelihood is taken to be the conditional probability of the outputs y^(i) given their corresponding inputs x^(i), and


where the parameter prior has the form of a zero-mean multivariate Gaussian,

    p(w) ∝ exp( −½ wᵀCw )    (5.1)

where C is a positive definite matrix representing the inverse covariance matrix of the Gaussian. Omitting additive constants, the MAP estimation problem can be written as

    w⋆ = arg min_{w ∈ R^n} ( ½ wᵀCw − ∑_{i=1}^m log P(y^(i) | x^(i); w) ).    (OPT1)

Informally speaking, the presence of the parameter prior term ensures that the learned parameters w do not get too big, thus preventing overfitting of the model.
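Since OPT1 is smooth and strictly convex, any standard quasi-Newton solver applies. The Python sketch below assumes hypothetical callables for the training logloss and its gradient (which in a CLLM would be computed by dynamic programming); it is illustrative only, not the implementation used in this work.

import numpy as np
from scipy.optimize import minimize

def solve_opt1(n, c_diag, train_logloss, train_logloss_grad):
    """Sketch of MAP estimation (OPT1) for a diagonal regularizer C.
    train_logloss(w) computes -sum_i log P(y^(i) | x^(i); w) and
    train_logloss_grad(w) its gradient; c_diag is the diagonal of C."""
    objective = lambda w: 0.5 * np.dot(c_diag * w, w) + train_logloss(w)
    gradient = lambda w: c_diag * w + train_logloss_grad(w)
    result = minimize(objective, np.zeros(n), jac=gradient, method="L-BFGS-B")
    return result.x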

While a number of efficient procedures exist for solving the optimization problem OPT1 (Sha and Pereira, 2003; Globerson et al., 2007), little attention is usually given to choosing an appropriate regularization matrix C. Generally, C is parameterized using a small number of free variables, d ∈ R^k, known as the hyperparameters of the model. Given a holdout set H = {(x̃^(i), ỹ^(i))}_{i=1}^{m̃} of i.i.d. examples drawn from D, hyperparameter learning itself can be cast as an optimization problem:

    minimize_{d ∈ R^k}  −∑_{i=1}^{m̃} log P( ỹ^(i) | x̃^(i); w⋆(C) ).    (OPT2)

In words, OPT2 finds the hyperparameters d whose regularization matrix C leads the parameter vector w⋆(C) learned from the training set to obtain small logloss on holdout data. For many real-world applications, C is assumed to take a simple form, such as a scaled identity matrix, CI. While this parameterization may be partially motivated by concerns of hyperparameter overfitting (Ng, 1997), such a choice usually stems from the difficulty of hyperparameter inference.

In practice, grid-search procedures provide a reliable method for determining hyperparameters to low precision: one trains the model using several candidate values of C (e.g., C ∈ {…, 2⁻², 2⁻¹, 2⁰, 2¹, 2², …}), and chooses the C that minimizes holdout logloss. While this strategy is suitable for tuning a single model hyperparameter, more sophisticated strategies are necessary when optimizing multiple hyperparameters.


5.3 Learning multiple hyperparameters

In this section, we lay the framework for multiple hyperparameter learning by describing a simple yet flexible parameterization of C that arises quite naturally in many practical problems. We then describe a generic strategy for hyperparameter adaptation via gradient-based optimization.

Consider a setting in which predefined subsets of parameter components (which we call regularization groups) are constrained to use the same hyperparameters (Do et al., 2006a). For instance, in an NLP task, individual word occurrence features may be placed in a separate regularization group from word bigram features. Formally, let k be a fixed number of regularization groups, and let π : {1, …, n} → {1, …, k} be a prespecified mapping from parameters to regularization groups. Furthermore, for a vector x ∈ R^k, define its expansion x̄ ∈ R^n as x̄ = (x_{π(1)}, x_{π(2)}, …, x_{π(n)}).

In the sequel, we parameterize C ∈ R^{n×n} in terms of a hyperparameter vector d ∈ R^k as the diagonal matrix C(d) = diag(exp(d̄)).¹ Under this representation, C(d) is necessarily positive definite, since the diagonal elements are strictly positive and there are no off-diagonal elements. Thus, OPT2 can be written as an unconstrained minimization over the variables d ∈ R^k. Specifically, let ℓT(w) = −∑_{i=1}^m log P(y^(i) | x^(i); w) denote the training logloss and ℓH(w) = −∑_{i=1}^{m̃} log P(ỹ^(i) | x̃^(i); w) the holdout logloss for a parameter vector w. Omitting the dependence of C on d for notational convenience, we have the optimization problem

    minimize_{d ∈ R^k}  ℓH(w⋆)   subject to   w⋆ = arg min_{w ∈ R^n} ( ½ wᵀCw + ℓT(w) ).    (OPT2′)
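The grouped diagonal parameterization is straightforward to realize in code. The sketch below is illustrative only (0-based indices and NumPy, with the group map `pi` playing the role of π).

import numpy as np

def regularizer_diagonal(d, pi):
    """Grouped parameterization C(d) = diag(exp(d_bar)): d has length k,
    and pi maps each of the n parameter indices to its group (0-based
    here, versus the 1-based pi in the text). Returns the diagonal of
    C(d); the full matrix is never needed explicitly."""
    return np.exp(d[pi])   # expansion d_bar followed by elementwise exp

# Example: n = 5 parameters in k = 2 groups.
d = np.array([0.0, np.log(10.0)])
pi = np.array([0, 0, 1, 1, 0])
print(regularizer_diagonal(d, pi))   # [ 1.  1. 10. 10.  1.]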

For any fixed setting of these hyperparameters, the objective function of OPT2’ can be

¹ Here, we use the notation that exp(·) is an operation that, when applied to a vector v ∈ R^n, returns a vector of the same dimensionality such that

    exp( (v₁, …, vₙ) ) = ( exp(v₁), …, exp(vₙ) ).    (5.2)

Also, diag(·) is an operation that, when applied to a vector v ∈ R^n, returns an n × n matrix whose diagonal elements consist of the components of v, and whose other entries are all zero.


evaluated by (1) using the hyperparameters d to determine the regularization matrix C, (2) solving OPT1 using C to determine w⋆, and (3) computing the holdout logloss using the parameters w⋆. In the next section, we derive a method for computing the gradient of the objective function of OPT2′ with respect to the hyperparameters. Given both procedures for function and gradient evaluation, we may apply standard gradient-based optimization (e.g., conjugate gradient or L-BFGS (Nocedal and Wright, 1999)) in order to find a local optimum of the objective. In general, we observe that only a few iterations (∼5) are usually sufficient to determine reasonable hyperparameters to low accuracy.
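For concreteness, a minimal sketch of this outer loop follows. Both callables are hypothetical placeholders, not names from the thesis code: one evaluates the holdout logloss (solving OPT1 internally for the given d), and the other returns the hyperparameter gradient computed by the method derived in Section 5.4.

import numpy as np
from scipy.optimize import minimize

def learn_hyperparameters(d0, holdout_logloss, holdout_logloss_grad):
    """Sketch of the outer optimization over hyperparameters d.
    holdout_logloss(d) solves OPT1 for hyperparameters d and returns
    l_H(w*); holdout_logloss_grad(d) returns its gradient. A handful
    of L-BFGS iterations typically suffices."""
    result = minimize(holdout_logloss, d0, jac=holdout_logloss_grad,
                      method="L-BFGS-B", options={"maxiter": 5})
    return result.x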

5.4 The hyperparameter gradient

Note that the optimization objective ℓH(w⋆) is a function of w⋆. In turn, w⋆ is a function of the hyperparameters d, as implicitly defined by the gradient stationarity condition Cw⋆ + ∇_w ℓT(w⋆) = 0. To compute the hyperparameter gradient, we will use both of these facts.

5.4.1 Deriving the hyperparameter gradient

First, we apply the chain rule to the objective function of OPT2′ to obtain

    ∇_d ℓH(w⋆) = J_dᵀ ∇_w ℓH(w⋆)    (5.3)

where J_d is the n × k Jacobian matrix whose (i, j)th entry is ∂w⋆_i/∂d_j. The term ∇_w ℓH(w⋆) is simply the gradient of the holdout logloss evaluated at w⋆. For decomposable models, this may be computed exactly via dynamic programming (e.g., the forward/backward algorithm for chain-structured models or the inside/outside algorithm for grammar-based models).

Next, we show how to compute the Jacobian matrix J_d. Recall that at the optimum of the smooth unconstrained optimization problem OPT1, the partial derivative of the objective with respect to any parameter must vanish. In particular, the partial derivative of


½ wᵀCw + ℓT(w) with respect to w_i vanishes when w = w⋆, so

    0 = C_iᵀ w⋆ + (∂/∂w_i) ℓT(w⋆),    (5.4)

where C_iᵀ denotes the ith row of the C matrix. Since (5.4) uniquely defines w⋆ (as OPT1 is a strictly convex optimization problem), we can use implicit differentiation to obtain the needed partial derivatives. Specifically, we can differentiate both sides of (5.4) with respect to d_j to obtain

    0 = ∑_{p=1}^n ( w⋆_p · (∂/∂d_j) C_{ip} + C_{ip} · (∂/∂d_j) w⋆_p )
        + ∑_{p=1}^n (∂²/∂w_p ∂w_i) ℓT(w⋆) · (∂/∂d_j) w⋆_p    (5.5)

      = I{π(i) = j} · w⋆_i · exp(d_j)
        + ∑_{p=1}^n ( C_{ip} + (∂²/∂w_p ∂w_i) ℓT(w⋆) ) · (∂/∂d_j) w⋆_p.    (5.6)

Stacking (5.6) for all i ∈ {1, …, n} and j ∈ {1, …, k}, we obtain the equivalent matrix equation,

    0 = B + (C + ∇²_w ℓT(w⋆)) J_d    (5.7)

where B is the n × k matrix whose (i, j)th element is I{π(i) = j} · w⋆_i · exp(d_j), and ∇²_w ℓT(w⋆) is the Hessian of the training logloss evaluated at w⋆. Finally, solving these equations for J_d, we obtain

    J_d = −(C + ∇²_w ℓT(w⋆))⁻¹ B.    (5.8)

5.4.2 Computing the hyperparameter gradient efficiently

In principle, one could simply use (5.8) to obtain the Jacobian matrix J_d directly. However, computing the n × n matrix (C + ∇²_w ℓT(w⋆))⁻¹ is difficult. Computing the Hessian matrix ∇²_w ℓT(w⋆) in a typical CLLM requires approximately n times the cost of a single logloss gradient evaluation. Once the Hessian has been computed, typical matrix inversion routines take O(n³) time. Even more problematic, the Ω(n²) memory usage for storing the Hessian is prohibitive, as typical log-linear models (e.g., in NLP) may have


Algorithm 1: Gradient computation for hyperparameter selection.

Input:  training set T = {(x^(i), y^(i))}_{i=1}^m, holdout set H = {(x̃^(i), ỹ^(i))}_{i=1}^{m̃},
        current hyperparameters d ∈ R^k
Output: hyperparameter gradient ∇_d ℓH(w⋆)

1. Compute the solution w⋆ of OPT1 using the regularization matrix C = diag(exp(d̄)).

2. Form the matrix B ∈ R^{n×k} such that (B)_{ij} = I{π(i) = j} · w⋆_i · exp(d_j).

3. Use the conjugate gradient algorithm to solve the linear system

       (C + ∇²_w ℓT(w⋆)) x = ∇_w ℓH(w⋆).

4. Return −Bᵀx.

Figure 5.1: Pseudocode for gradient computation.
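A compact Python rendering of Algorithm 1 follows. It assumes hypothetical `grad_T` and `grad_H` callables for the training and holdout logloss gradients (in a CLLM these would be dynamic programming routines), forms the Hessian-vector product by the finite-differencing approximation (5.9) discussed below, and uses SciPy's conjugate gradient solver in place of a hand-written CG routine; it is a sketch, not the thesis implementation.

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def hyperparameter_gradient(d, pi, w_star, grad_T, grad_H, eps=1e-6):
    """Sketch of Algorithm 1. pi maps each of the n parameters to one
    of k regularization groups (0-based); w_star solves OPT1 for the
    current hyperparameters d."""
    n, k = len(w_star), len(d)
    c_diag = np.exp(d[pi])               # diagonal of C = diag(exp(d_bar))
    g_T = grad_T(w_star)

    def matvec(v):
        # (C + Hessian) v, with the Hessian-vector product computed
        # by the finite-differencing approximation (5.9).
        return c_diag * v + (grad_T(w_star + eps * v) - g_T) / eps

    A = LinearOperator((n, n), matvec=matvec, dtype=float)
    x, _ = cg(A, grad_H(w_star), maxiter=100)   # step 3

    # Steps 2 and 4: B has one nonzero per row, so -B^T x reduces to
    # grouped sums over b_diag[i] = w*_i exp(d_{pi(i)}).
    b_diag = w_star * c_diag
    return np.array([-np.sum(b_diag[pi == j] * x[pi == j]) for j in range(k)])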

thousands or even millions of features. To deal with these problems, we first explain why (C + ∇²_w ℓT(w⋆))v can be computed in O(n) time for any arbitrary vector v ∈ R^n, even though forming (C + ∇²_w ℓT(w⋆))⁻¹ is expensive. Using this result, we then describe an efficient procedure for computing the holdout hyperparameter gradient which avoids the expensive Hessian computation and inversion steps of the direct method.

First, since C is diagonal, the product of C with any arbitrary vector v is trivially computable in O(n) time. Second, although direct computation of the Hessian is inefficient in a generic log-linear model, computing the product of the Hessian with v can be done quickly, using any of the following techniques, listed in order of increasing implementation effort (and numerical precision):

1. Finite differencing. Use the following numerical approximation:

       ∇²_w ℓT(w⋆) · v = lim_{r→0} ( ∇_w ℓT(w⋆ + r·v) − ∇_w ℓT(w⋆) ) / r.    (5.9)

2. Complex step derivative (Martins et al., 2003). Use the following identity from


complex analysis:

       ∇²_w ℓT(w⋆) · v = lim_{r→0} ℑ{ ∇_w ℓT(w⋆ + i·r·v) } / r,    (5.10)

   where ℑ{·} denotes the imaginary part of its complex argument (in this case, a vector). Because there is no subtraction in the numerator of the right-hand expression, the complex-step derivative does not suffer from the numerical problems of the finite-differencing method that result from cancellation. As a consequence, much smaller step sizes can be used, allowing for greater accuracy (see the code sketch following this list).

3. Analytical computation. Given an existing O(n) algorithm for computing gradients analytically, define the differential operator

       R_v{f(w)} = lim_{r→0} ( f(w + r·v) − f(w) ) / r = (∂/∂r) f(w + r·v) |_{r=0},    (5.11)

   for which one can verify that R_v{∇_w ℓT(w⋆)} = ∇²_w ℓT(w⋆) · v. By applying standard rules for differential operators, R_v{∇_w ℓT(w⋆)} can be computed recursively using a modified version of the original gradient computation routine; see Pearlmutter (1994) for details.
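Of the three techniques, the complex-step identity (5.10) is perhaps the least familiar; the sketch below illustrates it on a toy gradient. It assumes, importantly, that the gradient routine tolerates complex-valued inputs (i.e., that the underlying code is written with complex-safe operations); this is an assumption for illustration, not a property of any particular CRF package.

import numpy as np

def hessian_vector_product_cs(grad, w, v, r=1e-20):
    """Complex-step Hessian-vector product, per equation (5.10).
    Because no subtraction occurs, r can be taken extremely small
    without cancellation error."""
    return np.imag(grad(w + 1j * r * v)) / r

# Toy check with f(w) = sum(w^4), so grad(w) = 4 w^3 and the Hessian
# is diag(12 w^2).
grad = lambda w: 4.0 * w ** 3
w = np.array([1.0, 2.0])
v = np.array([1.0, 0.0])
print(hessian_vector_product_cs(grad, w, v))   # approx [12., 0.]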

Hessian-vector products for graphical models were previously used in the context of step-size adaptation for stochastic gradient descent (Vishwanathan et al., 2006). In our experiments, we found that the simplest method, finite differencing, provided sufficient accuracy for our application. We do point out, however, that all of the above methods, including the analytical computation approach, share the same O(n) time guarantees.

Given the above procedure for computing matrix-vector products, we can now use the conjugate gradient (CG) method to solve the matrix equation (5.7) to obtain J_d. Unlike direct methods for solving linear systems Ax = b, CG is an iterative method which relies on the matrix A only through matrix-vector products Av. In practice, few steps of the CG algorithm are generally needed to find an approximate solution of a linear system with acceptable accuracy. Using CG in this way amounts to solving k linear systems, one for each column of the J_d matrix. Unlike the direct method of forming the (C + ∇²_w ℓT(w⋆)) matrix and its inverse, solving the linear systems avoids the expensive Ω(n²) cost of Hessian


computation and matrix inversion.

Nevertheless, even this approach for computing the Jacobian matrix still requires the solution of multiple linear systems, which scales poorly when the number of hyperparameters k is large. However, we can do much better by reorganizing the computations in such a way that the Jacobian matrix J_d is never explicitly required. In particular, substituting (5.8) into (5.3),

    ∇_d ℓH(w⋆) = −Bᵀ (C + ∇²_w ℓT(w⋆))⁻¹ ∇_w ℓH(w⋆),    (5.12)

we observe that it suffices to solve the single linear system

    (C + ∇²_w ℓT(w⋆)) x = ∇_w ℓH(w⋆)    (5.13)

and then form ∇_d ℓH(w⋆) = −Bᵀx. By organizing the computations this way, the number of linear systems that must be solved is substantially reduced from k to only one. A similar trick was previously used for hyperparameter adaptation in SVMs (Keerthi et al., 2007) and kernel logistic regression (Seeger, 2007). Figure 5.1 shows a summary of our algorithm for hyperparameter gradient computation. In practice, roughly 50-100 iterations of CG were sufficient to obtain hyperparameter gradients, meaning that the cost of running Algorithm 1 was approximately the same as the cost of solving OPT1 for a single fixed setting of the hyperparameters. Roughly 3-5 line searches were sufficient to identify good hyperparameter settings; assuming that each line search takes 2-4 times the cost of solving OPT1, the overall hyperparameter learning procedure takes approximately 20 times the cost of solving OPT1 once.

5.5 Experiments

To test the effectiveness of our hyperparameter learning algorithm, we applied it to two tasks: a simulated sequence labeling task involving noisy features, and a real-world application of conditional log-linear models to the biological problem of RNA secondary structure prediction.


[Figure 5.2 (graphical model): a linear chain of hidden states y₁ → y₂ → ⋯ → y_L; each yᵢ emits "observed features" xᵢʲ for j ∈ {1, …, R} and "noise features" xᵢʲ for j ∈ {R+1, …, 40}.]

Figure 5.2: State diagram of the HMM used in the simulations.

Sequence labeling simulation

For our simulation test, we constructed a simple linear-chain hidden Markov model (HMM) with binary-valued hidden nodes, yᵢ ∈ {0, 1}. For our HMM, we set the initial state probabilities to 0.5 each, and used self-transition probabilities of 0.6. We associated 40 binary-valued features xᵢʲ, j ∈ {1, …, 40}, with each hidden state yᵢ, including R "relevant" observed features whose values were chosen based on yᵢ, and (40 − R) "irrelevant" noise features whose values were chosen to be either 0 or 1 with equal probability, independent of yᵢ. Specifically, we drew each relevant xᵢʲ independently according to P(xᵢʲ = v | yᵢ = v) = 0.6, v ∈ {0, 1}. Figure 5.2 shows the graphical model representing the HMM. For each run, we used the HMM to simulate training, holdout, and testing sets of M, 10, and 1000 sequences, respectively, each of length 10.
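A sketch of this generative process is shown below (in Python, with NumPy's random generator); the function name and defaults are illustrative, not taken from the thesis code.

import numpy as np

def simulate_hmm(num_seqs, length=10, R=5, num_features=40, rng=None):
    """Simulate the binary linear-chain HMM described above: 0.5 initial
    probabilities, 0.6 self-transitions, R relevant features agreeing
    with the hidden state with probability 0.6, and (40 - R) coin-flip
    noise features. Returns a list of (features, labels) pairs."""
    rng = rng or np.random.default_rng(0)
    data = []
    for _ in range(num_seqs):
        y = np.empty(length, dtype=int)
        y[0] = rng.integers(2)
        for t in range(1, length):
            # Stay in the same state with probability 0.6.
            y[t] = y[t - 1] if rng.random() < 0.6 else 1 - y[t - 1]
        x = rng.integers(2, size=(length, num_features))   # noise by default
        agree = rng.random((length, R)) < 0.6              # relevant features
        x[:, :R] = np.where(agree, y[:, None], 1 - y[:, None])
        data.append((x, y))
    return data

train = simulate_hmm(num_seqs=10)   # e.g., M = 10 training sequences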

Next, we constructed a CRF based on an HMM model similar to that shown in Figure 5.2, in which potentials were included for the initial node y₁, between each yᵢ and yᵢ₊₁, and between yᵢ and each xᵢʲ (including both the observed features and the noise features). We then performed gradient-based hyperparameter learning using three different parameter-tying schemes: (a) all hyperparameters constrained to be equal, (b) separate hyperparameter groups for each parameter of the model, and (c) transitions, observed features, and noise features each grouped together. Figures 5.3 and 5.4 show the performance of the CRF for each of the three parameter-tying gradient-based optimization schemes, as well as the performance of scheme (a) when using the standard grid-search strategy of trying regularization matrices CI for C ∈ {…, 2⁻², 2⁻¹, 2⁰, 2¹, 2², …}.

As seen in Figures 5.3 and 5.4, the gradient-based procedure performed either as well as


[Figure 5.3 (plot): proportion of incorrect labels (y-axis, roughly 0.1 to 0.5) vs. number of relevant features R (x-axis, 0 to 40), with curves for the grid, single, separate, and grouped hyperparameter strategies.]

Figure 5.3: HMM testing set performance when varying the number of relevant features, R, using M = 10. Each point represents an average over 100 independent runs of HMM training/holdout/testing set generation and CRF training and hyperparameter optimization.

or better than a grid search for single hyperparameter models. Using either a single hyperparameter or all separate hyperparameters generally gave similar results, with a slight tendency for the separate hyperparameter model to overfit. Enforcing regularization groups, however, gave consistently lower error rates, achieving an absolute reduction in generalization error over the next-best model of 6.7%, corresponding to a relative reduction of 16.2%.

RNA secondary structure prediction

We also applied our framework to the problem of RNA secondary structure prediction. Ribonucleic acid (RNA) molecules are long nucleic acid polymers present in the cells of all living organisms. For many types of RNA, three-dimensional (or tertiary) structure plays an important role in determining the RNA's function. Here, we focus on the task of predicting RNA secondary structure, i.e., the pattern of nucleotide base pairings which form the two-dimensional scaffold upon which RNA tertiary structures assemble (see Figure 5.5a).

As a starting point, we used our CONTRAfold program, as described in Chapter 4.


[Figure 5.4 (plot): proportion of incorrect labels (y-axis, roughly 0.3 to 0.55) vs. training set size M (x-axis, 0 to 80), with curves for the grid, single, separate, and grouped hyperparameter strategies.]

Figure 5.4: HMM testing set performance when varying the number of training examples M, using R = 5. Each point represents an average over 100 independent runs of HMM training/holdout/testing set generation and CRF training and hyperparameter optimization.

To review, CONTRAfold addresses the problem of modeling RNA secondary structures using a CLLM whose features are chosen to closely match the energetic terms found in standard physics-based models of RNA structure (e.g., hairpin loops, bulge loops, interior loops, etc.). Training the CONTRAfold model involves a regularized log-likelihood maximization, and to control overfitting, CONTRAfold uses flat L2 regularization. Here, we modified the implementation described in Chapter 4 to perform an "outer" optimization loop based on our algorithm, and chose regularization groups either by (a) enforcing a single hyperparameter group, (b) using separate groups for each parameter, or (c) grouping according to the type of each feature (e.g., all features for describing hairpin loop lengths were placed in a single regularization group).

For testing, we replicated the testing procedure employed in Chapter 4, which involved two-fold cross-validation on 151 RNA sequences from the Rfam database (Griffiths-Jones et al., 2005). Despite the small size of the training set (75 sequences per fold), the hyperparameters learned on each fold were nonetheless qualitatively similar, indicating the robustness of the procedure (see Table 5.1). As expected, features with small regularization hyperparameters correspond to properties of RNAs which are known to contribute strongly


[Figure 5.5 (two panels): (a) a schematic mapping an RNA sequence (5′-uccguagaaggc-3′) to its secondary structure; (b) a sensitivity/specificity plot comparing Mfold, ViennaRNA, PKNOTS, ILM, and Pfold against CONTRAfold trained with a single hyperparameter (AUC = 0.6169, logloss = 5916), separate hyperparameters (AUC = 0.6383, logloss = 5763), and grouped hyperparameters (AUC = 0.6406, logloss = 5531).]

Figure 5.5: RNA secondary structure prediction. (a) An illustration of the secondary structure prediction task. (b) Performance comparison with state-of-the-art methods when using either a single hyperparameter (the "original" CONTRAfold), separate hyperparameters, or grouped hyperparameters.

to the energetics of RNA secondary structure, whereas many of the features with larger regularization hyperparameters indicate structural properties whose presence/absence are either less correlated with RNA secondary structure or sufficiently noisy that their parameters are difficult to determine reliably from the training data.

We then compared the cross-validated performance of our algorithm with state-of-the-art methods (see Figure 5.5b).² Using separate or grouped hyperparameters both gave increased sensitivity and increased specificity compared to the original model, which was

² Following Do et al. (2006b), we used the maximum expected accuracy algorithm for decoding, which returns a set of candidate parses reflecting different trade-offs between sensitivity (proportion of true base-pairs called) and specificity (proportion of called base-pairs which are correct).


                                           exp(d_i)
Regularization group                       fold A     fold B
hairpin loop lengths                       0.0832     0.456
helix closing base pairs                   0.780      0.0947
symmetric internal loop lengths            6.32       0.0151
external loop lengths                      0.338      0.401
bulge loop lengths                         0.451      2.03
base pairings                              2.01       7.95
internal loop asymmetry                    4.24       6.90
explicit internal loop sizes               12.8       6.39
terminal mismatch interactions             132        50.2
single base pair stacking interactions     71.0       104
1 × 1 internal loop nucleotides            139        120
single base bulge nucleotides              136        130
internal loop lengths                      1990       35.3
multi-branch loop lengths                  359        2750
helix stacking interactions                12100      729

Table 5.1: Grouped hyperparameters learned using our algorithm for each of the two cross-validation folds.

learned using a single regularization hyperparameter. Overall, the testing logloss (summed over the two folds) decreased by roughly 6.5% when using grouped hyperparameters and 2.6% when using multiple separate hyperparameters, while the estimated testing ROC area increased by roughly 3.8% and 3.4%, respectively.

5.6 Discussion and related work

In this work, we presented a gradient-based approach for hyperparameter learning based on minimizing logloss on a holdout set. While the use of cross-validation loss as a proxy for generalization error is fairly natural, in many other supervised learning methods besides log-linear models, other objective functions have been proposed for hyperparameter optimization. In SVMs, approaches based on optimizing generalization bounds (Chapelle et al., 2002), such as the radius/margin bound (Keerthi, 2002) or the maximal discrepancy criterion (Anguita et al., 2003), have been proposed. Comparable generalization bounds are


not generally known for CRFs; even in SVMs, however, generalization bound-based methods empirically do not outperform simpler methods based on optimizing five-fold cross-validation error (Duan et al., 2003).

A different method for dealing with hyperparameters, common in neural network modeling, is the Bayesian approach of treating hyperparameters themselves as parameters in the model to be estimated. In an ideal Bayesian scheme, one does not perform hyperparameter or parameter inference, but rather integrates over all possible hyperparameters and parameters in order to obtain a posterior distribution over predicted outputs given the training data. This integration can be performed using a hybrid Monte Carlo strategy (Neal, 1996; Williams and Barber, 1998). For the types of large-scale log-linear models we consider in this chapter, however, the computational expense of sampling-based strategies can be extremely high due to the slow convergence of MCMC techniques (Murray and Ghahramani, 2004).

Empirical Bayesian (i.e., ML-II) strategies, such as Automatic Relevance Determination (ARD) (MacKay, 1992), take the intermediate approach of integrating over parameters to obtain the marginal likelihood (known as the log evidence), which is then optimized with respect to the hyperparameters. Computing marginal likelihoods, however, can be quite costly, especially for log-linear models. One method for doing this involves approximating the parameter posterior distribution as a Gaussian centered at the posterior mode (MacKay, 1992; Welling and Parise, 2006). In this strategy, however, the "Occam factor" used for hyperparameter optimization still requires a Hessian computation, which does not scale well for log-linear models. An alternate approach based on using a modification of expectation propagation (EP) (Minka, 2001) was applied in the context of Bayesian CRFs (Qi et al., 2005) and later extended to graph-based semi-supervised learning (Kapoor et al., 2006). As described, however, inference in these models relies on non-traditional "probit-style" potentials for efficiency reasons, and known algorithms for inference in Bayesian CRFs are limited to graphical models with fixed structure.

In contrast, our approach works broadly for a variety of log-linear models, including the grammar-based models common in computational biology and natural language processing. Furthermore, our algorithm is simple and efficient, both conceptually and in practice: one iteratively optimizes the parameters of a log-linear model using a fixed setting of the


hyperparameters, and then one changes the hyperparameters based on the holdout logloss gradient. The gradient computation relies primarily on a simple conjugate gradient solver for linear systems, coupled with the ability to compute Hessian-vector products (straightforward in any modern programming language that allows for operator overloading). As we demonstrated in the context of RNA secondary structure prediction, gradient-based hyperparameter learning is a practical and effective method for tuning hyperparameters when applied to large-scale log-linear models.

Finally, we note that for neural networks, Eigenmann and Nossek (1999) and Chen and Hagan (1999) proposed techniques for simultaneous optimization of hyperparameters and parameters; these results suggest that similar procedures for faster hyperparameter learning, ones that do not require a doubly-nested optimization, may be possible.


Chapter 6

A max-margin model for efficient simultaneous alignment and folding of RNA sequences

The need for accurate and efficient tools for computational RNA structure analysis has become increasingly apparent over the last several years: RNA folding algorithms underlie numerous applications in bioinformatics, ranging from microarray probe selection to de novo noncoding RNA gene prediction.

In this chapter, we present RAF (RNA Alignment and Folding), an efficient algorithm for simultaneous alignment and consensus folding of unaligned RNA sequences. Algorithmically, RAF exploits sparsity in the set of likely pairing and alignment candidates for each nucleotide (as identified by the CONTRAfold or CONTRAlign programs) to achieve an effectively quadratic running time for simultaneous pairwise alignment and folding. RAF's fast sparse dynamic programming, in turn, serves as the inference engine within a discriminative machine learning algorithm for parameter estimation.

In cross-validated benchmark tests, RAF achieves accuracies equaling or surpassing the current best approaches for RNA multiple sequence secondary structure prediction. However, RAF requires nearly an order of magnitude less time than other simultaneous folding and alignment methods, thus making it especially appropriate for high-throughput studies.


6.1 Introduction

The secondary structure adopted by an RNA molecule in vivo is a vital consideration in many bioinformatics analyses. In PCR primer design, stable secondary structures can obstruct proper binding of the primer to DNA (Dieffenbach et al., 1993); in RNA folding pathway studies, secondary structure forms the basic scaffold on which more complicated three-dimensional structures organize (Brion and Westhof, 1997); and in computational noncoding RNA gene prediction, RNA secondary structural stability provides the characteristic signal for distinguishing real RNA sequence from nonfunctional transcripts (Eddy, 2002).

To date, the most powerful non-experimental methods for determining RNA secondary structure rely primarily on position-specific patterns of nucleotide covariation in multiple homologous RNA sequences. Specifically, enrichment for complementarity in pairs of columns from an RNA multiple alignment, especially when primary sequence is not conserved, provides strong evidence for potential base-pairings in the RNA's in vivo structure. A primary limitation of covariation analysis, however, is the difficulty of obtaining reliable sequence alignments for divergent RNA families. This shortcoming is especially relevant in the detection of ncRNA genes, as secondary structural constraints often exist even when primary sequence conservation is lacking (Torarinsson et al., 2006).

In this chapter, we describe RAF, a new algorithm for predicting RNA secondary structure from a collection of unaligned homologous RNA sequences. Algorithmically, RAF belongs to a category of RNA secondary structure prediction methods which simultaneously align and fold RNA sequences. By optimizing a pair of unaligned RNA sequences for both sequence homology and structural conservation concurrently, simultaneous alignment and folding approaches sidestep the usual problem of needing accurate sequence alignments before the folding is done. By exploiting sparsity in the set of likely base-pairings and aligned nucleotides, RAF achieves O(L²) running time for sequences of length L, improving significantly upon the O(L⁴) running times of typical simultaneous folding and alignment approaches.
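As a rough illustration of the kind of sparsification involved (and not RAF's actual pruning code), one might threshold a posterior matrix, such as those shown in Figure 6.1, to obtain per-position candidate lists; the cutoff value below is purely hypothetical.

import numpy as np

def sparse_candidates(posterior, cutoff=0.01):
    """Prune a dense posterior matrix (pairing or alignment-match
    posteriors) to a sparse per-position candidate list. Entries below
    `cutoff` are discarded; in practice only a handful of candidates
    per position tend to survive, which is what makes sparse dynamic
    programming fast."""
    candidates = {}
    rows, cols = np.nonzero(posterior >= cutoff)
    for i, j in zip(rows, cols):
        candidates.setdefault(i, []).append((j, float(posterior[i, j])))
    return candidates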

The main contribution of RAF, however, is its application of discriminative machine learning techniques for parameter estimation to the problem of simultaneous alignment and


Figure 6.1: Sparsity patterns in posterior probability matrices. Panels (a) and (b) illustrate the pairwise pairing posterior probabilities for two different sequences (such as generated by a single-sequence probabilistic or partition-function–based RNA folding program). Panel (c) shows the alignment match probabilities for these sequences (such as generated by a probabilistic HMM). In each panel, the darkness of each square represents the posterior confidence in the corresponding base-pairing or alignment match. While the single sequence folder or the pairwise sequence aligner may not be able to identify the single correct folding or alignment, respectively, the set of likely candidate base-pairings and matched positions, nonetheless, is extremely sparse.


folding. Unlike previous methods, RAF’s scoring model does not rely on ad hoc combi-

nations of thermodynamic free energies for structural features (Mathews et al., 1999) with

arbitrary alignment match and gap penalties (Hofacker et al., 2002), nor does RAF attempt

the ambitious task of simultaneously modeling the evolutionary history of both sequences

and structure (Knudsen and Hein, 2003). Instead, RAF defines a fixed set of basis features

describing aspects of the alignment, RNA secondary structure, or both. RAF then poses the

task of learning weights for these features as a convex optimization problem, giving rise to

efficient algorithms with guaranteed convergence to optimality.

The concept of using discriminative methods for parameter estimation rather than re-

lying solely on parameters compiled from experimental measurements originated with the

CONTRAfold program (see Chapter 4), and later also became the basis of the CG (An-

dronescu et al., 2007) method. In a manner analogous to these two previous methods for

single sequence secondary structure prediction, RAF demonstrates that automatic learning

of parameters can also confer benefits to multiple sequence structure prediction accuracy.

6.2 Methods

The RAF algorithm consists of four components: (1) a simple yet flexible objective func-

tion for pairwise alignment and folding of unaligned RNA sequences; (2) a fast Sankoff-

style inference engine for maximizing this objective function via sparse dynamic program-

ming; (3) a simple progressive strategy for extending the pairwise algorithm to handle

multiple unaligned sequence inputs; and (4) a max-margin framework for automatically

learning model parameters from training data. We describe each of these in turn.

6.2.1 The RAF scoring model

We begin our description of the algorithm by describing a scoring scheme for alignments

and consensus foldings of two sequences. Let a and b be a pair of unaligned input RNA sequences. We refer to a candidate alignment and consensus secondary structure of a and b collectively as a parse. Formally, a parse y for a pair of sequences a and b is a set whose elements consist of base-pairings $(a_i, a_j)$ belonging to sequence a, base-pairings $(b_k, b_l)$


belonging to sequence b, and aligned positions $(a_i, b_k)$ between a and b. We say that a parse y of inputs a and b is valid provided that (1) each nucleotide of a and b base-pairs

with at most one other nucleotide in the same sequence; (2) each nucleotide aligns with at

most one nucleotide in the opposite sequence; (3) neither sequence contains pseudo-knotted

base-pairings; (4) the alignment of the two sequences does not contain rearrangements or

repeats; and (5) all base-pairings are conserved.

For a given parse y from the space of all valid parses $\mathcal{Y}$, RAF uses a simple scoring scheme which takes into account aligned positions and conserved base pairings. Specifically, RAF defines the score, $\mathrm{SCORE}(y; w)$, of such a parse y to be

\[
\sum_{(a_i, b_k) \in y} \psi_w^{\text{aligned}}(i, k) \;+\; \sum_{\langle (a_i, a_j), (b_k, b_l) \rangle \in B(y)} \psi_w^{\text{paired}}(i, j; k, l),
\]

where $\psi_w^{\text{aligned}}(i, k)$ and $\psi_w^{\text{paired}}(i, j; k, l)$ are scoring terms for aligned positions and conserved base pairs, respectively, and where $B(y)$ is the set of all conserved base-pairings. In

turn, RAF models each scoring term as a linear combination of arbitrary basis features (see

Appendix B.1):

\[
\psi_w^{\text{aligned}}(i, k) = \sum_{p=1}^{n_{\text{aligned}}} w_p \cdot \phi_p^{\text{aligned}}(i, k)
\]
\[
\psi_w^{\text{paired}}(i, j; k, l) = \sum_{q=1}^{n_{\text{paired}}} w_{q + n_{\text{aligned}}} \cdot \phi_q^{\text{paired}}(i, j; k, l),
\]

where $w \in \mathbb{R}^{n_{\text{aligned}} + n_{\text{paired}}} = \mathbb{R}^n$ is a vector of scoring parameters.
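To make the linear form of the scoring model concrete, the following is a minimal Python sketch of computing SCORE(y; w), assuming parses are stored as plain sets and basis features are supplied as callables; the names (score_parse, aligned_features, paired_features) are illustrative, not RAF's actual interface.

# A minimal sketch of the RAF scoring model, under the assumptions above.
from typing import Callable, List, Set, Tuple

AlignedFeature = Callable[[int, int], float]            # phi^aligned_p(i, k)
PairedFeature = Callable[[int, int, int, int], float]   # phi^paired_q(i, j; k, l)

def score_parse(
    aligned_pairs: Set[Tuple[int, int]],                # aligned positions (i, k)
    conserved_pairs: Set[Tuple[int, int, int, int]],    # conserved base-pairs (i, j, k, l)
    w: List[float],
    aligned_features: List[AlignedFeature],
    paired_features: List[PairedFeature],
) -> float:
    """Compute SCORE(y; w) as a linear combination of basis features."""
    n_aligned = len(aligned_features)
    total = 0.0
    # psi^aligned_w(i, k) = sum_p w_p * phi^aligned_p(i, k)
    for (i, k) in aligned_pairs:
        total += sum(w[p] * aligned_features[p](i, k) for p in range(n_aligned))
    # psi^paired_w(i, j; k, l) = sum_q w_{q + n_aligned} * phi^paired_q(i, j; k, l)
    for (i, j, k, l) in conserved_pairs:
        total += sum(
            w[q + n_aligned] * paired_features[q](i, j, k, l)
            for q in range(len(paired_features))
        )
    return total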

6.2.2 Fast pairwise alignment and folding

Given the scoring scheme described in the previous section, the problem of simultaneous

alignment and folding reduces to the optimization problem,

\[
y^* = \arg\max_{y \in \mathcal{Y}} \mathrm{SCORE}(y; w). \tag{6.1}
\]


In principle, the solution to (6.1) follows immediately from the original dynamic pro-

gramming algorithm for simultaneous alignment and folding presented by Sankoff (1985).

Sankoff's algorithm, however, has an $O(L^{3K})$ time complexity and $O(L^{2K})$ space complexity for K sequences of length L, rendering it impractical for all but the smallest mul-

tiple folding problems. Therefore, most programs for RNA simultaneous alignment and

folding use heuristics to reduce time and memory requirements while minimally compro-

mising alignment and structure prediction quality. Some heuristics used in previous pro-

grams have included incorporating structural information into a single alignment scoring

matrix (Dalli et al., 2006), disallowing multi-branch loops (Gorodkin et al., 1997), and

precomputing potential conserved helices prior to alignment (Touzet and Perriquet, 2004;

Tabei et al., 2006).

The most popular heuristics, however, involve reduction of the portion of the dynamic programming matrices (which we call the DP region) that must be computed. For example,

some methods restrict the DP region to a strip of fixed width about the diagonal (Mathews

and Turner, 2002; Hofacker et al., 2004) or about an initial alignment path (Kiryu et al.,

2007). Other methods rely on external single-sequence folding and probabilistic alignment

programs to generate base pairing probability matrices (Torarinsson et al., 2007; Will et al.,

2007) or alignment match posterior probability matrices (Kiryu et al., 2007), and then

exploit the sparsity of these matrices in order to reduce the amount of computation required.

The RAF algorithm adopts the last of these strategies. Namely, RAF uses a single-

sequence RNA secondary structure prediction program (CONTRAfold, see Chapter 4) and

a pairwise RNA sequence alignment program (CONTRAlign, see Chapter 3),¹ respectively,

to construct a constraint set C of allowed base-pairs and aligned positions in a and b. Given a constraint set C, RAF then replaces (6.1) with the reduced inference problem,

\[
y^* = \arg\max_{y \in \mathcal{Y}_C} \mathrm{SCORE}(y; w), \tag{6.2}
\]

where $\mathcal{Y}_C = \{y \in \mathcal{Y} : y \subseteq C\}$ is the space of valid parses, restricted to those which contain only base-pairings and alignment matches from the constraint set C (see Figure 6.1).

¹The original CONTRAlign program was designed for protein sequences. We adapted this for RNAs by removing all protein-specific features (e.g., hydrophobicity), modifying the underlying alphabet (A, C, G, U), and simply retraining on the appropriate training set.


To obtain the set of allowed base-pairings, RAF uses the implementation of McCaskill’s

algorithm (McCaskill, 1990) from CONTRAfold in order to compute the posterior proba-

bility of each possible base-pairing in sequence a, and similarly for sequence b. All base-

pairs with posterior probability at least $\epsilon_{\text{paired}}$ are then retained. Similarly, to determine the

set of allowed aligned positions, RAF retains those matches whose posterior probability,

according to a version of the CONTRAlign program adapted for RNAs, is at least $\epsilon_{\text{aligned}}$.

If these cutoffs $\epsilon_{\text{aligned}}$ and $\epsilon_{\text{paired}}$ are chosen to be too low, then the reduction of the dy-

namic programming space achieved for $\mathcal{Y}_C$ will not be significant. Conversely, a higher

cutoff could also degrade performance by excluding portions of the DP matrix which actu-

ally correspond to the true parse of the input sequences. A similar approach for pruning the

space of candidate alignments and folds via fold and alignment envelopes was implemented

in the Stemloc (Holmes, 2005) program. A number of other programs exploit either base-

pairing sparsity (Will et al., 2007; Torarinsson et al., 2007) or alignment sparsity (Kiryu

et al., 2007; Dowell and Eddy, 2006; Harmanci et al., 2007) separately.
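As an illustration of this pruning step, here is a small sketch that thresholds precomputed posterior matrices (e.g., as produced by CONTRAfold and CONTRAlign); the function name and NumPy-array representation are assumptions of this example, and the default cutoffs follow the values reported later in Section 6.3.

# A sketch of the constraint-set construction by posterior thresholding.
import numpy as np

def build_constraint_set(paired_post_a, paired_post_b, aligned_post,
                         eps_paired=0.002, eps_aligned=0.01):
    """Threshold posterior matrices to obtain allowed base-pairs and matches."""
    # Candidate base-pairs (i, j) in sequence a with posterior >= eps_paired.
    pairs_a = set(zip(*np.where(paired_post_a >= eps_paired)))
    # Candidate base-pairs (k, l) in sequence b.
    pairs_b = set(zip(*np.where(paired_post_b >= eps_paired)))
    # Candidate aligned positions (i, k) with posterior >= eps_aligned.
    matches = set(zip(*np.where(aligned_post >= eps_aligned)))
    return pairs_a, pairs_b, matches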

Assuming O(c) and O(d) bounds on the number of candidate base-pairing and align-

ment partners, respectively, per position of both sequences, we show that the time com-

plexity of the RAF algorithm scales quadratically in the length of the sequences, while the space complexity scales linearly (see Appendix B.2). A comparison table of asymptotic

time and space complexity of a number of modern RNA simultaneous folding and align-

ment approaches is shown in Table 6.1. In practice, we find that RAF’s scaling reflects the

theoretical bounds, achieving running times often an order of magnitude faster than current simultaneous alignment and folding methods.²

6.2.3 Extension to multiple alignment

Using the RAF pairwise alignment subroutine, we can also address the problem of aligning

two alignments. Let S and T be two sets of sequences that we wish to align; furthermore,

²We note that the method described here bears some relation to the "candidate list" algorithm of Wexler et al. (2007), which maintains sparse lists of potential bifurcation points for single sequence folding. By showing that the number of relevant bifurcation points has a negligible dependence on sequence length, the authors provide an effectively quadratic time algorithm for single sequence folding. Here, our algorithm also relies on sparsity of bifurcation point candidates when dealing with pairwise alignment and folding, but unlike in the previous algorithm, the candidates are provided explicitly via the constraint set C.


Algorithm    Time complexity                      Space complexity
Sankoff      $O(L^6)$                             $O(L^4)$
FOLDALIGN    $O(L^4)$                             $O(L^4)$
LocARNA      $O(c^2 L^4)$                         $O(c^2 L^2)$
Murlet       $O(d^2 L^2 + d^3 L^3 / \kappa^6)$    $O(d^2 L^2)$
RAF          $O(\min(c, d) \cdot c d^2 L^2)$      $O(\min(c, d) \cdot c d L)$

Table 6.1: Comparison of computational complexity of RNA simultaneous folding and alignment algorithms. Here, L denotes the sequence length, c is the number of candidate base-pairs per position, d is the number of candidate alignment matches per position, and κ is the minimum allowed distance between adjacent helices.

we denote their corresponding alignments as A and B.

To align a pair of alignments, we first define new basis features $\{\phi_p^{\text{aligned}}(i, k)\}_{p=1}^{n_{\text{aligned}}}$ and $\{\phi_q^{\text{paired}}(i, j; k, l)\}_{q=1}^{n_{\text{paired}}}$ to simply be the average over all pairs of sequences $s \in S$ and $t \in T$ of the basis features for aligning s and t, remapped to the coordinates of the alignments A and B. Second, we define the new constraint set C for aligning the two alignments to be the union over all pairs of sequences $s \in S$ and $t \in T$ of the constraint sets for each pair,

again remapped to the alignment coordinates. Finally, using these new features and our

new constraint set, we simply call the existing RAF subroutine for fast pairwise alignment

and folding.

Using this new subroutine for aligning alignments, we can then perform multiple align-

ment in RAF using a standard progressive strategy (Feng and Doolittle, 1987). Specifically,

we cluster the sequences with a UPGMA (Sneath and Sokal, 1962) tree-building procedure,

using the expected accuracy similarity measure (Do et al., 2005). Finally, we perform pro-

gressive alignment by aligning subgroups of sequences according to the tree.
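The following sketch illustrates the feature-averaging step for aligning alignments, under the assumption that a helper col_to_pos maps alignment columns back to per-sequence positions (returning None at gaps); both the helper and the treatment of gapped pairs in the average are illustrative assumptions, not RAF's actual implementation.

# A sketch of forming an averaged basis feature for two alignments S and T.
def averaged_aligned_feature(pairwise_feature, S, T, col_to_pos):
    """Return phi'(i, k): the average of a pairwise feature over all (s, t) in
    S x T, remapped from alignment columns (i, k) to sequence coordinates."""
    def phi_prime(i, k):
        total, count = 0.0, 0
        for s in S:
            for t in T:
                pos_s = col_to_pos(s, i)   # position of column i in sequence s
                pos_t = col_to_pos(t, k)   # position of column k in sequence t
                if pos_s is not None and pos_t is not None:
                    total += pairwise_feature(s, t, pos_s, pos_t)
                count += 1                 # average over all |S| x |T| pairs
        return total / count if count else 0.0
    return phi_prime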

6.2.4 A max-margin framework

Given a set of training examples $S = \{(a^{(i)}, b^{(i)}, y^{(i)})\}_{i=1}^{m}$, the parameter estimation problem is the task of identifying a vector of weights $w = (w_1, w_2, \ldots, w_n) \in \mathbb{R}^n$ for which the

RAF inference algorithm, as described in the previous section, will yield accurate align-

ments and consensus structures. In this section, we present a max-margin framework for

parameter estimation in RAF.


Formulation

In the max-margin framework, our goal is to obtain a parameter vector w for which running

the RAF inference algorithm will generate accurate alignments and consensus structures.

Clearly, this goal is met if for each training example $(a^{(i)}, b^{(i)}, y^{(i)})$ from our training set S,³

\[
\mathrm{SCORE}(y^{(i)}; w) > \mathrm{SCORE}(y'; w), \quad \forall y' \in \mathcal{Y}_C^{(i)} \setminus \{y^{(i)}\}. \tag{6.3}
\]

In such a case, we would be guaranteed that the maximum of (6.2) is attained for $y^* = y^{(i)}$ (provided the true parse $y^{(i)}$ belongs to $\mathcal{Y}_C^{(i)}$), and hence our inference procedure would

necessarily return the correct alignment and consensus folding. This intuition is captured

in the following convex optimization problem:

\[
\begin{array}{ll}
\underset{w \in \mathbb{R}^n,\, \xi \in \mathbb{R}^m}{\text{minimize}} & \frac{1}{2} C \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \xi_i \\
\text{subject to} & \mathrm{SCORE}(y^{(i)}; w) - \mathrm{SCORE}(y'; w) \ge \Delta(y^{(i)}, y') - \xi_i, \\
& \quad i = 1, \ldots, m, \quad y' \in \mathcal{Y}_C^{(i)} \cup \{y^{(i)}\}.
\end{array} \tag{6.4}
\]

Here, C is a regularization constant, and $\Delta(y^{(i)}, y')$ is a non-negative distance measure between pairs of parses, conventionally referred to as the loss function, which takes value 0

if and only if its two arguments are equal (see Section 6.2.4).

The inequality constraints play the role of (6.3): they try to ensure that the training output $y^{(i)}$ scores higher than any alternative incorrect parse $y'$ by some positive amount $\Delta(y^{(i)}, y')$. In cases where this condition is not achieved, the objective function incurs a penalty of $\xi_i$. Finally, the regularization term $\frac{1}{2} C \|w\|^2$ is a penalty used to prevent overfitting.⁴

³Note that our notation hides the dependencies of the SCORE function on each of the input sequences $a^{(i)}$ and $b^{(i)}$, and similarly for the unconstrained and constrained space of parses, $\mathcal{Y}^{(i)}$ and $\mathcal{Y}_C^{(i)}$.
⁴By default, we used C = 1. We found that when running the online Pegasos optimization algorithm (see Section 6.2.4) for a fixed number of iterations, the resulting generalization performance for RAF is relatively insensitive to the value of C used, provided that C is not too large.


The loss function

The loss function $\Delta(y^{(i)}, y')$ in (6.4) plays two significant roles. Technically, the loss function establishes an appropriate scale for the parameters of the problem and prevents the trivial solution, w = 0. Intuitively, however, the loss function also helps to make the max-margin optimization robust. By choosing a loss function that takes large positive values for incorrect candidate outputs $y'$ that differ from the true output $y^{(i)}$ in a very critical way, but that takes small positive values for incorrect candidate outputs $y'$ whose errors are more forgivable, the loss function allows the user to implement a notion of "cost" for different

types of mistakes in the max-margin model.

For RAF, we defined the loss function by restricting our attention to four types of pars-

ing errors: (1) false positive base-pairings ($(a_i, a_j) \in y' \setminus y^{(i)}$, or similarly in sequence b), (2) false negative base-pairings ($(a_i, a_j) \in y^{(i)} \setminus y'$, or similarly in sequence b), (3) false positive aligned matches ($(a_i, b_k) \in y' \setminus y^{(i)}$), and (4) false negative aligned matches ($(a_i, b_k) \in y^{(i)} \setminus y'$). Then, we set

\[
\begin{aligned}
\Delta(y^{(i)}, y') ={}& \gamma_{\text{FP paired}} \cdot (\text{\# of false positive base-pairings}) \\
&+ \gamma_{\text{FN paired}} \cdot (\text{\# of false negative base-pairings}) \\
&+ \gamma_{\text{FP aligned}} \cdot (\text{\# of false positive aligned matches}) \\
&+ \gamma_{\text{FN aligned}} \cdot (\text{\# of false negative aligned matches}).
\end{aligned}
\]

The numbers $\gamma_{\text{FN paired}}$, $\gamma_{\text{FP paired}}$, $\gamma_{\text{FN aligned}}$, and $\gamma_{\text{FP aligned}}$ are hyperparameters, chosen by

the user prior to training the RAF algorithm, which allow the user to express her preference

for models with either high sensitivity or high specificity for base-pairing positions and

aligned nucleotides.⁵

⁵By default, we used $\gamma_{\text{FN paired}} = 10$ and $\gamma_{\text{FP paired}} = \gamma_{\text{FN aligned}} = \gamma_{\text{FP aligned}} = 1$ in order to emphasize prediction of correct base-pairings.
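A direct set-based sketch of this loss, using the default hyperparameter settings from the footnote above; the representation of a parse as separate sets of base-pairings and aligned matches is an assumption of this example.

# A small sketch of the loss Delta(y_true, y_pred) as a weighted error count.
def raf_loss(true_pairs, pred_pairs, true_matches, pred_matches,
             g_fp_paired=1.0, g_fn_paired=10.0,
             g_fp_aligned=1.0, g_fn_aligned=1.0):
    """Weighted count of the four types of parsing errors."""
    return (g_fp_paired * len(pred_pairs - true_pairs)        # false positive pairs
            + g_fn_paired * len(true_pairs - pred_pairs)      # false negative pairs
            + g_fp_aligned * len(pred_matches - true_matches) # false positive matches
            + g_fn_aligned * len(true_matches - pred_matches))# false negative matches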


Optimization algorithm

For structured max-margin models, a few standard approaches exist for performing opti-

mization: (1) constraint generation/cutting-plane methods, (2) exponentiated gradient al-

gorithms, and (3) subgradient optimization techniques. We provide a brief discussion of

each of these three generic schemes below, followed by a description of a specific variant

of the third technique that we use in our implementation of RAF.

At first glance, the constrained optimization problem stated in (6.4) appears to be a

standard convex quadratic program and hence solvable using off-the-shelf packages for convex programming. In reality, for each training example, the optimization problem has

an exponential number of inequalities, one corresponding to each possible candidate parse

y′ of the input sequences! Despite our use of constraint sets to reduce the set of allowed

candidate outputs, in most cases, this space is still too large to enumerate. Algorithms

specially-designed to deal with this include:

1. Constraint generation/cutting plane methods: In this class of methods, the princi-

pal intuition is that although the original optimization problem contains an exponen-

tially large constraint set, in practice, one can always find an approximate solution of

the original problem by solving a reduced optimization problem with a small (poly-

nomial) number of constraints. The constraint generation algorithm, in particular, is

an iterative method based on this strategy that works by incrementally adding con-

straints to some initial small constraint set until the desired level of approximation is

achieved. In each iteration, the current reduced optimization problem is solved using

either a black-box or special-purpose quadratic programming solver.

Though constraint generation algorithms can often be quite effective, in practice, they

are sometimes cumbersome to implement, as a quadratic program must be solved

in each stage of the method. We return to a discussion of this type of approach

in Chapter 7, where we describe a refinement of it based on cutting-plane/bundle

methods. Constraint generation algorithms are used in the CG (Andronescu et al.,

2007) program for RNA parameter estimation.

2. Exponentiated gradient algorithms: These methods deal with the exponential num-

ber of constraints from structured max-margin methods by considering a special


transformation of the original problem known as its (Lagrange) dual (Boyd and Van-

denberghe, 2004). While a full discussion of Lagrange duality and its implications

is beyond the scope of this text, the essential property of the Lagrange dual exploited in exponentiated gradient algorithms is that the variables of the Lagrange dual prob-

lem correspond to constraints in the original problem; as a result, the exponential

number of constraints in the original implies an exponential number of variables in the

Lagrange dual.

To make this dual formulation tractable, exponentiated gradient algorithms do two

things. First, they rely on the fact that for many problems, the dual variables can be

factorized in much the same way that certain probability distributions can be compactly represented using graphical models. Second, they use a specific type of coordinate descent

rule for updating the compact factorization of the dual variables in each step. The end

result, then, is an optimization algorithm for the Lagrange dual optimization problem, whose solution can then be transformed into a solution of the original max-margin

problem. For the max-margin problem described in this chapter, however, the size of

the dual factorization, though polynomial, is still prohibitively large. Thus, exponentiated

gradient algorithms are less applicable in this case. In the literature, exponentiated

gradient algorithms for structured models were first suggested in Bartlett et al. (2004)

and later improved in Collins et al. (2007) (see also Globerson et al. (2007)).

3. Subgradient optimization: This final class of methods relies on the fact that the

original max-margin constrained optimization problem has an equivalent reformula-

tion as an unconstrained optimization problem: namely, minimize (with respect to

$w \in \mathbb{R}^n$) the objective function $f(w)$, defined as

\[
\frac{1}{m} \sum_{i=1}^{m} \left[ \max_{y' \in \mathcal{Y}_C^{(i)} \cup \{y^{(i)}\}} \left( \mathrm{SCORE}(y'; w) + \Delta(y^{(i)}, y') \right) - \mathrm{SCORE}(y^{(i)}; w) \right] + \frac{1}{2} C \|w\|^2. \tag{6.5}
\]

Here, the equivalence follows from the fact that each of the constraints in (6.4) can


be written in the form,

\[
\xi_i \ge \Delta(y^{(i)}, y') + \mathrm{SCORE}(y'; w) - \mathrm{SCORE}(y^{(i)}; w). \tag{6.6}
\]

Since the objective function attempts to minimize $\xi_i$, the smallest value of $\xi_i$

satisfying all these constraints will always be

\[
\xi_i = \max_{y' \in \mathcal{Y}_C^{(i)} \cup \{y^{(i)}\}} \left[ \Delta(y^{(i)}, y') + \mathrm{SCORE}(y'; w) - \mathrm{SCORE}(y^{(i)}; w) \right]. \tag{6.7}
\]

Substituting this into the objective function from (6.4) gives (6.5).

Given the above unconstrained optimization problem, subgradient optimization al-

gorithms take the direct approach of using a gradient-descent–like procedure to find

the optimal solution. The simplest subgradient optimization algorithm relies on the

iteration,

\[
w_{t+1} \leftarrow w_t - \eta_t \cdot g_t, \tag{6.8}
\]

starting from $w_1 = 0$. Here, $g_t \in \partial f(w_t)$ is any subgradient⁶ of the objective function $f(w)$ evaluated at $w = w_t$, and $\eta_t$ is a step size picked according to some predefined scheme. In each step, the parameter vector w is moved in the direction of the

negative subgradient of the objective function. Subgradient optimization algorithms

are often quite popular for their ease of implementation and efficiency.

Recently, Shalev-Shwartz et al. (2007) introduced a highly efficient subgradient opti-

mization algorithm for training support vector machines (SVMs) known as PEGASOS. In

the PEGASOS algorithm, the parameter iterates are determined by the following modified

“projected” subgradient iteration,

\[
w_{t+1} \leftarrow \Pi_B\!\left[ w_t - \frac{1}{Ct} \cdot g_t \right]. \tag{6.9}
\]

⁶Here, the subdifferential $\partial f(x)$ of a convex function $f : \mathbb{R}^n \to \mathbb{R}$ at x is the set of all vectors g such that $f(y) \ge f(x) + g^T(y - x)$ for all $y \in \mathbb{R}^n$; elements belonging to the subdifferential are known as subgradients. Intuitively, the subdifferential provides a generalization of the concept of gradients to the case of non-differentiable, but still convex, functions.


Here, B represents a closed, bounded convex set to which the optimum solution of the optimization problem belongs (namely, the Euclidean ball of radius $1/\sqrt{C}$ centered at the origin). The operator $\Pi_B[\cdot]$ projects vectors in $\mathbb{R}^n$ onto B (i.e., $\Pi_B[v] = \arg\min_{u \in B} \|v - u\|_2$ for any $v \in \mathbb{R}^n$). The term $1/(Ct)$ represents the specific learning rate schedule used in the PEGASOS algorithm. Finally, $g_t$ denotes a subgradient of the objective function being op-

timized. Intuitively, the algorithm works much like a standard gradient descent procedure

adapted for non-differentiable objective functions, but with the added twist that the projec-

tion operation ensures that the weight vector iterates stay within a region of the parameter

space where the optimum is known to exist. Each of the operations involved in the PEGA-

SOS algorithm is extremely easy to implement, so the computational cost of the algorithm

is essentially the same as that of standard subgradient approaches.

Unlike regular subgradient algorithms, however, the PEGASOS algorithm enjoys sub-

stantially faster convergence rates, both in theory and practice. The algorithm itself derives

from pre-existing logarithmic regret algorithms originally designed in the context of online

convex programming (Hazan et al., 2007). As shown in Shalev-Shwartz et al. (2007),

a straightforward implementation of the algorithm requires only $O(1/(C\epsilon))$ iterations to achieve ǫ accuracy (where the $O(\cdot)$ notation hides logarithmic factors).⁷ Each iteration

takes O(m) time, as computing a subgradient $g_t$ of the objective function requires com-

puting subgradients for each individual training example, and averaging these together

(see the structure of the objective function in (6.5)). By modifying the algorithm to use

"stochastically estimated" subgradients (i.e., approximating $g_t$ using the subgradient for a single randomly chosen training example), the time per iteration is reduced to O(1). Remarkably, since the estimate of the subgradient is unbiased (though stochastic), it turns out that the convergence guarantees on the number of iterations needed to converge to

within ǫ error still hold in expectation (despite the fact that each iteration now takes only

O(1) time to compute).⁸ This last result is noteworthy in the sense that it implies that

the PEGASOS algorithm is always able to converge to good accuracy after performing $O(1/(C\epsilon))$ work, independent of the size of the training set!

Here, we adapt the PEGASOS algorithm, designed for SVM optimization, to the task

⁷By comparison, standard gradient descent analyses typically yield $O(1/\epsilon^2)$ rates (Nesterov, 2003).
⁸Alternatively, one can also prove that this convergence in expectation implies that the algorithm will converge quickly with high probability. See Chapter 7 for examples of this type of argument.


of training max-margin structured models. The essential steps in our derivation include

(1) providing a new derivation for the size of the appropriate Euclidean ball B to be used

in each projection step, and (2) describing how to compute subgradients of the objective

function. The convergence analysis of the resulting algorithm remains the same as in the

regular PEGASOS algorithm, so all the original convergence results still apply.

The derivation of the appropriate size for the Euclidean ball is rather technical, and

we leave it to Appendix B.3. To compute a subgradient $g_t \in \partial f(w_t)$, we first define an n-dimensional vector $\Phi(y)$ whose pth component is

\[
\Phi_p(y) =
\begin{cases}
\sum_{(a_i, b_k) \in y} \phi_p^{\text{aligned}}(i, k) & \text{if } 1 \le p \le n_{\text{aligned}} \\[1ex]
\sum_{\langle (a_i, a_j), (b_k, b_l) \rangle \in B(y)} \phi_{p - n_{\text{aligned}}}^{\text{paired}}(i, j; k, l) & \text{if } n_{\text{aligned}} + 1 \le p \le n,
\end{cases}
\]

from which it follows that $\mathrm{SCORE}(y; w) = w^T \Phi(y)$. We can apply the usual rules for computing subgradients (see, e.g., Bertsekas et al. (2003)) to obtain

\[
g_t = C w_t + \frac{1}{m} \sum_{i=1}^{m} \left( \Phi(y^{(i)}_*) - \Phi(y^{(i)}) \right), \tag{6.10}
\]

where $y^{(i)}_*$ is simply any $y'$ which attains the maximum in the ith term of the summation in (6.5), for $w = w_t$. Each "loss-augmented" maximization, in turn, is easily performed

by modifying the original RAF inference procedure to incorporate an appropriately defined

additional scoring matrix, $\phi_0(i, j; k, l)$, with fixed weight $w_0 = 1$. Alternatively, in the case

of “stochastically estimated” subgradients, the above formula is approximated as

\[
g_t \approx C w_t + \Phi(y^{(i)}_*) - \Phi(y^{(i)}), \tag{6.11}
\]

for some i chosen uniformly at random from $\{1, \ldots, m\}$.
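Putting the pieces together, the following is a condensed Python sketch of the resulting training loop, combining the projected update (6.9) with the stochastically estimated subgradient (6.11). Phi and loss_augmented_argmax are placeholders for RAF's feature map and modified inference procedure, and the default projection radius $1/\sqrt{C}$ is borrowed from the SVM setting rather than the structured-model bound derived in Appendix B.3.

# A sketch of max-margin training with stochastic subgradients.
import numpy as np

def train_max_margin(train_parses, Phi, loss_augmented_argmax,
                     n, C=1.0, T=10000, radius=None, seed=0):
    rng = np.random.default_rng(seed)
    if radius is None:
        radius = 1.0 / np.sqrt(C)     # ball assumed to contain the optimum
    w = np.zeros(n)
    for t in range(1, T + 1):
        y = train_parses[rng.integers(len(train_parses))]  # random training example
        y_star = loss_augmented_argmax(y, w)               # loss-augmented inference
        g = C * w + Phi(y_star) - Phi(y)                   # stochastic subgradient (6.11)
        w -= g / (C * t)                                   # learning rate 1/(Ct)
        norm = np.linalg.norm(w)
        if norm > radius:                                  # projection onto B, as in (6.9)
            w *= radius / norm
    return w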


6.3 Results

To evaluate the performance of RAF on real data, we collected training and testing data

from a variety of sources. In particular, for training, we obtained Rfam 8.1 (Griffiths-

Jones et al., 2005), a database of alignments and covariance models for RNA families along

with annotated secondary structures where available. For testing, we obtained BRAliBASE

II (Gardner et al., 2005), a benchmark set for RNA alignment programs. We also obtained a

testing set of RNA families used by the authors of the recent program, MASTR (Lindgreen

et al., 2007).

An important concern in the validation of RNA alignment programs is the confounding

factor that unless cross-validation is properly performed, the performance that one sees on

any given validation set is not likely to be a reliable judge of the program’s performance

on future data. Even in cases where the training and evaluation sets are disjoint but still

contain sequences from the same RNA family, evaluation can still give misleading results,

because the weights learned for loop lengths and composition will be biased toward specific

properties of that RNA family.

To be absolutely sure of no contamination between training and testing data, we prepro-

cessed our Rfam training set of alignments and consensus structures (October 2007 version,

607 families) by excluding all families for which either of the two testing databases con-

tained an example from that family. We then also removed all families for which only

automatically predicted consensus structures were known, leaving a total of 154 families. Finally, we generated a training set T1 of up to 10 randomly sampled pairwise alignments with consensus structures from each remaining family (1361 pairwise alignments in total), a training set T2 of up to 10 randomly sampled sequences with structures from each family (1179 sequences in total), and a training set T3 containing one randomly sampled five-way

multiple alignment from each family (118 multiple alignments in total).

RAF uses two external programs, CONTRAlign and CONTRAfold, to compute align-

ment match and base-pairing posterior probabilities, respectively. To ensure proper cross-

validation, CONTRAlign was retrained from scratch using T1, and CONTRAfold was retrained using T2. Finally, the RAF algorithm itself was trained using all pairwise projections of each multiple alignment of T3. Our strict cross-validation procedure significantly


[Figure 6.2: plot of proportion of reference recovered (y-axis, 0 to 1) against sparsity ratio (x-axis, 0 to 20), with separate curves for aligned matches and base pairs.]
Figure 6.2: Trade-off between sparsity factor and proportion of reference base-pairings or aligned matches covered when varying the cutoffs $\epsilon_{\text{paired}}$ and $\epsilon_{\text{aligned}}$. This graph was made using training set T3.

reduces both the size and coverage of the training sets used for CONTRAlign and CON-

TRAfold, and thus places RAF at a significant disadvantage in the comparisons shown

here. Nonetheless, as shown in the following sections, RAF performs well, indicating its

ability to generalize for sequences not present in the training set.

6.3.1 Alignment and base-pairing constraints

To observe the effects of different cutoffs $\epsilon_{\text{aligned}}$ and $\epsilon_{\text{paired}}$, we computed the proportions

of reference base-pairings and reference aligned matches recovered for varying cutoff con-

straints. In addition, we also computed the sparsity ratio (i.e., the maximum number of

pairing partners or matching partners for any nucleotide, averaged over the entire training

set) for each cutoff. A plot of these two values for training set T3 is shown in Figure 6.2. As

seen in the figure, nearly complete coverage of base-pairings and alignment matches can

be retained when each sparsity factor is roughly 10.⁹

⁹In practice, we found that using cutoffs of $\epsilon_{\text{aligned}} \approx 0.01$ and $\epsilon_{\text{paired}} \approx 0.002$ gave a good trade-off between speed and accuracy of our algorithm when using CONTRAlign and CONTRAfold; these cutoffs correspond roughly to average sparsity factors of approximately 10 each.


6.3.2 Evaluation metrics

To evaluate the quality of the resulting alignments, we used four different scoring measures:

1. the standard sum-of-pairs (SP) score (Thompson et al., 1999), which computes the

proportion of matches in a reference alignment which are present in the predicted

alignment,

2. sensitivity (Sens), the proportion of base-pairings in a reference parse which are re-

covered in the predicted parse,

3. specificity or positive predictive value (PPV), the proportion of base-pairings in a

predicted parse which are also present in the reference parse, and

4. the Matthews correlation coefficient (MCC) (Matthews, 1975), which we approxi-

mate as $\sqrt{\text{Sens} \cdot \text{PPV}}$, following Gorodkin et al. (2001).
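For concreteness, here is a small sketch of these measures computed from set representations of the reference and predicted parses; the set-based inputs (and nonempty sets) are assumptions of this example.

# A sketch of the scoring measures used in the evaluation.
import math

def evaluation_metrics(ref_pairs, pred_pairs, ref_matches, pred_matches):
    sp = len(ref_matches & pred_matches) / len(ref_matches)   # sum-of-pairs score
    sens = len(ref_pairs & pred_pairs) / len(ref_pairs)       # sensitivity
    ppv = len(ref_pairs & pred_pairs) / len(pred_pairs)       # positive predictive value
    mcc = math.sqrt(sens * ppv)                               # MCC approximation
    return sp, sens, ppv, mcc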

6.3.3 Comparison of accuracy

In our first accuracy assessment, we evaluated RAF as well as a number of other current

RNA secondary structure prediction programs using the BRAliBASE II dataset. In particu-

lar, the first dataset from BRAliBASE II contains collections of 100 five-sequence subalign-

ments, sampled from five specific Rfam families (5S rRNA, group II intron, SRP, tRNA,

and U5). For each of these alignments, we ran a number of current multiple-sequence RNA

secondary structure prediction programs, including Murlet v0.1.1 (Kiryu et al., 2007), Lo-

cARNA v1.2.2a (Will et al., 2007), and RNA Sampler v1.3 (Xu et al., 2007). Wherever

any of these programs required access to external pairing posterior probabilities, we used

ViennaRNA v1.7 (Hofacker et al., 1994). The results of the comparison are shown in Fig-

ure 6.3.

As seen from the table, on the BRAliBASE II benchmark, RAF attains comparable

accuracy to the other methods, achieving either the best or second-best overall accuracy

according to MCC on four out of the five datasets. The running time of the method, how-

ever, is dramatically faster than the other algorithms, often taking an order of magnitude

less time than many of the other programs.

We also obtained the dataset used in the benchmarking of the MASTR RNA secondary


Dataset          Program       Time (s)   SP     Sens   PPV    MCC
5S rRNA          Murlet        687        0.94   0.70   0.70   0.70
                 LocARNA       812        0.93   0.55   0.60   0.57
                 RNA Sampler   2361       0.90   0.55   0.64   0.59
                 RAF           87         0.95   0.66   0.66   0.66
group II intron  Murlet        962        0.78   0.75   0.76   0.75
                 LocARNA       250        0.74   0.79   0.65   0.72
                 RNA Sampler   1626       0.72   0.77   0.65   0.71
                 RAF           48         0.78   0.83   0.65   0.73
SRP              Murlet        20548      0.88   0.75   0.78   0.76
                 LocARNA       22467      0.85   0.66   0.70   0.68
                 RAF           1290       0.87   0.72   0.71   0.70
tRNA             Murlet        525        0.93   0.86   0.90   0.88
                 LocARNA       246        0.95   0.86   0.90   0.88
                 RNA Sampler   763        0.92   0.93   0.91   0.92
                 RAF           52         0.94   0.81   0.85   0.83
U5               Murlet        1772       0.84   0.69   0.75   0.72
                 LocARNA       549        0.80   0.56   0.61   0.58
                 RNA Sampler   4084       0.77   0.75   0.70   0.72
                 RAF           99         0.82   0.83   0.79   0.81

Figure 6.3: Performance comparison on BRAliBASE II datasets. The best number in each column is marked in bold.


Program              SP     Sens   PPV    MCC
CLUSTAL W + Alifold  0.81   0.57   0.73   0.65
FoldalignM           0.78   0.38   0.81   0.55
LocARNA              0.75   0.41   0.77   0.56
MASTR                0.84   0.64   0.73   0.68
Murlet               0.89   0.62   0.78   0.70
RNAforester          0.53   0.55   0.55   0.55
RNA Sampler          0.82   0.65   0.70   0.67
RAF                  0.88   0.68   0.77   0.72

Figure 6.4: Performance comparison on MASTR benchmarking sets. The best number in each column is marked in bold.

structure prediction program. For a number of different programs, pre-generated predic-

tions for each input file are available for download on the MASTR website. In addition

to scoring these pre-generated predictions, we also generated and scored predictions using

Murlet and RAF. The results are shown in Figure 6.4. In this benchmark set, RAF obtains

the highest overall MCC.

We emphasize, however, that benchmarking results such as these should be taken with

a grain of salt; both the BRAliBASE II and MASTR benchmarking sets are extremely

restricted in their coverage of the space of RNA families, choosing to focus on a few in-

dividual RNA families only. As a result, methods carefully tuned to the benchmarks may

perform less well on diverse RNA families not found in either of these benchmarks. By

using cross-validation, we improve the chances that RAF’s validation results really do in-

dicate reliable out-of-sample performance.

We also note that the performance of RAF on particular RNA families is often closely

related to the accuracy of the underlying alignment and single sequence models used to

derive folding and alignment constraints. Because the tools involved in the RAF pipeline

all rely on automatic parameter learning, RAF allows the possibility of learning custom

parameter sets well-suited for predictions on particular RNA families.


6.4 Discussion

We presented RAF, a new tool for simultaneous folding and alignment of RNA sequences

which exploits sparsity in base-pairing and alignment probability matrices and max-margin

training in order to achieve faster running times and higher accuracy than previous tools.

Besides its speed, one principal advantage of the RAF methodology is its use of a flex-

ible scoring function for combining an arbitrary set of functions into a coherent objective

function for alignment scoring. The ability to introduce new basis scoring functions into

the RAF scoring model means that there remains a rich space of possible features to ex-

plore.

In addition, the use of the max-margin framework to identify relevant linear combina-

tions of scoring functions has other promising potential applications. For example, Wallace

et al. (2006) recently introduced M-Coffee, a meta-algorithm for protein sequence align-

ment which combines the results of several different protein sequence alignment programs

using the T-Coffee framework. The difficulty of identifying appropriate weights for the

various programs used in the M-Coffee scoring scheme (i.e., some heuristically derived

tree-based weights the authors tried did not give a significant improvement in accuracy

over flat weights) led the authors to rely on a uniform weight model, treating programs known to be more accurate on equal footing with less accurate aligners. The max-margin

framework developed in this chapter obviates the need for heuristically-derived weights

altogether.


Chapter 7

Proximal regularization for online and

batch learning

Many learning algorithms rely on the curvature (in particular, strong convexity) of regular-

ized objective functions to provide good theoretical performance guarantees. In practice,

the choice of regularization penalty that gives the best testing set performance may result

in objective functions with little or even no curvature. In these cases, algorithms designed

specifically for regularized objectives often either fail completely or require some modifi-

cation that involves a substantial compromise in performance.

In this chapter, we present new online and batch algorithms for training a variety of su-

pervised learning models (such as SVMs, logistic regression, structured prediction models,

and CRFs) under conditions where the optimal choice of regularization parameter results

in functions with low curvature. We employ a technique called proximal regularization, in

which we solve the original learning problem via a sequence of modified optimization tasks

whose objectives are chosen to have greater curvature than the original problem. Theoreti-

cally, our algorithms achieve low regret bounds in the online setting and fast convergence in

the batch setting. Experimentally, our algorithms improve upon state-of-the-art techniques,

including Pegasos and bundle methods, on medium and large-scale SVM and structured

learning tasks.


7.1 Introduction

Consider the task of training a linear SVM:

\[
\min_{w \in \mathbb{R}^n} \; \frac{\lambda}{2} \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \max(0, 1 - y^{(i)} w^T x^{(i)}). \tag{7.1}
\]

In this optimization problem, the $L_2$ regularization penalty plays two important roles: not

only does the quadratic term prevent overfitting to the empirical loss on the training data,

but in fact, it also controls a measure of curvature of the objective function, known as its

strong convexity.

In the past several years, a number of approaches have been proposed for training linear

SVMs, ranging from batch methods such as the cutting plane algorithm of Joachims (2006) to online methods such as the PEGASOS subgradient algorithm of Shalev-Shwartz et al. (2007).

In essentially all of these algorithms (for which the relevant bounds are known), theory

indicates that the number of iterations required to obtain an ǫ-accurate solution is roughly

$O(1/(\lambda\epsilon))$. For example, cutting-plane methods require $O(1/(\lambda\epsilon))$ passes through the training set (Smola et al., 2007), whereas PEGASOS must process $O(1/(\lambda\epsilon))$ training examples to ensure ǫ accuracy in expectation. In both cases, the theoretical bounds depend largely on the chosen value of the regularization hyperparameter λ.

For many real world problems, however, the ideal choice of λ can be quite small. When this is the case, state-of-the-art cutting plane and subgradient algorithms give unaccept-

ably slow convergence, both in theory and in practice. Recently, Bartlett et al. (2007)

described an adaptive online gradient descent algorithm based on the simple intuition that

an objective function with low curvature can be stabilized by adding extra terms whose pur-

pose is to increase curvature. In this chapter, we extend these ideas to construct new online

and batch algorithms suitable for training a wide variety of supervised learning models.

Specifically, we design a sequence of optimization tasks, each of which is a variant

of the original problem modified to include an extra proximal regularization term. We

show how to choose these proximal terms in an adaptive fashion such that the resulting se-

quence of minimizers (or approximate minimizers) converges to the solution of the original

optimization problem. Finally, we describe some simple heuristic modifications to these


algorithms that retain all optimality guarantees while resulting in considerable performance

improvements in practice. In the online setting, our analysis leads naturally to a stochas-

tic subgradient-style algorithm along the lines of PEGASOS. In the batch setting, our analysis yields an improved cutting-plane/bundle method. Both in theory and in experiments, our methods exhibit comparable performance for large λ (high curvature) compared to existing methods and dramatic improvements for small λ (low curvature).

7.2 Preliminaries

Let $\|\cdot\|$ denote the Euclidean norm, $\|x\| := \sqrt{x^T x}$. Given a point $x \in \mathbb{R}^n$ and a compact (i.e., closed, bounded) subset $S \subseteq \mathbb{R}^n$, let $\Pi_S[x] := \arg\min_{y \in S} \|x - y\|$ denote the Euclidean projection of x onto S. For notational convenience, we use the shorthand $c_{a:b} := \sum_{i=a}^{b} c_i$ for any sequence of scalars $c_a, c_{a+1}, \ldots, c_{b-1}, c_b \in \mathbb{R}$. For a vector $x \in \mathbb{R}^n$ and $c \in \mathbb{R}$, let $[x; c] \in \mathbb{R}^{n+1}$ denote the concatenation of c onto the end of x. For $x, y \in \mathbb{R}^n$, let $x \succeq y$ denote the component-wise inequalities $x_i \ge y_i, \forall i$. Let $\mathbf{0}$ and $\mathbf{1}$ denote the vectors of all 0's and all 1's, respectively.

A function $f : \mathbb{R}^n \to \mathbb{R}$ is said to be λ-strongly convex if for any $x, y \in \mathbb{R}^n$ and any subgradient g belonging to the subdifferential $\partial f(x)$ of f at x, $f(y) \ge f(x) + g^T(y - x) + \frac{\lambda}{2} \|y - x\|^2$.¹ Here, we consider learning problems associated with the optimization of λ-strongly convex functions in both the online and batch settings.

In the online setting, we base our analyses on the concept of a convex repeated game. A convex repeated game is a two-player game consisting of T rounds. During round t, the first player proposes a vector $w_t$ belonging to some compact convex set S, the second player responds by choosing a $\lambda_t$-strongly convex function of the form $f_t(w) := \frac{\lambda_t}{2} \|w\|^2 + \ell_t(w)$ for some convex function $\ell_t$, and then the first player suffers loss $f_t(w_t)$. We assume that $\lambda_t \ge 0$, and that the same set S is used in each round; for simplicity, we assume throughout that S is an origin-centered closed ball of radius R. Here, we seek an algorithm to minimize the first player's regret, $\sum_{t=1}^{T} f_t(w_t) - \min_{u \in S} \sum_{t=1}^{T} f_t(u)$, i.e., the excess loss suffered

¹Recall that the subdifferential $\partial f(x)$ of a convex function $f : \mathbb{R}^n \to \mathbb{R}$ at x is the set of all vectors g such that $f(y) \ge f(x) + g^T(y - x)$ for all $y \in \mathbb{R}^n$; elements belonging to the subdifferential are known as subgradients.


Algorithm 1 Projected subgradient descent
  Initialize $w_1 \leftarrow 0$.
  for $t \leftarrow 1, \ldots, T$ do
    Receive a $\lambda_t$-strongly convex function $f_t$.
    Choose $g_t \in \partial f_t(w_t)$.
    Set $\eta_t \leftarrow 1/\lambda_{1:t}$.
    Set $w_{t+1} \leftarrow \Pi_S[w_t - \eta_t g_t]$.
  end for
  return $w_{T+1}$.

Algorithm 2 Proximal projected subgradient descent
  Initialize $w_1 \leftarrow 0$.
  for $t \leftarrow 1, \ldots, T$ do
    Receive a $\lambda_t$-strongly convex function $f_t$.
    Choose $g_t \in \partial f_t(w_t)$.
    Set $\tau_t \leftarrow \frac{1}{2}\left(-\lambda_{1:t} - \tau_{1:t-1} + \sqrt{(\lambda_{1:t} + \tau_{1:t-1})^2 + G_t^2 / R^2}\right)$.
    Set $\eta_t \leftarrow 1/(\lambda_{1:t} + \tau_{1:t})$.
    Set $w_{t+1} \leftarrow \Pi_S[w_t - \eta_t g_t]$.
  end for
  return $w_{T+1}$.

compared to the minimum loss possible for any fixed choice of $w \in S$.

In the batch setting, we are given a λ-strongly convex function of the form $f(w) = \frac{\lambda}{2} \|w\|^2 + \ell(w)$, where ℓ is again a convex function. Here, we will assume λ > 0 in order to ensure that the optimization problem is well-posed. If $w^* = \arg\min_{w \in \mathbb{R}^n} f(w)$, then our goal will be to find an approximate minimizer w such that $f(w) - f(w^*) \le \epsilon$.

7.3 Online proximal learning

As a starting point, we recall the projected subgradient algorithm for strongly convex

repeated games proposed by Hazan et al. (2007) and later generalized by Bartlett et al.

(2007), as stated in Algorithm 1. In this algorithm, the first player updates his parameter vector in each round by taking a projected subgradient step, $w_{t+1} \leftarrow \Pi_S[w_t - \eta_t g_t]$. When the step size is $\eta_t = 1/\lambda_{1:t}$, we obtain the following regret bound (Bartlett et al., 2007):


Lemma 1. Suppose that $\lambda_t > 0$ and $\|g_t\| \le G_t$ for $t = 1, \ldots, T$. Then, for any $u \in S$, Algorithm 1 satisfies

\[
\sum_{t=1}^{T} (f_t(w_t) - f_t(u)) \le \frac{1}{2} \sum_{t=1}^{T} \frac{G_t^2}{\lambda_{1:t}}. \tag{7.2}
\]

When $\lambda_t = \lambda$ and $G_t = G$ in each round, the right hand side of the inequality can be further upper-bounded by $\frac{G^2}{2\lambda}(1 + \log T)$. Algorithm 1, thus, is an example of an algorithm with logarithmic regret. When λ is small, however, this guaranteed regret can

still be large.

7.3.1 Proximal regret bound

Now, suppose we run Algorithm 1 on the sequence of modified functions,

\[
f'_t(w) := f_t(w) + \frac{\tau_t}{2} \|w - w_t\|^2 \tag{7.3}
\]

for some setting of constants $\tau_1, \ldots, \tau_T \in \mathbb{R}$. We refer to the additional quadratic term in each of our modified functions as a proximal regularization term. Whereas each $f_t$ is $\lambda_t$-strongly convex, each modified function $f'_t$ is $(\lambda_t + \tau_t)$-strongly convex. Also, since the gradient of the proximal regularization term is zero when evaluated at $w_t$, it follows immediately that $\partial f'_t(w_t) = \partial f_t(w_t)$. Thus, the updates in the proximal regularization

case differ from the non-proximal algorithm only in the choice of step sizes, since we can

still use the same subgradients.

The idea of adding temporary regularization terms in order to achieve better bounds

on the regret of a learning algorithm was first introduced in Bartlett et al. (2007), who

considered modified objective functions of the form

\[
f''_t(w) := f_t(w) + \frac{\tau_t}{2} \|w\|^2. \tag{7.4}
\]

Unlike in the proximal case, $\partial f''_t(w_t) \ne \partial f_t(w_t)$. In Section 7.5, we compare empirically these two choices.

To analyze the proximal regularization method, we apply Lemma 1 to the sequence of


functions in (7.3) to obtain

Corollary 1. Define

\[
R_T(\tau_1, \ldots, \tau_T) := \frac{1}{2} \sum_{t=1}^{T} \left[ 4 \tau_t R^2 + \frac{G_t^2}{\lambda_{1:t} + \tau_{1:t}} \right]. \tag{7.5}
\]

For any fixed $\tau_1, \ldots, \tau_T \ge 0$, running Algorithm 1 on the sequence of functions $f'_1, \ldots, f'_T$ from (7.3) gives

\[
\sum_{t=1}^{T} (f_t(w_t) - f_t(u)) \le R_T(\tau_1, \ldots, \tau_T). \tag{7.6}
\]

Here, the proof depends on the fact that $\|w - w_t\| \le 2R$ for any $w, w_t \in S$. The strength of the regret bound depends on the choice of constants $\tau_1, \ldots, \tau_T$. The key to the proximal regularization algorithm, then, is picking these constants so as to ensure that the

regret is small.

7.3.2 Choosing proximal parameters

Suppose that the values $\lambda_t$ and $G_t$ for $t = 1, \ldots, T$ are determined independently of the choices made in the algorithm. We describe two approximate schemes for choosing the $\tau_t$'s. The first scheme is a practical online balancing heuristic due to Bartlett et al. (2007). The second scheme makes the additional assumption that $\lambda_t$ and $G_t$ do not vary with t, but has the benefit of allowing us to choose the $\tau_t$'s so that the regret bound is as tight as

possible.

Strategy 1: Balancing heuristic. In the first approach, observe that the expression

in (7.5) consists of two terms, one of which increases and one of which decreases as $\tau_t$ increases. During the tth step of the algorithm, consider the choice of $\tau_t \ge 0$ that ensures that the two terms are equal, i.e., $2 \tau_t R^2 = \frac{G_t^2}{2(\lambda_{1:t} + \tau_{1:t})}$. This is a quadratic equation, with

positive solution,

\[
\tau_t = \frac{1}{2} \left( -\lambda_{1:t} - \tau_{1:t-1} + \sqrt{(\lambda_{1:t} + \tau_{1:t-1})^2 + \frac{G_t^2}{R^2}} \right).
\]


In Algorithm 2, we provide pseudocode for the proximal projected subgradient descent

algorithm using the balancing heuristic. Applying Lemma 3.1 from Bartlett et al. (2007),

we obtain the following bound:

Theorem 1. The regret obtained by Algorithm 2 is at most twice that of the optimal offline choice of $\tau_1, \ldots, \tau_T$, i.e.,

\[
\sum_{t=1}^{T} (f_t(w_t) - f_t(u)) \le 2 \min_{\tau_1, \ldots, \tau_T} \frac{1}{2} \sum_{t=1}^{T} \left[ 4 \tau_t R^2 + \frac{G_t^2}{\lambda_{1:t} + \tau_{1:t}} \right].
\]

For comparison, Bartlett et al. (2007) derived a bound of

\[
\sum_{t=1}^{T} (f_t(w_t) - f_t(u)) \le \min_{\tau_1, \ldots, \tau_T} \sum_{t=1}^{T} \left[ 3 \tau_t R^2 + \frac{2 G_t^2}{\lambda_{1:t} + \tau_{1:t}} \right],
\]

when using the modified functions in (7.4). These two expressions are not directly com-

parable, though as we show in Section 7.5, the proximal algorithm performs better in our

experiments.

Strategy 2: Bound optimization. In the second approach, we bound the regret directly,

via the following proposition:

Proposition 1. Let
\[
(\tau_1^*, \ldots, \tau_T^*) = \arg\min_{\tau_1, \ldots, \tau_T \ge 0} R_T(\tau_1, \ldots, \tau_T). \tag{7.7}
\]

Then $\tau_i^* = 0$ for all $i \ne 1$.

The benefit of the above proposition is that it allows us to reduce an optimization over

many variables to a much simpler convex optimization problem over just a single variable,

$\tau_1^*$ (which we simply call τ). When λ = 0, the bound reduces to $2 \tau R^2 + \sum_{t=1}^{T} G_t^2 / (2\tau)$, whose minimum occurs at $\tau = \sqrt{\sum_{t=1}^{T} G_t^2 / (4 R^2)}$. Otherwise, if $\lambda_t = \lambda > 0$ and $G_t = G$,

then we can upper bound the regret with a simple closed form expression, parameterized

by τ :


Theorem 2. Under the above assumptions, let $R_T$ denote the worst-case regret suffered by Algorithm 2. Then, for any $\tau > 0$, we have the upper bound $R_T \le B(\tau)$, where

\[
B(\tau) := 4 \tau R^2 + \frac{G^2}{\lambda} \left[ \frac{1}{1 + \tau/\lambda} + \log\left( \frac{T + \tau/\lambda}{1 + \tau/\lambda} \right) \right].
\]

Since the upper bound is a convex differentiable function of τ over the domain $\tau \ge 0$,

one could optimize the bound directly using standard line search techniques. Alternatively,

by substituting different values for τ into the expression above, we can obtain various upper

bounds on the regret that Algorithm 2 will achieve. In particular,

Corollary 2. When $\tau = 0$, then $B(\tau) = \frac{G^2}{\lambda}(1 + \log T)$.

Corollary 3. When $\tau = \frac{G\sqrt{T}}{2R}$, then $\lim_{\lambda \to 0} B(\tau) = 4RG\sqrt{T}$.

The key intuition behind the efficiency of Algorithm 2 is that in some cases, one of these bounds may be better than the other. For example, when RG is sufficiently small relative to $G^2/\lambda$, the seemingly inferior square root bound can actually be better than the logarithmic regret bound for values of T that are not too large. Regardless of the situation, Theorem 1 implies that Algorithm 2 achieves a total regret no worse than twice the best bound for any τ.
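As a quick numerical illustration, one can evaluate $B(\tau)$ directly and minimize it with a crude line search over a logarithmic grid; the constants R, G, lam, T below are arbitrary illustrative values, not ones used in the experiments.

# A sketch of optimizing the closed-form bound B(tau) from Theorem 2.
import math

def B(tau, R, G, lam, T):
    r = tau / lam
    return 4 * tau * R**2 + (G**2 / lam) * (1.0 / (1 + r) + math.log((T + r) / (1 + r)))

R, G, lam, T = 1.0, 1.0, 1e-4, 10**6
grid = [0.0] + [10.0**k for k in range(-6, 7)] + [G * math.sqrt(T) / (2 * R)]
tau_best = min(grid, key=lambda tau: B(tau, R, G, lam, T))
print(tau_best, B(tau_best, R, G, lam, T))  # compare against B(0), the logarithmic bound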

7.3.3 Application: Linear SVMs

In this section, we consider the task of training a linear SVM. The approach we take here

was inspired by the Pegasos algorithm of Shalev-Shwartz et al. (2007), currently regarded as

one of the fastest methods for SVM training on large-scale datasets. At its core, the Pegasos

algorithm is essentially a wrapper for Algorithm 1.

Given training inputs $\{x^{(i)}, y^{(i)}\}_{i=1}^{m}$, the Pegasos algorithm defines a sequence of functions $f_1, \ldots, f_T$. In the tth round, Pegasos randomly samples a subset $A_t$ of fixed size k from $\{1, 2, \ldots, m\}$, and defines $f_t(w)$ to be

\[
\frac{\lambda}{2} \|w\|^2 + \frac{1}{|A_t|} \sum_{i \in A_t} \max(0, 1 - y^{(i)} w^T x^{(i)}), \tag{7.8}
\]


and runs Algorithm 1 with $\lambda_t = \lambda$, $G_t = \sqrt{\lambda} + \max_i \|x^{(i)}\|$, $R = 1/\sqrt{\lambda}$, and $S = \{w \in \mathbb{R}^n : \|w\| \le 1/\sqrt{\lambda}\}$. As shown in Shalev-Shwartz et al. (2007), one can guarantee using a strong duality argument that the optimal solution will always have norm at most $1/\sqrt{\lambda}$, so using S as the feasible set does not impose any additional restrictions.
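A compact sketch of this construction as a mini-batch SVM trainer; the NumPy-based data representation (a matrix X of inputs and a vector y of +/-1 labels) is an assumption of this example.

# A sketch of Pegasos-style SVM training via projected subgradient steps.
import numpy as np

def pegasos_svm(X, y, lam=1e-3, k=8, T=10000, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    R = 1.0 / np.sqrt(lam)            # feasible set: ball of radius 1/sqrt(lambda)
    w = np.zeros(n)
    for t in range(1, T + 1):
        A = rng.integers(m, size=k)   # random mini-batch A_t
        margins = y[A] * (X[A] @ w)
        viol = margins < 1            # examples with nonzero hinge loss
        # Subgradient of f_t(w) = (lambda/2)||w||^2 + hinge loss on A_t.
        g = lam * w - (y[A][viol, None] * X[A][viol]).sum(axis=0) / k
        w -= g / (lam * t)            # step size 1/(lambda * t)
        norm = np.linalg.norm(w)
        if norm > R:                  # projection onto the feasible ball
            w *= R / norm
    return w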

To characterize the relationship between the strongly convex game defined by Pegasos

and the linear SVM training problem, we state the following theorem and its corollary, both

of whose proofs closely mirror those of Theorems 2 and 3 from Shalev-Shwartz et al. (2007):

Theorem 3. Letf : Rn → R be a (strongly) convex function, letS ⊆ R

n be compact, and

supposew∗ := arg minw∈S f(w). LetA be an algorithm for (strongly) convex repeated

games with regret boundRT . Now, suppose we runA on a sequence of (strongly) convex

functionsf1, . . . , fT which satisfy, for allt ∈ 1, . . . , T, (1) Ef1,...,ft,w1,...,wt[ft(w)] ≤

f(w) for all w ∈ S; and (2)Ef1,...,ft,w1,...,wt−1|wt[ft(wt)] = f(wt). If r is drawn uniformly

at random from1, . . . , T, then

ErEf1,...,ft,w1,...,wT[f(wr)− f(w∗)] ≤ RT

T. (7.9)

Informally, this result provides an estimate of the average suboptimality Pegasos obtains in terms of the existing regret bound for its underlying algorithm for convex games. We note that a version of the theorem replacing w_r with w̄ = (1/T) ∑_{t=1}^T w_t follows easily from Jensen's inequality. Though this leads to a potentially more stable version of Algorithm 3, the resulting algorithm in practice often converges less quickly and may be less computationally efficient to implement (for problems where the feature vectors x_i are sparse).

Using Markov's inequality, it turns out that convergence in expected suboptimality implies convergence to optimality with high probability in the following sense (apply Markov's inequality to the nonnegative random variable f(w_r) − f(w*), whose expectation is at most R_T/T by Theorem 3):

Proposition 2. Let δ ∈ (0, 1). Under the conditions above, with probability at least 1 − δ, f(w_r) − f(w*) ≤ R_T/(δT).

Now, we turn to the task of converting Algorithm 2 into an SVM solver, in the same manner as Pegasos. This time, we again assume that the functions f_1, . . . , f_T are sampled in the same manner as for the Pegasos algorithm, and for now, we assume the same


settings of the constants λ_t and G_t. We now analyze the efficiency of our optimization algorithm by characterizing the number of iterations needed to guarantee ε-optimality with high probability.

Using Corollaries 2 and 3 from Section 7.3.2, and applying Proposition 2, we have:

• With probability at least 1 − δ, f(w_r) − f(w*) ≤ G²(1 + log T)/(δλT). To ensure that the right hand side is no greater than ε requires T ≥ O(G²/(δλε)) iterations.

• With probability at least 1 − δ, f(w_r) − f(w*) ≤ 4RG/(δ√T). To ensure that the right hand side is no greater than ε requires T ≥ 16R²G²/(δ²ε²) iterations.

In the first bound, we recover the O(G²/(δλε)) convergence rate of the Pegasos algorithm. In the second bound, we recover the O(R²G²/(δ²ε²)) rate of Zinkevich (2003), which, at least at first glance, appears not to depend on λ, suggesting that perhaps the proximal algorithm ought to give improved convergence when λ is small. On closer examination, however, the dependence on λ is "hidden" inside the R = 1/√λ bound from the Pegasos algorithm. Making this dependence explicit, we achieve a rate of only O(G²/(δ²λε²)). These results are not particularly surprising, given the recent minimax analysis of Abernethy et al. (2008), who showed that under certain assumptions, the regret bound of the regular projected subgradient algorithm is worst-case optimal.

Here, the weak link in our analysis is the dependence of R on λ. In practice, however, the bound R = 1/√λ is often quite loose. Knowing ahead of time the norm of w* = arg min_w f(w) would help by allowing us to define a smaller feasible set S and thus obtain tighter bounds.

7.3.4 An optimistic strategy

With the above motivation in mind, we propose the adaptive strategy shown in Algorithm 3. In this method, we assume that we are initially given some desired level of suboptimality ε. Optimization proceeds in several phases. At the beginning of each phase, we "hypothesize" a setting of R. During each phase, we run the proximal projected subgradient strategy until either (1) ‖w_t‖ gets "close" to R, forcing us to increase R by a factor of √2 and start a new phase; or (2) enough iterations pass without this occurring, allowing us to declare


convergence. The algorithm is "optimistic" in the sense that it initially assumes R to be small and only increases it as necessary. One can prove that:

Lemma 2. Suppose that some particular phase ends without any increase in R. Define w* = arg min_{w∈S} f(w). Let r be chosen uniformly at random from {1, . . . , T}. Then with probability at least 1 − δ, w_r is ε-optimal, i.e., f(w_r) − f(w*) ≤ ε.

Theorem 4. For ε < 1/2, Algorithm 3 terminates after processing at most O(G²/(δλε)) examples; with probability at least 1 − δ, the resulting parameters w_r will be ε-optimal.

Our analysis thus shows that the modified algorithm, in the worst case, is asymptotically equivalent to Pegasos up to logarithmic factors. In practice, however, Algorithm 3 can be significantly faster when ‖w*‖ ≪ 1/√λ. In these cases, the algorithm will tend to operate in the regime of small R, and will achieve O(R²G²/(δ²ε²)) regret, independent of λ. (As an anecdotal example, on the "combined" dataset in our experiments, the parameter norm bound corresponding to the λ which gave the best test set performance was 1/√λ ≈ 3 × 10⁵, whereas ‖w*‖ = 12.94. On this run, the proximal algorithm estimated an upper bound of R = 16.)

7.4 Batch proximal learning

In the batch learning setting, we are no longer presented with a sequence of objective functions but rather are given a single λ-strongly convex objective function f : R^n → R that we would like to optimize. Batch algorithms are often appropriate when the training set is not particularly large, but the cost of inference with respect to any individual training example is high. This type of scenario occurs frequently in structured prediction problems, where inference may involve either a computationally expensive dynamic programming step, or even solving a combinatorial optimization problem as a subroutine.

The prototypical batch algorithm from which we start is the cutting plane optimization method of Joachims (2006) as reformulated and generalized in Teo et al. (2007) and Smola et al. (2007). In this method, f is assumed to be everywhere nonnegative, and one creates


Dataset     m_train   m_test    n       λ_best   Best training loss               Eff. iterations to convergence
                                                 Pegasos   Adaptive  Proximal     Pegasos   Adaptive  Proximal
a9a         32,561    16,281    123     10^-4    0.3537    0.3531    0.3533       28        19        18
combined    78,823    19,705    100     10^-9    0.5299    0.2760    0.2336       100       99        8
connect-4   54,045    13,512    126     10^-7    6.8229    0.9698    0.5136       n/a       99        63
covtype     464,808   116,204   54      10^-8    1.4852    0.7217    0.5830       n/a       96        12
ijcnn1      35,000    91,701    22      10^-7    0.3582    0.2088    0.1857       89        98        3
mnist       60,000    10,000    780     10^-5    0.1200    0.1033    0.1012       75        28        3
rcv1        20,242    677,399   47,236  10^-7    0.0084    0.0035    0.0487       53        10        83
real-sim    57,846    14,463    20,958  10^-5    0.0602    0.0602    0.0602       6         5         7
w8a         49,749    14,951    300     10^-8    1.5146    0.1391    0.1292       n/a       45        13

Table 7.1: Convergence of Pegasos, Adaptive, and Proximal on nine binary classification tasks. The second through fifth columns give the size of the training and testing sets, the number of features, and the optimal regularization parameter. The last two sets of three columns report the best SVM training loss, f̄ = min_{t∈{1,...,T}} f(w_t), seen for each tested algorithm, and the number of iterations needed to reduce the initial objective function by 0.99(f(w_1) − f̄). n/a is reported for cases where the optimizer failed to find a better objective than the starting parameter set. The best numbers in each group are shown in bold.

a sequence of lower-bound approximations to f of the form

P_t(w) = (λ/2)‖w‖² + max(0, max_{i∈{1,...,t}} (a_i^T w + b_i)).

Initially, w_1 = 0. During each iteration t ∈ {1, . . . , T}, a_t and b_t are chosen so that a_t^T w + b_t is the first-order Taylor expansion of ℓ(w) at w_t, and w_{t+1} is chosen to be the minimizer of P_t. To date, the best convergence results known for bundle methods state that at most O(1/(λε)) iterations are needed to achieve ε-optimality, as proved in Teo et al. (2007) and Smola et al. (2007). However, when λ ≈ 0, the number of iterations needed can still be very large, just as in the online case.
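The cutting-plane construction is easy to mimic in code. The sketch below (our own, with hypothetical names) accumulates first-order Taylor lower bounds on a convex, nonnegative loss ℓ and evaluates the resulting model P_t; since each plane underestimates ℓ, P_t(w) ≤ f(w) = (λ/2)‖w‖² + ℓ(w) everywhere.

import numpy as np

def cutting_plane_model(lam):
    planes = []  # pairs (a, b) with a^T w + b <= loss(w) for all w

    def add_plane(w_t, loss_val, grad):
        # First-order Taylor expansion of the loss at w_t: a_t^T w + b_t.
        a = np.asarray(grad, dtype=float)
        b = float(loss_val) - a @ w_t
        planes.append((a, b))

    def P(w):
        # P_t(w) = (lam/2)||w||^2 + max(0, max_i a_i^T w + b_i)
        best = max((a @ w + b for a, b in planes), default=0.0)
        return 0.5 * lam * (w @ w) + max(0.0, best)

    return add_plane, P

# Sanity check on a 1-d hinge loss: the model never exceeds the true objective.
loss = lambda w: max(0.0, 1.0 - 2.0 * w[0])
grad = lambda w: np.array([-2.0]) if 1.0 - 2.0 * w[0] > 0 else np.array([0.0])
add_plane, P = cutting_plane_model(lam=0.1)
for wt in (np.array([0.0]), np.array([1.0]), np.array([-1.0])):
    add_plane(wt, loss(wt), grad(wt))
for w in (np.array([v]) for v in (-2.0, 0.0, 0.5, 2.0)):
    assert P(w) <= 0.5 * 0.1 * (w @ w) + loss(w) + 1e-9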

To counter these problems, we propose a proximal bundle method, as shown in Algorithm 4. In particular, consider the sequence of primal and dual optimization problems,

min_{w∈R^n} P′_t(w) for t = 1, 2, . . . , T (7.10)

max_{α∈R^t : α ⪰ 0, α^T 1 ≤ t} D′_t(α) for t = 1, 2, . . . , T (7.11)


where

P′_t(w) := t · P_t(w) + ∑_{i=1}^t (τ_i/2)‖w − w_i‖² (7.12)

D′_t(α) := ∑_{i=1}^t ((τ_i/2)‖w_i‖² + α_i b_i) − ‖∑_{i=1}^t (τ_i w_i − α_i a_i)‖² / (2(λt + τ_{1:t})) (7.13)

for some constants τ_1, . . . , τ_T, and where w_{t+1} := arg min_{w∈R^n} P′_t(w). If we define α_t = arg max_{α∈R^t : α ⪰ 0, α^T 1 ≤ t} D′_t(α), then the two optima are related by

w_{t+1} = ∑_{i=1}^t (τ_i w_i − α_{t,i} a_i) / (λt + τ_{1:t}) (7.14)

using strong duality.

In most cutting plane analyses, convergence rates are established by lower-bounding the dual improvement in each iteration, and then arguing that only a limited number of iterations can occur before some termination criterion (e.g., a primal-dual gap) is satisfied. Here, we again use the dual improvement argument, though we obtain somewhat different results, given that the dual objective function changes after each iteration due to the changing proximal regularization terms. Our analysis is closely related to the online learning framework of Kakade and Shalev-Shwartz (2008).

Lemma 3. Let w_1, . . . , w_{t−1} ∈ R^n and α ∈ R^{t−1} be vectors such that α ⪰ 0 and α^T 1 ≤ t − 1. If we define w_t := ∑_{i=1}^{t−1} (τ_i w_i − α_i a_i) / (λ(t − 1) + τ_{1:t−1}), then

D′_t([α; 1]) − D′_{t−1}(α) = f(w_t) − ‖λw_t + a_t‖² / (2(λt + τ_{1:t})). (7.15)

Using this lower bound, we can then bound the best suboptimality obtained by our algorithm after t steps:

Proposition 3. Let w* = arg min_{w∈R^n} f(w). Suppose that ‖w_t‖ ≤ R and ‖a_t‖ ≤ A_t for t = 1, . . . , T. Then,


min_{t∈{1,...,T}} f(w_t) − f(w*) ≤ (1/T) ∑_{t=1}^T [ 2τ_t R² + (λR + A_t)² / (2(λt + τ_{1:t})) ].

Remarkably, the suboptimality guarantees in the proposition above have essentially the same form as the regret bounds stated in Corollary 1. As a result, we can make use of the balancing heuristic for choosing the proximal constants τ_1, . . . , τ_T. Furthermore, the optimistic strategy for bounding the optimal parameter norm, as described in Section 7.3.4, also carries over with little modification. For the sake of space, we show only the proximal bundle method using the balancing heuristic in Algorithm 4; we do not give pseudocode for the optimistic extension explicitly. Using Proposition 3 and the argument in Theorem 1, we have the following theorem:

Theorem 5. Suppose that ‖w_t‖ ≤ R and ‖a_t‖ ≤ A for t = 1, . . . , T. Then, Algorithm 4 achieves

min_{t∈{1,...,T}} f(w_t) − f(w*) ≤ (λR + A)²(1 + log T) / (λT).

Provided that λR + A = O(1), our analysis yields a worst-case convergence rate of O(1/(λT)), matching the convergence rate of our online algorithm, as well as the best known convergence rates for bundle methods.

We note that the idea of stabilizing standard bundle method algorithms to improve convergence has been suggested previously in the bundle method literature. Proximal bundle methods originated with Kiwiel (1983), and are closely related to trust region (Schramm and Zowe, 1992) and level set (Lemarechal et al., 1995) techniques for bundle method improvement. In practice, each of these prior methods requires considerable parameter tuning on the part of the user. In contrast, our bundle algorithm is straightforward, with the curvature terms τ_t automatically chosen in order to minimize a regret bound.


7.5 Experiments

We carried out two sets of experiments with proximal algorithms. For the first set of tasks, we tested online algorithms for large-scale binary classification. For the second set of tasks, we performed batch training of structured output SVMs for RNA folding and web ranking. In both the online and batch cases, we ran the optimistic version of our proximal algorithm, setting ε = 0 and δ = 1, stopping after a fixed number of iterations, and returning w_{T+1} instead of w_r, as in Shalev-Shwartz et al. (2007).

7.5.1 Online learning with binary classification

In this experiment, we tested the behavior of our algorithm on nine binary classification datasets (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). For each dataset where a binary classification version was not available, we reduced multiclass to a single class vs. rest problem. When separate testing sets were not available, we reserved 80% of the data set for training and 20% for testing. For each of these datasets, we first determined the optimal setting λ_best for ensuring good generalization performance using cross-validation. We then compared the Proximal online algorithm against Pegasos (Shalev-Shwartz et al., 2007) and Adaptive online gradient descent (Bartlett et al., 2007) by running each algorithm for 100 effective iterations (where one pass through the entire dataset is an effective iteration) under a variety of regularization parameter settings.

In Table 7.1, we provide some statistics on the training and testing datasets used. We record the best objective value f̄ = min_{t∈{1,...,T}} f(w_t) obtained by each algorithm over the 100 effective iterations. We also record the number of iterations needed to achieve an objective function reduction of 0.99(f(w_1) − f̄) for each algorithm. The Proximal algorithm achieves the best objective on 7 out of 9 datasets, while consistently requiring few effective iterations.

Figures 7.1 and 7.2 show learning curve plots and test error plots for two of the datasets (combined and covtype). As shown, the proximal algorithm enjoys a comfortable advantage over the other methods, especially for small λ_best.


[Figure 7.1: six log-log panels of regularized training loss (y-axis) versus effective iterations (x-axis), comparing Pegasos, Adaptive, and Proximal.]

Figure 7.1: Convergence of Pegasos, Adaptive, and Proximal for combined and covtype. Left column: combined; right column: covtype. Each row corresponds to a regularization parameter λ. Bottom row: λ = λ_best; middle row: λ = 10λ_best; top row: λ = 100λ_best. Effective iterations are shown on the x-axis.


[Figure 7.2: six log-log panels of classification error (y-axis) versus effective iterations (x-axis), comparing Pegasos, Adaptive, and Proximal.]

Figure 7.2: Test errors of Pegasos, Adaptive, and Proximal for combined and covtype during the course of optimization. Each row corresponds to a regularization parameter λ. Bottom row: λ = λ_best; middle row: λ = 10λ_best; top row: λ = 100λ_best. Effective iterations are shown on the x-axis.


7.5.2 Batch learning with RNA folding and web ranking

In this experiment, we compared our batch proximal learning algorithm against standard bundle algorithms (Smola et al., 2007) for learning RNA folding and web search ranking models. Both of these problems can be formulated as nonsmooth structured SVMs (Chapelle et al. (2007) for ranking and Do et al. (2006b) for RNA folding). To date, the fastest approaches for dealing with this type of nonsmooth optimization problem are cutting plane/bundle methods (e.g., SVMPerf (Joachims, 2006) and BMRM (Teo et al., 2007)).

In the RNA folding experiment, the dataset contained RNA sequences taken from 151 separate RNA families (Do et al., 2006b), and we used a model with approximately 350 distinct features based largely on existing thermodynamic scoring schemes for RNA folding. In the ranking experiment, the dataset contained 1000 queries for training and 1000 queries for validation, with an average of 50 documents per query. In both cases, we compared the performance of the proximal bundle method against the standard bundle method for various values of λ.

Figure 7.3 shows training loss curves depicting the best training loss obtained so far for a standard bundle method compared to our proximal variant. In both methods, many iterations pass before the algorithms are able to identify parameters which improve upon the initial parameter set; for the standard bundle method, this problem is especially pronounced for small regularization parameters. Again, the results show that the proximal variant significantly outperforms the standard algorithm, especially when λ is small.

7.6 Discussion

Functions with low curvature are the Achilles' heel of optimization algorithms in machine learning. In this chapter, we proposed new online and batch learning algorithms, which sequentially modify the objective functions used during optimization. By choosing these modified tasks carefully, our methods ensure that (1) the sequence of solutions given by these modified tasks will lead to a good approximate minimizer of the original optimization problem, and (2) the regret bounds obtained in the online setting and the convergence rates obtained in the batch setting will be improved due to the increased curvature.


[Figure 7.3: eight log-log panels of regularized training loss (y-axis) versus iterations (x-axis), comparing Bundle and Proximal Bundle.]

Figure 7.3: Comparison of a standard bundle method to the proximal bundle method for structured SVM learning with various choices of λ. Left column: RNA folding; right column: web ranking. Each row corresponds to a regularization parameter λ. The regularization parameter decreases from top to bottom.


The idea of adding curvature in order to improve regret bounds for online algorithms was introduced in Bartlett et al. (2007), and the online algorithmic schemes proposed there have much in common with the basic online methods proposed here. We apply these ideas to the problem of training linear SVMs and structured prediction models, where we introduce a new adaptive strategy for optimistically bounding the norm of the optimal parameters. We also transfer these ideas to the batch setting, where we present improved bundle methods for structured learning.

Experimentally, we show that the problem of low curvature is not simply a matter of theoretical concern. Rather, for many real-world large-scale learning problems, the optimal regularization penalty (as determined by holdout cross-validation) is often very small. For problems where high regularization is appropriate (e.g., when the dimensionality of the data is large relative to the number of training examples), our algorithm performs as well as the best existing methods, such as Pegasos. When low regularization is needed, however, our algorithm offers dramatic improvements over state-of-the-art techniques, converging in a few passes through the dataset when other algorithms may fail to converge at all.


Algorithm 3 Optimistic proximal SVM solver

input: Training set {(x^(i), y^(i))}_{i=1}^m
       Regularization parameter λ
       Desired suboptimality ε
       Allowed failure probability δ
       Mini-batch size k

Define S := {w ∈ R^n : ‖w‖ ≤ 1/√λ}.
Set G ← max_i ‖x^(i)‖ + √λ.
Set δ ← δ/(3 − log₂ λ).
Initialize R ← min(1, 1/√λ).
Initialize w_1 ← 0.
repeat
    Set CONVERGED ← true.
    Find the smallest T such that min(G²(1 + log T)/(δλT), 4RG/(δ√T)) ≤ ε.
    for t ← 1, . . . , T do
        Sample A_t ⊆ {1, . . . , m} such that |A_t| = k.
        Define f_t(w) according to (7.8).
        Choose g_t ∈ ∂f_t(w_t).
        Set τ_t ← (−(λ_{1:t} + τ_{1:t−1}) + √((λ_{1:t} + τ_{1:t−1})² + G²/R²)) / 2.
        Set η_t ← 1/(λ_{1:t} + τ_{1:t}).
        Set w_{t+1} ← Π_S[w_t − η_t g_t].
        if ‖w_{t+1}‖ ≥ R − √(2ε/λ) then
            Set R ← √2 · R.
            Set CONVERGED ← false.
            break
        end if
    end for
until CONVERGED
Choose r uniformly at random from {1, . . . , T}.
return w_r.
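A compact Python rendering of Algorithm 3 might look as follows. This is a sketch under our reading of the pseudocode above (assuming ε > 0 so that the phase length T is finite), not the thesis's released implementation; the projection Π_S clips w onto the ball of radius 1/√λ, and subgradients follow (7.8).

import numpy as np

def optimistic_proximal_svm(X, y, lam, eps, delta, k, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    G = np.linalg.norm(X, axis=1).max() + np.sqrt(lam)
    delta = delta / (3.0 - np.log2(lam))  # split failure probability across phases
    R = min(1.0, 1.0 / np.sqrt(lam))
    ball = 1.0 / np.sqrt(lam)             # radius of the feasible set S
    w = np.zeros(n)
    while True:
        converged = True
        T = 1                             # smallest T (up to a factor of 2) making
        while min(G**2 * (1.0 + np.log(T)) / (delta * lam * T),
                  4.0 * R * G / (delta * np.sqrt(T))) > eps:
            T *= 2                        # ... either regret bound at most eps
        lam_sum = tau_sum = 0.0
        iterates = []
        for t in range(1, T + 1):
            A = rng.choice(m, size=k, replace=False)
            margins = y[A] * (X[A] @ w)
            active = margins < 1.0
            g = lam * w - (y[A][active, None] * X[A][active]).sum(axis=0) / k
            lam_sum += lam
            sigma = lam_sum + tau_sum
            tau = 0.5 * (-sigma + np.sqrt(sigma**2 + G**2 / R**2))  # balancing heuristic
            tau_sum += tau
            w = w - g / (lam_sum + tau_sum)
            norm = np.linalg.norm(w)
            if norm > ball:               # projection onto S
                w = w * (ball / norm)
            iterates.append(w.copy())
            if np.linalg.norm(w) >= R - np.sqrt(2.0 * eps / lam):
                R *= np.sqrt(2.0)         # the hypothesis on ||w*|| was too optimistic
                converged = False
                break
        if converged:
            return iterates[rng.integers(len(iterates))]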


Algorithm 4 Proximal bundle method

Initialize w_1 ← 0.
for t ← 1, . . . , T do
    Choose a_t ∈ ∂ℓ(w_t).
    Set b_t ← ℓ(w_t) − a_t^T w_t.
    Set τ_t ← (−(λt + τ_{1:t−1}) + √((λt + τ_{1:t−1})² + (λR + A_t)²/R²)) / 2.
    Compute α_t = arg max_{α∈R^t : α ⪰ 0, α^T 1 ≤ t} D′_t(α).
    Set w_{t+1} = ∑_{i=1}^t (τ_i w_i − α_{t,i} a_i) / (λt + τ_{1:t}).
end for
return w_{T+1}.
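Here is a sketch of Algorithm 4 in the same style (again ours, not the thesis code): it cuts a plane at each iterate, picks τ_t by the balancing rule, and delegates the constrained dual maximization of D′_t from (7.13) to a general-purpose solver. The use of scipy's SLSQP is our own choice; a specialized QP solver would be faster.

import numpy as np
from scipy.optimize import minimize

def proximal_bundle(loss, grad, n, lam, R, A_bound, T):
    w = np.zeros(n)
    ws, a_s, b_s, taus = [w.copy()], [], [], []
    for t in range(1, T + 1):
        a = grad(w)
        b = loss(w) - a @ w                     # plane: a^T w' + b <= loss(w')
        a_s.append(a); b_s.append(b)
        sigma = lam * t + sum(taus)
        tau = 0.5 * (-sigma + np.sqrt(sigma**2 + (lam * R + A_bound)**2 / R**2))
        taus.append(tau)                        # balancing heuristic for tau_t
        W, Amat = np.array(ws), np.array(a_s)
        bvec, tvec = np.array(b_s), np.array(taus)
        denom = lam * t + tvec.sum()
        const = 0.5 * (tvec * (W * W).sum(axis=1)).sum()
        base = (tvec[:, None] * W).sum(axis=0)  # sum_i tau_i w_i

        def neg_dual(alpha):                    # -D'_t(alpha), per (7.13)
            v = base - Amat.T @ alpha           # sum_i (tau_i w_i - alpha_i a_i)
            return -(const + alpha @ bvec - (v @ v) / (2.0 * denom))

        res = minimize(neg_dual, np.zeros(t), method='SLSQP',
                       bounds=[(0.0, None)] * t,
                       constraints=[{'type': 'ineq',
                                     'fun': lambda al, tt=t: tt - al.sum()}])
        w = (base - Amat.T @ res.x) / denom     # primal-dual relation (7.14)
        ws.append(w.copy())
    return w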


Chapter 8

Conclusion

The development of computational approaches for analyzing biological sequences raises a number of challenges, ranging from problem definition to model building and inference. In this thesis, we focused on the task of estimating structured models from empirical data. Specifically, we examined three key topics in computational biology: protein sequence alignment (Chapter 3), RNA secondary structure prediction (Chapter 4), and RNA simultaneous alignment and folding (Chapter 6). For each, we demonstrated that appropriately designed and trained structured models can match or exceed state-of-the-art accuracy. In addition, we developed techniques for improving the effectiveness of learning algorithms for structured models, both by enhancing optimization efficiency (Chapter 7) and by reducing overfitting through regularization penalty design (Chapter 5).

In 2004, when the work described in this thesis began, structured discriminative models had not yet attracted significant attention in the computational biology community. Conditional random fields, though somewhat more familiar in natural language processing (NLP) (Sutton and McCallum, 2007), had only a few applications to computational biology, in the domains of RNA structural alignment (Sato and Sakakibara, 2005), protein fold recognition (Liu et al., 2006), and, in a limited fashion, gene prediction (Culotta et al., 2005). Early work with max-margin models for sequence alignment focused primarily on mathematical feasibility rather than practical application to real biological sequences (Joachims et al., 2005). Within this context, the work described in this thesis thus


represents some of the earliest and arguably most successful applications of structured discriminative techniques to real-world problems in computational biology. For each of the three computational biology tasks, we developed practical tools with state-of-the-art accuracy; each of these tools is available as open source software, and a webserver can be found online at http://contra.stanford.edu/.

Over the last several years, discriminative structured models have seen tremendous growth in popularity in computational biology applications. As of today, discriminative modeling techniques have profitably been applied to a wide range of biological problems, including gene prediction (DeCaprio et al., 2007; Gross et al., 2007; Bernal et al., 2007), spliced sequence alignment (Schulze et al., 2007), and disulfide bond prediction (Taskar et al., 2005), among others. In this thesis, we addressed many of the problems associated with applying discriminative modeling to practical biological tasks, such as large-scale optimization and overfitting prevention. A number of significant challenges, however, still remain:

1. In order to apply discriminative learning techniques for structured prediction, one must start out with examples of input sequences and known desired outputs. The assumption that suitable training examples exist, however, is often either untenable or too restrictive. Experimental techniques for interrogating RNA secondary structure typically do not directly reveal the full secondary structure of the RNA molecule but rather only identify specific constraints on the space of possible secondary structures. For example, the technique of hydroxyl radical footprinting (Wang and Padgett, 1989) reveals the presence of "unprotected" regions of RNA that are not involved in base-pairing without revealing base-pairs directly. These constraints, combined with manual analysis, allow RNA biologists to deduce the most probable structure of the RNA. Statistical learning algorithms capable of utilizing this type of data would potentially be able to draw on far larger resources for training than existing databases of fully known structures.

2. One major confounding factor in computational biology applications is the non-independence of training set examples. In the problem of protein sequence alignment, for example, while the number of reference alignments that can be generated


from databases is effectively limitless given that the space of known proteins is so large, many of these alignments will share proteins that are evolutionarily related at some level. To ensure independence of training examples, a common strategy is to pick a threshold for the maximum allowed pairwise similarity between sequences from any two separate training examples in the training set; if this threshold is chosen to be too high, then the training examples cannot be regarded as truly independent, but if the threshold is too low, then the training set size may be drastically reduced.

One possible solution is to accept the reduction in training data and focus on constructing carefully regularized learning algorithms well suited to the low data regime (see Chapter 3). Another possibility is to ignore non-independence during training time, and hope that the effects of non-independence will be outweighed by the benefits of increased training data. More speculatively, one might imagine that an appropriately designed relational learning technique, which deals with non-independent training examples by explicitly modeling their dependence, could provide a different formalism for addressing this difficulty (Getoor and Taskar, 2007).

3. In some computational biology applications, performing inference for a given statistical model is a computationally difficult problem on its own. In the RAF program (see Chapter 6), we dealt with the computational intractability of exact inference in the simultaneous alignment and folding problem by sacrificing the probabilistic interpretation of our model and relying on heuristics in order to reduce complexity. In other real world problems, however, such tricks are not always available, in which case inference itself can be a limiting factor in running a learning algorithm. Flannick et al. (2008), for instance, consider the problem of protein network alignment, where the inference task is a generalization of the subgraph isomorphism problem, known to be NP-complete (Garey and Johnson, 1979). Here, heuristic approximation techniques are required for inference; understanding the effects of using this approximate inference on learning is still an active area of research (Kulesza and Pereira, 2008; Wainwright, 2006; Finley and Joachims, 2008).

4. Models learned using a discriminative statistical learning algorithm, while useful for generating high accuracy predictions, do not always yield easily interpretable


results. For example, in the RNA secondary structure prediction problem, parameters in the CONTRAfold model differ from those used in thermodynamics-based models in that they cannot be interpreted as free energies directly (see Chapter 4). Andronescu et al. (2007) explored the construction of hybrid max-margin-like models that enjoy the advantages of statistical models while retaining thermodynamic interpretability. Though the results presented in that work seem to indicate that such hybrid models fare worse in terms of generalization performance than the CONTRAfold technique when using comparable amounts of training data, this effect is less pronounced at larger training set sizes, and such models nonetheless provide a useful first step towards combining the power of statistical and thermodynamic models.

Each of these challenges provides an opportunity for further exploration. With the rapid growth of interest in structured methods in computational biology, it would hardly be surprising if effective solutions are found for some or all of these problems within the next decade.

In closing, we recall a passage from the introductory chapter of the book Convex Optimization, where Boyd and Vandenberghe (2004) discuss the position of least squares optimization problems in the pantheon of convex optimization problems, and in particular focus on the practical role played by least squares methodologies from the perspective of an end-user. They write:

[. . . ] in the vast majority of cases, we can say that existing [least-squares] methods are very effective, and extremely reliable. Indeed, we can say that solving least-squares problems (that are not on the boundary of what is currently achievable) is a (mature) technology, that can be reliably used by many people who do not know, and do not need to know, the details.

(Boyd and Vandenberghe, 2004, chap. 1)

Today, discriminative structured model estimation methods in computational biology, unlike least-squares problems in optimization, are far from being a technology. Besides biological domain knowledge, significant machine learning expertise is still required in order to construct models and algorithms that appropriately leverage the power of statistical learning; indeed, the "end-user" of discriminative learning methods (i.e., the computational biologist algorithm developer) must be sufficiently savvy to get his or her discriminative learning algorithms to behave properly.

Nonetheless, when statistical techniques are appropriately applied, and when effective algorithms for optimization and for managing the complexity of these models are devised, the resulting methods are more than competitive with existing alternatives. As the challenges above show, our understanding of the full potential of structured machine learning approaches and their proper application to problems in computational biology is still in its infancy. The road leading from the frontiers of research to the elevated status of "technology" is long and arduous, but the work presented in this thesis, we hope, takes some of the initial steps along the way.


Appendix A

Appendix for Chapter 4

A.1 Preliminaries

In this appendix, we describe in full the structured conditional log-linear model (structured

CLLM) used in the CONTRAfold program. We also provide detailed pseudocode explicitly

showing the dynamic programming recurrences needed to reproduce the CONTRAfold

algorithm, specifically CONTRAfold version 1.10.

Let Σ = {A, C, G, U, N} be an alphabet, and consider some string x ∈ Σ^L of length L. In the RNA secondary structure prediction problem, x represents an unfolded RNA string, and x_i refers to the i-th character of x, for i = 1, . . . , L. For ease of notation, we say that there are L + 1 positions corresponding to x: one position at each of the two ends of x, and L − 1 positions between consecutive nucleotides of x. We will assign indices ranging from 0 to L for each position. This is illustrated in Figure A.1 (reproduced here for clarity).

Let Y be the space of all possible structures of a sequence x. Structured conditional log-linear models (structured CLLMs) define the conditional probability of a structure y ∈ Y given an input RNA sequence x as

P(y | x; w) = exp(w^T F(x, y)) / ∑_{y′∈Y} exp(w^T F(x, y′)) (A.1)
            = (1/Z(x)) · exp(w^T F(x, y)) (A.2)


[Figure A.1: the sequence AGAGACUUCU with its L + 1 positions marked, from position 0 at the left end to position L at the right end; for example, nucleotide 5 lies between positions 4 and 5.]

Figure A.1: Positions in a sequence of length L = 10.

where F(x, y) ∈ R^n is an n-dimensional vector of feature counts describing x and y, w ∈ R^n is an n-dimensional vector of parameters, and Z(x) (known as the partition function of a sequence x) is a normalization constant ensuring that P(y | x; w) forms a legal probability distribution over the space of possible structures Y. In this representation, the "weight" associated with a structure y for a sequence x is exp(w^T F(x, y)). Because the logarithm of the weight is a linear function of the features F(x, y), this is typically known as the log-linear representation of a CRF.

Now, consider the following reparameterization of (A.2). For each entry w_i of w, define φ_i = exp(w_i). It follows that (A.2) may be rewritten as

P(y | x; w) = (1/Z(x)) · ∏_{i=1}^n φ_i^{F_i(x,y)} (A.3)

where F_i(x, y) is the i-th component of F(x, y). In this alternative representation, the weight associated with a structure y for a sequence x is a product, ∏_{i=1}^n φ_i^{F_i(x,y)}. We refer to this as the potential representation of a CRF, where each parameter φ_i is called a potential.

In Figure A.2, we list all of the potentials φ_i involved in scoring a structure y. Then, in Section A.2, we define the feature counts F_i(x, y) for a sequence x and its structure y. Finally, in the remaining sections, we describe the dynamic programming recurrences needed to perform inference using our probabilistic model.
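To see the equivalence of the log-linear representation (A.2) and the potential representation (A.3) numerically, consider the following small Python sketch (toy feature counts of our own invention), which computes the structure weights both ways and checks that they agree.

import numpy as np

# Toy feature-count vectors F(x, y) for three candidate structures of one sequence.
F = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 2.0]])
w = np.array([0.5, -1.0, 0.3])                  # parameter vector
phi = np.exp(w)                                 # potentials phi_i = exp(w_i)

weights_loglinear = np.exp(F @ w)               # exp(w^T F(x, y)), as in (A.2)
weights_potential = np.prod(phi ** F, axis=1)   # prod_i phi_i^{F_i(x, y)}, as in (A.3)
assert np.allclose(weights_loglinear, weights_potential)

Z = weights_loglinear.sum()                     # partition function Z(x) over Y
print(weights_loglinear / Z)                    # P(y | x; w) for each candidate y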

A.2 Basic feature set

In this section, we define the feature counts F_i(x, y) for a sequence x and a structure y. One way to do this is to give, for each potential φ_i shown in Figure A.2, a formula explicitly


φ_hairpin base      w_hairpin length[·]        φ_helix base pair(·, ·)
φ_hairpin extend    w_helix change[·]          w_helix closing(·, ·)
w_helix extend      φ_bulge length[·]          φ_single base pair stacking left((·, ·), ·)
w_multi base        φ_internal length[·]       φ_single base pair stacking right((·, ·), ·)
w_multi unpaired    φ_internal asymmetry[·]    w_terminal mismatch((·, ·), ·, ·)
w_multi paired      φ_internal full[·][·]      φ_helix stacking((·, ·), (·, ·))

Figure A.2: List of all potentials used in the CONTRAfold model.

specifying how to compute the corresponding feature F_i(x, y).

Here, we will instead define feature counts implicitly by

1. decomposing a secondary structure y into four fundamental types of substructures: hairpins, single-branched loops, helices, and multi-branched loops;

2. defining a factor¹ for each type of substructure as a product of potentials from Figure A.2;

3. defining the product ∏_{i=1}^n φ_i^{F_i(x,y)} as a product of factors for each substructure in y.

By specifying which potentials are included in the computation of the factor for each type of substructure, we thus define the feature counts F_i(x, y) implicitly as the number of times each potential φ_i is used in the product of factors for a structure y.

A.2.1 Hairpins

A hairpin is a loop with only one adjacent base pair, known as its closing base pair. For 1 ≤ i ≤ j < L, we say that a hairpin spans positions i to j if x_i and x_{j+1} form the closing base pair (see Figure A.3). For hairpins, energy-based secondary structure folding algorithms such as Mfold assign free energy increments for each of the following:

• energies corresponding to the length of the loop (i.e., a hairpin spanning positions i to j has length j − i),

¹To be clear, a factor is simply a collection of potentials that are associated with the presence of a particular secondary structure subunit in a structure y. For example, the factor associated with a hairpin loop is simply the product of the parameter potentials which are involved in "scoring" the hairpin loop.


• terminal mismatch stacking energies as a function of the closing base pair (x_i, x_{j+1}) and the first unpaired nucleotides in the loop, x_{i+1} and x_j,

• bonus free energies for loops containing specific nucleotide sequences, and

• other special cases.

CONTRAfold uses a simplified scoring model for hairpins which ignores the latter two cases. In particular, the factor ϕ_hairpin(i, j) of a hairpin spanning positions i to j is

ϕ_hairpin(i, j) = w_terminal mismatch((x_i, x_{j+1}), x_{i+1}, x_j)
                  · { w_hairpin length[j − i]                          if 0 ≤ j − i ≤ 30
                    { φ_hairpin base · (φ_hairpin extend)^{ln(j−i)}    if j − i > 30.     (A.4)

In the above expression, the first term accounts for terminal mismatches arising from the fact that (x_i, x_{j+1}) are paired, but x_{i+1} and x_j are not.² The second term scores the hairpin based on its length. For loops under size 30, potentials are read directly from a table. For longer loops, the factor above directly imitates typical energy-based scoring schemes, which estimate the free energy increment of a loop of length j − i as

a + b · ln(j − i), (A.5)

for fixed constants a and b. By analogy, we have

φ_hairpin base · (φ_hairpin extend)^{ln(j−i)}
    = exp(ln(φ_hairpin base) + ln(φ_hairpin extend) · ln(j − i)) (A.6)
    = exp(a′ + b′ · ln(j − i)) (A.7)

²Here, note that the order of the arguments is important so as to ensure that the parameters are invariant with respect to the orientation of the substructure. For example, we expect the parameter for AG stacking on top of CU to be the same as the parameter for UC stacking on top of GA.


[Figure A.3: schematic of a hairpin loop closed by the base pair (x_i, x_{j+1}).]

Figure A.3: A hairpin loop of length 6 spanning positions i to j.

[Figure A.4: schematic of a single-branched loop with external closing pair (x_i, x_{j+1}) and internal closing pair (x_{j′}, x_{i′+1}).]

Figure A.4: A single-branched (internal) loop of lengths 2 and 1 spanning positions i to i′ and j′ to j. Here, A-U is the external closing base pair and G-U is the internal closing base pair.

where

a′ = ln(φ_hairpin base) (A.8)
b′ = ln(φ_hairpin extend). (A.9)
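For concreteness, equation (A.4) translates directly into code. The sketch below is ours, not CONTRAfold source: it assumes a 1-indexed sequence x (e.g., padded with a dummy entry at index 0) and a params dictionary holding the learned potentials under hypothetical keys.

import math

def hairpin_factor(x, i, j, params):
    # Factor for a hairpin spanning positions i to j, per equation (A.4);
    # (x[i], x[j+1]) is the closing base pair, and x[i+1], x[j] are the
    # first unpaired nucleotides in the loop.
    mismatch = params['terminal_mismatch'][((x[i], x[j + 1]), x[i + 1], x[j])]
    length = j - i
    if length <= 30:
        return mismatch * params['hairpin_length'][length]
    # Long loops: imitate the a + b*ln(length) free energy extrapolation.
    return (mismatch * params['hairpin_base']
            * params['hairpin_extend'] ** math.log(length))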

A.2.2 Single-branched loops

A single-branched loop is a loop which has two adjacent base pairs. The outermost base pair is called the external closing base pair, whereas the innermost base pair is called the internal closing base pair. Suppose 1 ≤ i ≤ i′ and j′ ≤ j < L. We say that a single-branched loop spans positions i to i′ and j′ to j if x_i and x_{j+1} form the external closing base pair and x_{i′+1} and x_{j′} form the internal closing base pair. To ensure that the internal


closing base pair is well-defined, we require that i′ + 2 ≤ j′ (see Figure A.4).

A single-branched loop for which i′ = i and j = j′ is called a stacking pair. A single-branched loop for which either i′ = i or j = j′ (but not both) is called a bulge. Finally, a single-branched loop for which both i′ > i and j > j′ is called an ℓ₁ × ℓ₂ internal loop, where ℓ₁ = i′ − i and ℓ₂ = j − j′. For now, we will treat the problem of only scoring bulges and internal loops; we consider the scoring of stacking pairs separately in the next section.

Energy-based scoring methods typically score internal loops and bulges by accounting for the following:

• energies based on the total loop length, ℓ₁ + ℓ₂,

• energies based on the asymmetry in sizes of each side of the loop, |ℓ₁ − ℓ₂|,

• special corrections for highly asymmetric 1 × ℓ (or ℓ × 1) loops,

• terminal mismatch stacking energies for the external closing base pair (x_i, x_{j+1}) and its adjacent nucleotides in the loop, x_{i+1} and x_j,

• terminal mismatch stacking energies for the internal closing base pair (x_{j′}, x_{i′+1}) and its adjacent nucleotides in the loop, x_{j′+1} and x_{i′}, and

• specific free energy increments for 1 × 1, 1 × 2, and 2 × 2 interior loops as a function of their closing base pairs and the nucleotides in the loop.

For computational tractability, many programs such as Mfold limit total loop lengths of single-branched loops to a small constant c (typically, c = 30).

In CONTRAfold, the total loop length, loop asymmetry, and terminal mismatch stacking interaction terms are retained. The special corrections for asymmetric interior loops are replaced with a more general two-dimensional table for scoring ℓ₁ × ℓ₂ interior loops. Finally, the large lookup tables which exhaustively characterize the energies of all 1 × 1, 1 × 2, and 2 × 2 interior loops are omitted.

Specifically, for all 1 ≤ i ≤ i′ and j′ ≤ j ≤ L − 1 such that i′ + 2 ≤ j′ and


[Figure A.5: schematic of a helix with external closing pair (x_{i+1}, x_j) and internal closing pair (x_{i+ℓ}, x_{j−ℓ+1}).]

Figure A.5: A helix of length ℓ = 6 spanning positions i to j.

1 ≤ i′ − i + j − j′ ≤ c, the factor w_single(i, j, i′, j′) for a bulge or internal loop is given by

w_single(i, j, i′, j′) =
    { φ_bulge length[i′ − i + j − j′]                        if i′ − i = 0 or j − j′ = 0
    { φ_internal length[i′ − i + j − j′]
    {     · φ_internal asymmetry[|(i′ − i) − (j − j′)|]
    {     · φ_internal full[i′ − i][j − j′]                  if i′ > i and j > j′
    · w_terminal mismatch((x_i, x_{j+1}), x_{i+1}, x_j)
    · w_terminal mismatch((x_{j′}, x_{i′+1}), x_{j′+1}, x_{i′}).    (A.10)

Like most energy-based methods, we use c = 30 for computational tractability.
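Under the same conventions as the hairpin sketch, (A.10) might be coded as follows (again our own hypothetical transcription, with the case split between bulges and internal loops and the two terminal mismatch terms applied as written above).

def single_factor(x, i, j, ip, jp, params):
    # Factor for a bulge or internal loop spanning positions i to i' and
    # j' to j, per equation (A.10); ip and jp stand for i' and j'.
    l1, l2 = ip - i, j - jp
    if l1 == 0 or l2 == 0:                      # bulge
        f = params['bulge_length'][l1 + l2]
    else:                                       # internal loop
        f = (params['internal_length'][l1 + l2]
             * params['internal_asymmetry'][abs(l1 - l2)]
             * params['internal_full'][l1][l2])
    return (f
            * params['terminal_mismatch'][((x[i], x[j + 1]), x[i + 1], x[j])]
            * params['terminal_mismatch'][((x[jp], x[ip + 1]), x[jp + 1], x[ip])])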

A.2.3 Helices

A single-branched loop for which i′ = i and j = j′ is known as a stacking pair. A sequence of one or more consecutive stacking pairs is called a helix (or stem); informally then, a helix consists of several consecutive nucleotides of an RNA molecule directly base pairing to a set of consecutive nucleotides which appear later in the RNA sequence.

Now, consider a helix that matches nucleotides x_{i+1} x_{i+2} . . . x_{i+ℓ} in a sequence x to nucleotides x_{j−ℓ+1} x_{j−ℓ+2} . . . x_j which appear later in the sequence. We say that this is a helix of length ℓ starting at positions i and j. Nucleotides x_{i+1} and x_j form the external closing base pair of the helix, whereas nucleotides x_{i+ℓ} and x_{j−ℓ+1} form the internal closing


base pair (see Figure A.5).

Traditional energy-based methods such as Mfold score helices using

• a sum of interaction terms for each stacking pair, and

• penalties for each non-GC terminal closing base pair.

Since stacking pair interaction terms are based on the nearest neighbor model, only Watson-Crick and wobble GU base pairs are allowed; other pairings are necessarily treated as small symmetric interior loops.

CONTRAfold extends on traditional energy-based methods by including penalties for all possible closing base pairs (not just the "canonical" pairings). CONTRAfold also considers the interaction of every pair of bases in the stem, rather than ignoring the non-canonical/non-GU base pairs which are not found in the regular nearest neighbor energy rules. Finally, CONTRAfold includes scores for helix lengths, allowing arbitrary scores for helix lengths of at most d (in practice, we set d = 5), and assigning affine scores for helices of length greater than d.

In particular, for 0 ≤ i ≤ i + 2ℓ + 2 ≤ j ≤ L, the factor w_helix(i, j, ℓ) for a helix of length ℓ starting at i and j is:

w_helix(i, j, ℓ) = w_helix closing(x_{i+1}, x_j)
                   · w_helix closing(x_{j−ℓ+1}, x_{i+ℓ})
                   · ∏_{k=1}^ℓ φ_helix base pair(x_{i+k}, x_{j−k+1})
                   · ∏_{k=1}^{ℓ−1} φ_helix stacking((x_{i+k}, x_{j−k+1}), (x_{i+k+1}, x_{j−k}))
                   · W_helix length(ℓ), (A.11)

where

W_helix length(ℓ) = ( ∏_{i=1}^{min(d,ℓ)} w_helix change[i] ) · (w_helix extend)^{max(ℓ−d, 0)}. (A.12)


[Figure A.6: schematic of a multi-branched loop with external closing pair (x_i, x_{j+1}) and internal closing pairs at positions (i₁, j₁) and (i₂, j₂).]

Figure A.6: A multi-branched loop spanning positions i to i₁, j₁ to i₂, and j₂ to j.

In this formulation, w_helix closing(x_i, x_j) scores the use of a particular base pair for closing a helix. Similarly, φ_helix stacking((x_i, x_j), (x_{i+1}, x_{j−1})) scores the interaction for stacking (x_{i+1}, x_{j−1}) on top of (x_i, x_j). Finally, the helix length score W_helix length(ℓ) is designed so that the length component of the score for any helix of length ℓ ≤ d is given explicitly as

(w_helix change[1]) · (w_helix change[2]) · . . . · (w_helix change[ℓ]), (A.13)

and helices of length ℓ > d have a correction potential of w_helix extend applied for each additional base pair.
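The helix score assembles the same way in code; the sketch below (ours, with hypothetical key names) implements the length score of (A.12) and the full factor of (A.11).

def helix_length_score(l, d, params):
    # W_helix_length(l) per (A.12): tabulated scores up to d, then one
    # extension potential per additional base pair beyond d.
    score = 1.0
    for i in range(1, min(d, l) + 1):
        score *= params['helix_change'][i]
    return score * params['helix_extend'] ** max(l - d, 0)

def helix_factor(x, i, j, l, d, params):
    # w_helix(i, j, l) per (A.11), for a helix of length l starting at i and j.
    f = (params['helix_closing'][(x[i + 1], x[j])]
         * params['helix_closing'][(x[j - l + 1], x[i + l])])
    for k in range(1, l + 1):
        f *= params['helix_base_pair'][(x[i + k], x[j - k + 1])]
    for k in range(1, l):
        f *= params['helix_stacking'][((x[i + k], x[j - k + 1]),
                                       (x[i + k + 1], x[j - k]))]
    return f * helix_length_score(l, d, params)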

A.2.4 Multi-branched loops

A multi-branched loop is a loop containing at least three adjacent base pairs. More formally, suppose i ≤ i₁ ≤ j₁ ≤ i₂ ≤ j₂ ≤ . . . ≤ i_m ≤ j_m ≤ j, where m ≥ 2 and i_k + 2 ≤ j_k for k = 1, . . . , m. We say that a multibranch loop spans positions i to i₁, j₁ to i₂, . . . , and j_m to j if nucleotides (x_i, x_{j+1}) form the external closing base pair, and (x_{j_k}, x_{i_k+1}) form the internal closing base pairs for k = 1, . . . , m (see Figure A.6).

Let the length ℓ of a multi-branched loop be the number of unpaired bases,

ℓ = i₁ − i + j − j_m + ∑_{k=2}^m (i_k − j_{k−1}). (A.14)

For computational tractability, most programs score multi-branched loops using


• energy terms dependent on the length of the loop,

• single base pair stacking energies describing the attachment of each helix to the multi-branched loop, and

• coaxial stacking terms for helices on the multi-branched loop that are separated by at most one unpaired position.

CONTRAfold uses a similar scoring scheme for multi-branched loops which ignores coaxial stacking. Specifically, if 1 ≤ i ≤ i₁ ≤ i₁ + 2 ≤ j₁ ≤ i₂ ≤ . . . ≤ j ≤ L − 1, then the factor associated with a multi-branched loop spanning positions i to i₁, j₁ to i₂, . . . , and j_m to j is

w_multi(i, j, i₁, j₁, . . . , i_m, j_m) =
    w_multi base · (w_multi unpaired)^ℓ · (w_multi paired)^{m+1}
    · ϕ_multi mismatch((x_i, x_{j+1}), x_{i+1}, x_j)
    · ∏_{k=1}^m ϕ_multi mismatch((x_{j_k}, x_{i_k+1}), x_{j_k+1}, x_{i_k}), (A.15)

where

ϕ_multi mismatch((x_i, x_{j+1}), x_{i+1}, x_j) =
    φ_single base pair stacking left((x_i, x_{j+1}), x_{i+1}) · φ_single base pair stacking right((x_i, x_{j+1}), x_j). (A.16)

This mirrors the affine energy models typically used for multi-branched loops in energy-based methods.
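Finally, (A.15) and (A.16) can be transcribed as below (our sketch; branches lists the internal closing-pair coordinates (i_k, j_k)).

def multi_mismatch(pair, left, right, params):
    # phi_multi_mismatch per (A.16): single base pair stacking on each side.
    return (params['sbp_stacking_left'][(pair, left)]
            * params['sbp_stacking_right'][(pair, right)])

def multi_factor(x, i, j, branches, params):
    # Factor for a multi-branched loop per (A.15);
    # branches = [(i1, j1), ..., (im, jm)], external closing pair (x[i], x[j+1]).
    m = len(branches)
    ell = (branches[0][0] - i + j - branches[-1][1]
           + sum(branches[k][0] - branches[k - 1][1] for k in range(1, m)))
    f = (params['multi_base']
         * params['multi_unpaired'] ** ell
         * params['multi_paired'] ** (m + 1)
         * multi_mismatch((x[i], x[j + 1]), x[i + 1], x[j], params))
    for ik, jk in branches:
        f *= multi_mismatch((x[jk], x[ik + 1]), x[jk + 1], x[ik], params)
    return f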


A.3 The Viterbi algorithm

We now specify the Viterbi algorithm for computing the most likely structure via dynamic programming recurrences. Let c be the maximum length of an internal loop or bulge.

A.3.1 Definitions

We define the following factors:

• ϕ_do outer(i), 0 ≤ i ≤ L: the best possible score for folding the substring x_{i+1} x_{i+2} · · · x_L, assuming that the ends of this substring belong to the exterior loop of the RNA.

• ϕ_do helix(i, j, n), 0 ≤ i ≤ j ≤ L:

  – 0 ≤ n < d: the best possible score for folding the substring x_{i+1} x_{i+2} · · · x_j, assuming that exactly n letters on each side of the substring are paired in a helix, i.e., (x_i, x_{j+1}), (x_{i−1}, x_{j+2}), . . . , (x_{i−n+1}, x_{j+n}) all form base pairs, but x_{i−n} and x_{j+n+1} do not base pair.

  – n = d: the best possible score for folding the substring x_{i+1} x_{i+2} · · · x_j, assuming that at least d letters on each side of the substring are paired in a helix, i.e., (x_i, x_{j+1}), (x_{i−1}, x_{j+2}), . . . , (x_{i−d+1}, x_{j+d}) all form base pairs, and possibly more.

• ϕ_do multi(i, j, n), 0 ≤ i ≤ j ≤ L:

  – 0 ≤ n < 2: the best possible score for folding the substring x_{i+1} x_{i+2} · · · x_j, assuming that the substring is part of a multibranch loop that contains exactly n adjacent helices besides the exterior helix.

  – n = 2: the best possible score for folding the substring x_{i+1} x_{i+2} · · · x_j, assuming that the substring is part of a multibranch loop that contains at least 2 adjacent helices besides the exterior helix.


A.3.2 Recurrences

For each of the factors described in the previous subsection, we now give the appropriate

recurrence along with a description of the cases handled by the recurrence.

Exterior loop

When generating a substring belonging to the exterior loop, there are three cases:

1. the substring is of zero length,

2. the first base of the substring belongs to the exterior loop,

3. the first base belongs to a helix that is adjacent to the exterior loop.

This gives:

ϕ_do outer(i) = max of:
    1                                                                      if i = L
    w_outer unpaired · ϕ_do outer(i + 1)                                   if 0 ≤ i < L
    max_{i′ : i+2 ≤ i′ ≤ L} (φ_outer branch · ϕ_do helix(i, i′, 0) · ϕ_do outer(i′))    if 0 ≤ i ≤ L.

Note that in the last case, we require that i + 2 ≤ i′ so as to ensure that the definition of ϕ_do outer(i) is not circular (actually, it would suffice to require that i < i′; however, the requirement we make here works as well, since a helix must contain at least two base pairs).

Helix

To generate a helix for the substring x_{i+1} x_{i+2} · · · x_j, there are several cases:

1. no surrounding positions belong to the helix yet and (x_{i+1}, x_j) base pair,

2. n surrounding positions belong to the helix (where 0 < n < d) and (x_{i+1}, x_j) base pair,

3. at least d surrounding positions belong to the helix and (x_{i+1}, x_j) base pair,


4. at least one surrounding position belongs to the helix and x_{i+1} x_{i+2} · · · x_j form a hairpin loop,

5. at least one surrounding position belongs to the helix and x_{i+1} x_{i+2} · · · x_j form the beginning of a single-branched loop,

6. at least one surrounding position belongs to the helix and x_{i+1} x_{i+2} · · · x_j form the beginning of a multi-branched loop.

This gives:

ϕ_do helix(i, j, n) = max of:
    w_helix change[1] · w_helix closing(x_{i+1}, x_j)
        · φ_helix base pair(x_{i+1}, x_j) · ϕ_do helix(i + 1, j − 1, 1)        if 0 ≤ i < i + 2 ≤ j ≤ L and n = 0
    w_helix change[n + 1] · φ_helix stacking((x_i, x_{j+1}), (x_{i+1}, x_j))
        · φ_helix base pair(x_{i+1}, x_j) · ϕ_do helix(i + 1, j − 1, n + 1)    if 0 < i < i + 2 ≤ j < L and 0 < n < d
    w_helix extend · φ_helix stacking((x_i, x_{j+1}), (x_{i+1}, x_j))
        · φ_helix base pair(x_{i+1}, x_j) · ϕ_do helix(i + 1, j − 1, d)        if 0 < i < i + 2 ≤ j < L and n = d
    w_helix closing(x_{j+1}, x_i) · ϕ_do loop(i, j)                            if 0 < i ≤ j < L and n > 0

Here, note that whenever a case depends on (x_i, x_{j+1}), we ensure that 0 < i and j < L. Also, if a case depends on x_{i+1} and x_j, we ensure that i + 2 ≤ j.

Loop

To generate a loop for the substring x_{i+1} x_{i+2} · · · x_j, there are several cases:

1. x_{i+1} x_{i+2} · · · x_j form a hairpin loop,

2. x_{i+1} x_{i+2} · · · x_j form the beginning of a single-branched loop,

3. x_{i+1} x_{i+2} · · · x_j form the beginning of a multi-branched loop.


This gives:

$$
\varphi_{\text{do loop}}(i, j) = \max
\begin{cases}
\varphi_{\text{hairpin}}(i, j) & \text{if } 0 < i \le j < L \text{ and } n > 0 \\[4pt]
\displaystyle\max_{\substack{i', j' : \; i \le i' < i'+2 \le j' \le j \\ 1 \le i'-i+j-j' \le c}} \bigl( w_{\text{single}}(i, j, i', j') \cdot \varphi_{\text{do helix}}(i', j', 0) \bigr) & \text{if } 0 < i \le j < L \text{ and } n > 0 \\[4pt]
w_{\text{multi base}} \cdot w_{\text{multi paired}} \cdot \varphi_{\text{multi mismatch}}((x_i, x_{j+1}), x_{i+1}, x_j) \cdot \varphi_{\text{do multi}}(i, j, 0) & \text{if } 0 < i \le i+2 \le j < L \text{ and } n > 0.
\end{cases}
$$

Note that in the case of single-branched loops, $i' + 2 \le j'$ since the inner helix must have at least one base pairing, and $1 \le i' - i + j - j' \le c$ to ensure that the loop has length at least 1, but no more than $c$ (for efficiency).

Multi-branched loops

To generate a multi-branched loop for the substring $x_{i+1}x_{i+2}\cdots x_j$, there are several cases:

1. the substring is of length zero and has at least 2 adjacent helices (other than the exterior helix),

2. the first letter of the substring is unpaired,

3. the first letter of the substring belongs to a helix that is adjacent to the multi-branch loop and fewer than 2 adjacent helices (other than the exterior helix) have been generated already,

4. the first letter of the substring belongs to a helix that is adjacent to the multi-branch loop and at least 2 adjacent helices (other than the exterior helix) have been generated already.


From this, we obtain

$$
\varphi_{\text{do multi}}(i, j, n) = \max
\begin{cases}
1 & \text{if } 0 \le i = j \le L \text{ and } n = 2 \\[4pt]
w_{\text{multi unpaired}} \cdot \varphi_{\text{do multi}}(i+1, j, n) & \text{if } 0 \le i < j \le L \text{ and } 0 \le n \le 2 \\[4pt]
\displaystyle\max_{j' : \; i+2 \le j' \le j} \bigl( w_{\text{multi paired}} \cdot \varphi_{\text{multi mismatch}}((x_{j'}, x_{i+1}), x_{j'+1}, x_i) \cdot \varphi_{\text{do helix}}(i, j', 0) \cdot \varphi_{\text{do multi}}(j', j, \min(2, n+1)) \bigr) & \text{if } 0 < i \le j < L \text{ and } 0 \le n \le 2
\end{cases}
$$

As before, in the last case, the condition $i + 2 \le j'$ ensures that $x_{j'}$ and $x_{i+1}$ are valid, and the conditions $0 < i$ and $j < L$ ensure that $x_{j'+1}$ and $x_i$ are valid.


A.4 The inside algorithm

The inside algorithm looks just like Viterbi, with $\max$'s replaced by $\sum$'s. We repeat these recurrences here for convenience:

For $0 \le i \le L$,

$$
\alpha_{\text{do outer}}(i) = \sum
\begin{cases}
1 & \text{if } i = L \\
w_{\text{outer unpaired}} \cdot \alpha_{\text{do outer}}(i+1) & \text{if } 0 \le i < L \\
\displaystyle\sum_{i' : \, i+2 \le i' \le L} \bigl( \phi_{\text{outer branch}} \cdot \alpha_{\text{do helix}}(i, i', 0) \cdot \alpha_{\text{do outer}}(i') \bigr) & \text{if } 0 \le i \le L
\end{cases}
$$

For $0 \le n \le d$ and $0 \le i \le j \le L$,

$$
\alpha_{\text{do helix}}(i, j, n) = \sum
\begin{cases}
w_{\text{helix change}}[1] \cdot w_{\text{helix closing}}(x_{i+1}, x_j) \cdot \phi_{\text{helix base pair}}(x_{i+1}, x_j) \cdot \alpha_{\text{do helix}}(i+1, j-1, 1) \\
\qquad \text{if } 0 \le i < i+2 \le j \le L \text{ and } n = 0 \\[4pt]
w_{\text{helix change}}[n+1] \cdot \phi_{\text{helix stacking}}((x_i, x_{j+1}), (x_{i+1}, x_j)) \cdot \phi_{\text{helix base pair}}(x_{i+1}, x_j) \cdot \alpha_{\text{do helix}}(i+1, j-1, n+1) \\
\qquad \text{if } 0 < i < i+2 \le j < L \text{ and } 0 < n < d \\[4pt]
w_{\text{helix extend}} \cdot \phi_{\text{helix stacking}}((x_i, x_{j+1}), (x_{i+1}, x_j)) \cdot \phi_{\text{helix base pair}}(x_{i+1}, x_j) \cdot \alpha_{\text{do helix}}(i+1, j-1, d) \\
\qquad \text{if } 0 < i < i+2 \le j < L \text{ and } n = d \\[4pt]
w_{\text{helix closing}}(x_{j+1}, x_i) \cdot \alpha_{\text{do loop}}(i, j) \\
\qquad \text{if } 0 < i \le j < L \text{ and } n > 0
\end{cases}
$$

For $0 \le i \le j \le L$,

$$
\alpha_{\text{do loop}}(i, j) = \sum
\begin{cases}
\varphi_{\text{hairpin}}(i, j) & \text{if } 0 < i \le j < L \text{ and } n > 0 \\[4pt]
\displaystyle\sum_{\substack{i', j' : \; i \le i' < i'+2 \le j' \le j \\ 1 \le i'-i+j-j' \le c}} \bigl( w_{\text{single}}(i, j, i', j') \cdot \alpha_{\text{do helix}}(i', j', 0) \bigr) & \text{if } 0 < i \le j < L \text{ and } n > 0 \\[4pt]
w_{\text{multi base}} \cdot w_{\text{multi paired}} \cdot \varphi_{\text{multi mismatch}}((x_i, x_{j+1}), x_{i+1}, x_j) \cdot \alpha_{\text{do multi}}(i, j, 0) & \text{if } 0 < i \le i+2 \le j < L \text{ and } n > 0.
\end{cases}
$$


For $0 \le n \le 2$ and $0 \le i \le j \le L$,

$$
\alpha_{\text{do multi}}(i, j, n) = \sum
\begin{cases}
1 & \text{if } 0 \le i = j \le L \text{ and } n = 2 \\[4pt]
w_{\text{multi unpaired}} \cdot \alpha_{\text{do multi}}(i+1, j, n) & \text{if } 0 \le i < j \le L \text{ and } 0 \le n \le 2 \\[4pt]
\displaystyle\sum_{j' : \; i+2 \le j' \le j} \bigl( w_{\text{multi paired}} \cdot \varphi_{\text{multi mismatch}}((x_{j'}, x_{i+1}), x_{j'+1}, x_i) \cdot \alpha_{\text{do helix}}(i, j', 0) \cdot \alpha_{\text{do multi}}(j', j, \min(2, n+1)) \bigr) & \text{if } 0 < i \le j < L \text{ and } 0 \le n \le 2
\end{cases}
$$
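Since the inside recurrences are the Viterbi recurrences with $\max$ replaced by $\sum$, the two can share one implementation parameterized by the accumulation operator. Below is a hedged sketch of this idea for the exterior-loop recurrence only, reusing the illustrative names from the Viterbi sketch in Section A.3:

    from functools import lru_cache

    def outer_recurrence(L, w_outer_unpaired, phi_outer_branch, score_helix,
                         use_viterbi=True):
        """Computes phi_do_outer(0) if use_viterbi, else alpha_do_outer(0);
        the only difference is whether candidate scores are maximized or summed."""
        combine = max if use_viterbi else sum

        @lru_cache(maxsize=None)
        def do_outer(i):
            if i == L:
                return 1.0
            terms = [w_outer_unpaired * do_outer(i + 1)]
            for ip in range(i + 2, L + 1):
                terms.append(phi_outer_branch * score_helix(i, ip) * do_outer(ip))
            return combine(terms)

        return do_outer(0)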


A.5 The outside algorithm

The outside algorithm corresponding to the inside algorithm given in the previous section

is shown below:

For $0 \le i \le L$,

$$
\beta_{\text{do outer}}(i) = \sum
\begin{cases}
1 & \text{if } i = 0 \\
w_{\text{outer unpaired}} \cdot \beta_{\text{do outer}}(i-1) & \text{if } i > 0 \\
\displaystyle\sum_{i' : \, 0 \le i' \le i'+2 \le i} \bigl( \phi_{\text{outer branch}} \cdot \alpha_{\text{do helix}}(i', i, 0) \cdot \beta_{\text{do outer}}(i') \bigr)
\end{cases}
$$

For $0 \le n \le d$ and $0 \le i \le j \le L$,

$$
\beta_{\text{do helix}}(i, j, n) = \sum
\begin{cases}
\phi_{\text{outer branch}} \cdot \beta_{\text{do outer}}(i) \cdot \alpha_{\text{do outer}}(j) \\
\qquad \text{if } 0 \le i < i+2 \le j \le L \text{ and } n = 0 \\[4pt]
\displaystyle\sum_{\substack{i', j' : \; 0 < i' \le i < j \le j' < L \\ 1 \le i-i'+j'-j \le c}} \bigl( w_{\text{single}}(i', j', i, j) \cdot \beta_{\text{do loop}}(i', j') \bigr) \\
\qquad \text{if } 0 < i < i+2 \le j < L \text{ and } n = 0 \\[4pt]
\displaystyle\sum_{n'=0}^{1} \; \sum_{j' : \; j \le j' < L} w_{\text{multi paired}} \cdot \beta_{\text{do multi}}(i, j', n') \cdot \varphi_{\text{multi mismatch}}((x_j, x_{i+1}), x_{j+1}, x_i) \cdot \alpha_{\text{do multi}}(j, j', n'+1) \\
\qquad \text{if } 0 < i \le j < L \text{ and } n = 0 \\[4pt]
\displaystyle\sum_{j' : \; j \le j' < L} w_{\text{multi paired}} \cdot \beta_{\text{do multi}}(i, j', 2) \cdot \varphi_{\text{multi mismatch}}((x_j, x_{i+1}), x_{j+1}, x_i) \cdot \alpha_{\text{do multi}}(j, j', 2) \\
\qquad \text{if } 0 < i \le j < L \text{ and } n = 0 \\[4pt]
w_{\text{helix change}}[1] \cdot w_{\text{helix closing}}(x_i, x_{j+1}) \cdot \phi_{\text{helix base pair}}(x_i, x_{j+1}) \cdot \beta_{\text{do helix}}(i-1, j+1, 0) \\
\qquad \text{if } 0 < i \le j < L \text{ and } n = 1 \\[4pt]
w_{\text{helix change}}[n] \cdot \phi_{\text{helix stacking}}((x_{i-1}, x_{j+2}), (x_i, x_{j+1})) \cdot \phi_{\text{helix base pair}}(x_i, x_{j+1}) \cdot \beta_{\text{do helix}}(i-1, j+1, n-1) \\
\qquad \text{if } 1 < i \le j < L-1 \text{ and } 1 < n \le d \\[4pt]
w_{\text{helix extend}} \cdot \phi_{\text{helix stacking}}((x_{i-1}, x_{j+2}), (x_i, x_{j+1})) \cdot \phi_{\text{helix base pair}}(x_i, x_{j+1}) \cdot \beta_{\text{do helix}}(i-1, j+1, d) \\
\qquad \text{if } 1 < i \le j < L-1 \text{ and } n = d
\end{cases}
$$


For $0 \le i \le j \le L$,

$$
\beta_{\text{do loop}}(i, j) = \sum_{n'=1}^{d} w_{\text{helix closing}}(x_{j+1}, x_i) \cdot \beta_{\text{do helix}}(i, j, n') \qquad \text{if } 0 < i \le j < L \text{ and } n > 0
$$

For $0 \le n \le 2$ and $0 \le i \le j \le L$,

$$
\beta_{\text{do multi}}(i, j, n) = \sum
\begin{cases}
w_{\text{multi base}} \cdot w_{\text{multi paired}} \cdot \varphi_{\text{multi mismatch}}((x_i, x_{j+1}), x_{i+1}, x_j) \cdot \beta_{\text{do loop}}(i, j) \\
\qquad \text{if } 0 < i < i+2 \le j < L \text{ and } n = 0 \\[4pt]
w_{\text{multi unpaired}} \cdot \beta_{\text{do multi}}(i-1, j, n) \\
\qquad \text{if } 0 < i \le j \le L \text{ and } 0 \le n \le 2 \\[4pt]
\displaystyle\sum_{i' : \; 1 \le i' < i'+2 \le i} w_{\text{multi paired}} \cdot \varphi_{\text{multi mismatch}}((x_i, x_{i'+1}), x_{i+1}, x_{i'}) \cdot \alpha_{\text{do helix}}(i', i, 0) \cdot \beta_{\text{do multi}}(i', j, n-1) \\
\qquad \text{if } 2 < i \le j < L \text{ and } 1 \le n \le 2 \\[4pt]
\displaystyle\sum_{i' : \; 1 \le i' < i'+2 \le i} w_{\text{multi paired}} \cdot \varphi_{\text{multi mismatch}}((x_i, x_{i'+1}), x_{i+1}, x_{i'}) \cdot \alpha_{\text{do helix}}(i', i, 0) \cdot \beta_{\text{do multi}}(i', j, 2) \\
\qquad \text{if } 2 < i \le j < L \text{ and } n = 2.
\end{cases}
$$


A.6 Posterior decoding

Given the inside and outside matrices computed in the previous sections, we can now compute the posterior probabilities for paired and unpaired residues. Specifically, the posterior probability $p_{ij}$ that nucleotide $i$ pairs with nucleotide $j$ (where $1 \le i < j \le L$) is given by

$$
p_{ij} = \frac{1}{Z(x)} \sum
\begin{cases}
w_{\text{helix change}}[1] \cdot w_{\text{helix closing}}(x_i, x_j) \cdot \phi_{\text{helix base pair}}(x_i, x_j) \cdot \alpha_{\text{do helix}}(i, j-1, 1) \cdot \beta_{\text{do helix}}(i-1, j, 0) \\
\qquad \text{if } 1 \le i < j \le L \text{ and } n = 0 \\[4pt]
\displaystyle\sum_{n=2}^{d} w_{\text{helix change}}[n] \cdot \phi_{\text{helix stacking}}((x_{i-1}, x_{j+1}), (x_i, x_j)) \cdot \phi_{\text{helix base pair}}(x_i, x_j) \cdot \alpha_{\text{do helix}}(i, j-1, n) \cdot \beta_{\text{do helix}}(i-1, j, n-1) \\
\qquad \text{if } 1 < i < j < L \\[4pt]
w_{\text{helix extend}} \cdot \phi_{\text{helix stacking}}((x_{i-1}, x_{j+1}), (x_i, x_j)) \cdot \phi_{\text{helix base pair}}(x_i, x_j) \cdot \alpha_{\text{do helix}}(i, j-1, d) \cdot \beta_{\text{do helix}}(i-1, j, d) \\
\qquad \text{if } 1 < i < j < L
\end{cases}
\tag{A.17}
$$

where

$$
Z(x) = \alpha_{\text{do outer}}(0) = \beta_{\text{do outer}}(L). \tag{A.18}
$$

Using these posterior probabilities, the posterior decoding algorithm can be used to find the

maximum expected accuracy parse.
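As a concrete illustration, the following sketch assembles the pairing posteriors of (A.17) from precomputed inside and outside tables; alpha_helix, beta_helix, and the w/phi factor dictionaries are illustrative stand-ins, and only the cases valid at each $(i, j)$ are accumulated:

    def pairing_posteriors(L, d, alpha_helix, beta_helix, Z, w, phi):
        """p[i][j] = posterior probability that nucleotides i and j pair,
        following (A.17)-(A.18); indices are 1-based as in the text."""
        p = [[0.0] * (L + 1) for _ in range(L + 1)]
        for i in range(1, L + 1):
            for j in range(i + 1, L + 1):
                # helix-opening case
                total = (w['helix_change'][1] * w['helix_closing'](i, j)
                         * phi['helix_base_pair'](i, j)
                         * alpha_helix(i, j - 1, 1) * beta_helix(i - 1, j, 0))
                if 1 < i and j < L:
                    # helix-interior and helix-extension cases
                    stack = (phi['helix_stacking'](i - 1, j + 1, i, j)
                             * phi['helix_base_pair'](i, j))
                    for n in range(2, d + 1):
                        total += (w['helix_change'][n] * stack
                                  * alpha_helix(i, j - 1, n)
                                  * beta_helix(i - 1, j, n - 1))
                    total += (w['helix_extend'] * stack
                              * alpha_helix(i, j - 1, d)
                              * beta_helix(i - 1, j, d))
                p[i][j] = total / Z
        return p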


A.7 Gradient

The gradient of the CONTRAfold conditional log-likelihood objective with respect to the parameters $\mathbf{w}$ is

$$
\nabla_{\mathbf{w}} \, \ell(\mathbf{w} : \mathcal{D}) = \sum_{i=1}^{m} \Bigl( \mathbf{F}(x^{(i)}, y^{(i)}) - \mathbf{E}_{y' \sim P(y \mid x^{(i)}; \mathbf{w})} \bigl[ \mathbf{F}(x^{(i)}, y') \bigr] \Bigr),
$$

where the expectation is taken with respect to the conditional distribution over structures $y'$ for the sequence $x^{(i)}$ given by the current parameters $\mathbf{w}$. We now describe how to construct a dynamic programming algorithm for computing the expectation $\mathbf{E}_{y' \sim P(y \mid x^{(i)}; \mathbf{w})}[\mathbf{F}(x^{(i)}, y')]$ by modifying an implementation of the inside recurrences from Section A.4.

First, initialize a vector $\mathbf{z} \in \mathbb{R}^n$ to the zero vector. In a typical implementation of the inside algorithm, computing the entries of the inside table involves repetitions of statements of the form

$$
\alpha_a(i, j) \leftarrow \alpha_a(i, j) + (\text{product of some } \phi\text{'s}) \cdot (\text{product of some } \alpha_{a'}(i', j')\text{'s}).
$$

We will replace each such statement with several statements, one for each $\phi_k$ appearing in the product above. Specifically, for each $\phi_k$ in the product, we will create a statement of the form

$$
z_k \leftarrow z_k + \frac{\beta_a(i, j) \cdot (\text{product of some } \phi\text{'s}) \cdot (\text{product of some } \alpha_{a'}(i', j')\text{'s})}{Z(x)},
$$

where $Z(x) = \alpha_{\text{do outer}}(0)$. At the end of this modified inside algorithm, then, the vector $\mathbf{z}$ will contain the desired feature expectations. For example, the result of applying this transformation to the rules of the $\alpha_{\text{do outer}}$ recurrence is shown in Algorithm 5.


Algorithm 5 CONTRAfold gradient computation.

Initialize $\mathbf{z} \leftarrow \mathbf{0}$.
for $i \leftarrow 0, \ldots, L$ do
    if $i < L$ then
        $z_{\text{outer unpaired}} \leftarrow z_{\text{outer unpaired}} + \beta_{\text{do outer}}(i) \cdot w_{\text{outer unpaired}} \cdot \alpha_{\text{do outer}}(i+1) \,/\, Z(x)$.
    end if
    for $i' \leftarrow i+2, \ldots, L$ do
        $z_{\text{outer branch}} \leftarrow z_{\text{outer branch}} + \beta_{\text{do outer}}(i) \cdot \phi_{\text{outer branch}} \cdot \alpha_{\text{do helix}}(i, i', 0) \cdot \alpha_{\text{do outer}}(i') \,/\, Z(x)$.
    end for
end for
return $\mathbf{z}$.
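The same transformation is easy to express in code. The sketch below mirrors Algorithm 5 for the $\alpha_{\text{do outer}}$ rules, with illustrative names (beta_outer, alpha_outer, and alpha_helix stand for the corresponding tables, and the returned dictionary plays the role of $\mathbf{z}$):

    def contrafold_gradient_outer(L, w_outer_unpaired, phi_outer_branch,
                                  alpha_outer, alpha_helix, beta_outer):
        """Accumulates expected feature counts for the two exterior-loop
        features by pairing each inside update with its outside weight."""
        Z = alpha_outer(0)  # partition function, Z(x) = alpha_do_outer(0)
        z = {'outer_unpaired': 0.0, 'outer_branch': 0.0}
        for i in range(0, L + 1):
            if i < L:
                z['outer_unpaired'] += (beta_outer(i) * w_outer_unpaired
                                        * alpha_outer(i + 1)) / Z
            for ip in range(i + 2, L + 1):
                z['outer_branch'] += (beta_outer(i) * phi_outer_branch
                                      * alpha_helix(i, ip, 0)
                                      * alpha_outer(ip)) / Z
        return z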


Appendix B

Appendix for Chapter 6

B.1 RAF features

The features used by the RAF program, as evaluated in this thesis, consist of alignment features, $\phi^{\text{aligned}}_p(i, k)$, and pairing features, $\phi^{\text{paired}}_q(i, j; k, l)$. Specifically, the alignment features $\phi^{\text{aligned}}(i, k) \in \mathbb{R}^4$ for a candidate alignment match $(a_i, b_k)$ are

$$
\begin{pmatrix}
P(a_i \text{ aligns with } b_k) \\
P(a_i \text{ aligns with } b_k)^2 \\
\log P(a_i \text{ aligns with } b_k) \\
-\log P(a_i \text{ unaligned}) - \log P(b_k \text{ unaligned})
\end{pmatrix}. \tag{B.1}
$$

The pairing features, $\phi^{\text{paired}}(i, j; k, l) \in \mathbb{R}^4$, for a conserved base-pairing $\langle (a_i, a_j), (b_k, b_l) \rangle$ are given by $\phi^{\text{paired}}(i, j; k, l) = \phi^{\text{paired}}(a_i, a_j) + \phi^{\text{paired}}(b_k, b_l)$. In turn, $\phi^{\text{paired}}(a_i, a_j) \in \mathbb{R}^4$ is given by

$$
\begin{pmatrix}
P(a_i \text{ pairs with } a_j) \\
P(a_i \text{ pairs with } a_j)^2 \\
\log P(a_i \text{ pairs with } a_j) \\
-\log P(a_i \text{ unpaired}) - \log P(a_j \text{ unpaired})
\end{pmatrix}, \tag{B.2}
$$

and similarly for $\phi^{\text{paired}}(b_k, b_l)$. Thus, the model contains a total of 8 features whose weights must be learned. Here, the posterior probabilities for aligned positions and base-pairing positions are computed using the CONTRAlign (Do et al., 2006a) and CONTRAfold (Do et al., 2006b) programs, respectively.
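For illustration, here is a small sketch that assembles these two feature vectors from posterior probabilities; p_align, p_pair, and the "unaligned"/"unpaired" callables are illustrative stand-ins for the CONTRAlign and CONTRAfold posteriors:

    import math

    def aligned_features(i, k, p_align, p_unaligned_a, p_unaligned_b):
        """The four alignment features of (B.1) for candidate match (a_i, b_k)."""
        p = p_align(i, k)
        return [p,
                p * p,
                math.log(p),
                -math.log(p_unaligned_a(i)) - math.log(p_unaligned_b(k))]

    def paired_features_single(i, j, p_pair, p_unpaired):
        """The four single-sequence pairing features of (B.2)."""
        p = p_pair(i, j)
        return [p,
                p * p,
                math.log(p),
                -math.log(p_unpaired(i)) - math.log(p_unpaired(j))]

    def paired_features(i, j, k, l, pa, pua, pb, pub):
        """Full pairing feature vector: elementwise sum over the two sequences."""
        fa = paired_features_single(i, j, pa, pua)
        fb = paired_features_single(k, l, pb, pub)
        return [x + y for x, y in zip(fa, fb)]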

B.2 The RAF inference engine

In this section, we describe the RAF inference engine for fast approximate simultaneous

alignment and consensus folding for pairs of sequences. In particular, we first present

some exact recurrences for alignment and folding, and then use restrictions on the set of

allowed base-pairings and aligned positions to achieve an improvement in computational

complexity.

B.2.1 Recurrences

First, we describe a straightforward $O(L^6)$ dynamic programming recurrence for computing the optimal simultaneous alignment and consensus fold for a pair of sequences $a$ and $b$.

To compute the optimal parse of $a$ and $b$, we construct two four-dimensional matrices, $S$ and $D$. Here, $S_{i,j;k,l}$ denotes the optimal score for aligning and folding $a_{i+1}a_{i+2}\ldots a_j$ with $b_{k+1}b_{k+2}\ldots b_l$. Furthermore, $D_{i,j;k,l}$ denotes the optimal score for aligning and folding these same substrings, subject to the additional constraint that the outermost positions $(a_{i+1}, a_j)$ and $(b_{k+1}, b_l)$ form conserved base-pairs.

For $0 \le i \le j \le |a|$ and $0 \le k \le l \le |b|$, we have

$$
S_{i,j;k,l} = \max
\begin{cases}
0 & \text{if } i = j \text{ and } k = l \\
S_{i,j-1;k,l} & \text{if } j > i \\
S_{i,j;k,l-1} & \text{if } l > k \\
S_{i,j-1;k,l-1} + \psi^{\text{aligned}}_{\mathbf{w}}(j, l) & \text{if } j > i \text{ and } l > k \\
\displaystyle\max_{\substack{j' : \, i \le j' \le j-2 \\ l' : \, k \le l' \le l-2}} \bigl( S_{i,j';k,l'} + D_{j',j;l',l} \bigr),
\end{cases}
\tag{B.3}
$$


and for $0 \le i < i+2 \le j \le |a|$ and $0 \le k < k+2 \le l \le |b|$,

$$
D_{i,j;k,l} = S_{i+1,j-1;k+1,l-1} + \psi^{\text{paired}}_{\mathbf{w}}(i+1, j; k+1, l) + \psi^{\text{aligned}}_{\mathbf{w}}(i+1, k+1) + \psi^{\text{aligned}}_{\mathbf{w}}(j, l). \tag{B.4}
$$

Here, recurrence (B.3) takes the form of a standard Needleman-Wunsch procedure for aligning the substring $a_{i+1}a_{i+2}\ldots a_j$ with $b_{k+1}b_{k+2}\ldots b_l$, with an extra case to handle bifurcations in the base-pairing structure of the RNAs. At the end of the recurrence, $S_{0,|a|;0,|b|}$ gives the score of the optimal alignment and consensus fold of the input sequences $a$ and $b$. By using traceback pointers in the standard way, the optimal parse can be recovered easily once the recurrence has been evaluated.
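As a naive reference sketch (not the RAF implementation), the two recurrences can be evaluated in a single pass over substrings of increasing length; psi_aligned and psi_paired are illustrative stand-ins for the scoring factors:

    def raf_exact(a, b, psi_aligned, psi_paired):
        """Naive evaluation of recurrences (B.3)-(B.4); O(L^6) time.

        S[(i, j, k, l)] scores aligning/folding a_{i+1}..a_j with b_{k+1}..b_l;
        psi_aligned(j, l) and psi_paired(i, j, k, l) use 1-based positions."""
        A, B = len(a), len(b)
        S, D = {}, {}
        # Process substrings in order of increasing length so that every
        # entry needed on a right-hand side has already been filled in.
        for la in range(0, A + 1):
            for i in range(0, A - la + 1):
                j = i + la
                for lb in range(0, B + 1):
                    for k in range(0, B - lb + 1):
                        l = k + lb
                        # (B.4): D depends only on a strictly shorter S entry.
                        if la >= 2 and lb >= 2:
                            D[(i, j, k, l)] = (S[(i + 1, j - 1, k + 1, l - 1)]
                                               + psi_paired(i + 1, j, k + 1, l)
                                               + psi_aligned(i + 1, k + 1)
                                               + psi_aligned(j, l))
                        # (B.3)
                        best = 0.0 if (i == j and k == l) else float('-inf')
                        if j > i:
                            best = max(best, S[(i, j - 1, k, l)])
                        if l > k:
                            best = max(best, S[(i, j, k, l - 1)])
                        if j > i and l > k:
                            best = max(best, S[(i, j - 1, k, l - 1)] + psi_aligned(j, l))
                        for jp in range(i, j - 1):        # bifurcations: i <= j' <= j - 2
                            for lp in range(k, l - 1):    # k <= l' <= l - 2
                                best = max(best, S[(i, jp, k, lp)] + D[(jp, j, lp, l)])
                        S[(i, j, k, l)] = best
        return S[(0, A, 0, B)]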

In the next section, we explore how these recurrences may be sped up considerably if a constraint set $\mathcal{C}$ of allowed base-pairings and aligned positions is known ahead of time. For complexity analysis, we assume $O(c)$ and $O(d)$ bounds on the number of candidate base-pairing and alignment partners per sequence position, respectively.

B.2.2 Exploiting base-pairing sparsity

LocARNA (Will et al., 2007) was the first program for simultaneous alignment and folding of RNA to take advantage of base-pairing sparsity in a manner that significantly improved both running time and memory usage. In this section, we recount the innovations of LocARNA as they are applied in RAF. In the next section, we extend these ideas to also account for alignment sparsity.

First, observe that since all parses in $\mathcal{Y}_{\mathcal{C}}$ contain only conserved base-pairings, the evaluation of (B.4) may be restricted to only those $D_{i,j;k,l}$ cells for which both $(a_{i+1}, a_j) \in \mathcal{C}$ and $(b_{k+1}, b_l) \in \mathcal{C}$. Similarly, the inner loop for considering bifurcations in (B.3) may also be restricted to only those $j'$ and $l'$ for which both $(a_{j'+1}, a_j) \in \mathcal{C}$ and $(b_{l'+1}, b_l) \in \mathcal{C}$. Since the bottleneck in the dynamic programming complexity is the number of executions of the innermost loop in (B.3), it follows that restricting the considered bifurcations in the manner described above yields an $O(c^2 L^4)$ running time; in particular, for each $i$ and $k$, computing all values of $S_{i,\bullet;k,\bullet}$ takes $O(c^2 L^2)$ time, as each entry of the $D$ matrix is touched at most once. This optimization was originally implemented as part of the LocARNA (Will et al., 2007) and FoldAlignM (Torarinsson et al., 2007) algorithms.

Second, consider the task of computing all entries in the $D$ matrix. From (B.4), we see that the values $D_{i,\bullet;k,\bullet}$ depend only on $S_{i+1,\bullet;k+1,\bullet}$. Similarly, from (B.3), the values $S_{i+1,\bullet;k+1,\bullet}$ depend only on $D_{j',j;l',l}$ for $j' \ge i+1$ and $l' \ge k+1$. Thus, ordering computations in the following way allows the recurrences to be evaluated in a single pass:

For $i \leftarrow |a| - 2$ downto 0
    For $k \leftarrow |b| - 2$ downto 0
        Compute $S_{i+1,\bullet;k+1,\bullet}$
        Compute $D_{i,\bullet;k,\bullet}$

Furthermore, since $S_{i+1,\bullet;k+1,\bullet}$ is only needed while computing $D_{i,\bullet;k,\bullet}$ (but not for any later values of $i$ and $k$), we need only retain one $S_{i+1,\bullet;k+1,\bullet}$ matrix in memory at any given time while computing the $D$ matrix. This observation was originally incorporated in the LocARNA program of Will et al. (2007).
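A minimal sketch of this evaluation order, with illustrative helper names (compute_S_slice fills $S_{i+1,\bullet;k+1,\bullet}$ from the already-computed portion of $D$, and compute_D_slice then fills $D_{i,\bullet;k,\bullet}$ from that single slice):

    def fill_D(len_a, len_b, compute_S_slice, compute_D_slice):
        """Single-pass evaluation of the D matrix; only one S slice is
        kept in memory at a time, as in LocARNA."""
        D = {}
        for i in range(len_a - 2, -1, -1):
            for k in range(len_b - 2, -1, -1):
                S_slice = compute_S_slice(i + 1, k + 1, D)  # S_{i+1,.;k+1,.}
                compute_D_slice(i, k, S_slice, D)           # fills D_{i,.;k,.}
        return D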

Finally, observe that once the $D$ matrix has been computed, the score $S_{0,|a|;0,|b|}$ of the optimal parse is easily obtainable in $O(c^2 L^2)$ time by recomputing $S_{0,\bullet;0,\bullet}$. Likewise, computing the full traceback requires at most $O(c^2 L^3)$ time, negligible relative to the cost of computing the $D$ matrix itself. Thus, we obtain an overall $O(c^2 L^4)$ time complexity with $O(c^2 L^2)$ space complexity (for storing the $D$ matrix).

B.2.3 Exploiting alignment sparsity

To exploit sparsity in the set of allowed aligned positions in $\mathcal{C}$, we again use the strategy of limiting the DP region. We accomplish this by first considering the simpler problem of computing the reduced DP region $\mathcal{A}$ (known as the alignment envelope) for pairwise sequence alignment without folding scores. Using $\mathcal{A}$, we then define a reduced DP region for our original alignment and folding task.


For the first step, consider the following restatement of recurrence (B.3) using the notation $S_{j,l} = S_{0,j;0,l}$, where we have omitted the case involving bifurcations/base-pairing:

$$
S_{j,l} = \max
\begin{cases}
0 & \text{if } j = 0 \text{ and } l = 0 \\
S_{j-1,l} & \text{if } j > 0 \\
S_{j,l-1} & \text{if } l > 0 \\
S_{j-1,l-1} + \psi^{\text{aligned}}_{\mathbf{w}}(j, l) & \text{if } j > 0 \text{ and } l > 0.
\end{cases}
$$

As before, $S_{j,l}$ represents the optimal score of aligning $a_1 a_2 \ldots a_j$ to $b_1 b_2 \ldots b_l$. Here, our goal is to find $\mathcal{A}$, the minimal set of cells containing no holes (that is, $S_{j,l} \in \mathcal{A}$ whenever $\{S_{j_1,l}, S_{j_2,l}\} \subseteq \mathcal{A}$ for some $j_1 < j < j_2$, or $\{S_{j,l_1}, S_{j,l_2}\} \subseteq \mathcal{A}$ for some $l_1 < l < l_2$) such that for every parse $y \in \mathcal{Y}_{\mathcal{C}}$, there exists some DP path through $\mathcal{A}$ corresponding to an alignment with the same set of aligned positions. Under the assumption that $\mathcal{A}$ contains no holes, we can represent $\mathcal{A}$ by keeping track of its boundaries: for each $j \in \{0, 1, \ldots, |a|\}$, let $\langle \mathcal{A}.\text{FIRST}[j], \mathcal{A}.\text{LAST}[j] \rangle$ denote the first and last positions $l \in \{0, 1, \ldots, |b|\}$ such that $S_{j,l} \in \mathcal{A}$.

We compute these boundaries in linear time using the following procedure. First, we adjust the boundaries to include $S_{j-1,l-1} \in \mathcal{A}$ and $S_{j,l} \in \mathcal{A}$ for each candidate aligning pair $(a_j, b_l) \in \mathcal{C}$. In addition, we also include the corners $S_{0,0}$ and $S_{|a|,|b|}$ in $\mathcal{A}$. Finally, we force the boundaries of $\mathcal{A}$ to satisfy the monotonicity conditions

$$
\mathcal{A}.\text{FIRST}[0] \le \mathcal{A}.\text{FIRST}[1] \le \ldots \le \mathcal{A}.\text{FIRST}[|a|]
$$
$$
\mathcal{A}.\text{LAST}[0] \le \mathcal{A}.\text{LAST}[1] \le \ldots \le \mathcal{A}.\text{LAST}[|a|]
$$

in such a way that guarantees all DP cells $S_{j,l} \in \mathcal{A}$ are accessible via some DP path from $S_{0,0}$ to $S_{|a|,|b|}$.
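One concrete way to realize this procedure is sketched below; enforcing monotonicity by a backward sweep over FIRST and a forward sweep over LAST only ever grows the envelope, so every included cell stays reachable (the names and the exact sweeps are illustrative assumptions, not the thesis implementation):

    def alignment_envelope(len_a, len_b, candidate_pairs):
        """FIRST/LAST column bounds of the alignment envelope A.

        candidate_pairs contains the 1-based (j, l) with (a_j, b_l) in C."""
        INF = float('inf')
        first = [INF] * (len_a + 1)
        last = [-1] * (len_a + 1)

        def include(j, l):
            first[j] = min(first[j], l)
            last[j] = max(last[j], l)

        for (j, l) in candidate_pairs:     # cells S_{j-1,l-1} and S_{j,l}
            include(j - 1, l - 1)
            include(j, l)
        include(0, 0)                      # corner S_{0,0}
        include(len_a, len_b)              # corner S_{|a|,|b|}

        for j in range(len_a, 0, -1):      # make FIRST nondecreasing in j
            first[j - 1] = min(first[j - 1], first[j])
        for j in range(1, len_a + 1):      # make LAST nondecreasing in j
            last[j] = max(last[j], last[j - 1])
        return first, last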

For the second step, we define the reduced DP region for our original simultaneous alignment and folding recurrences as the set $\mathcal{R}$ of all positions $S_{i,j;k,l}$ such that $S_{i,k} \in \mathcal{A}$ and $S_{j,l} \in \mathcal{A}$. To use this reduced DP region $\mathcal{R}$, then, we simply force $S_{i,j;k,l} = -\infty$ for all $S_{i,j;k,l} \notin \mathcal{R}$. Under this restriction, we can reduce the amount of computation performed in the recurrence (B.3) by iterating only over cells $S_{i,j;k,l} \in \mathcal{R}$, and similarly, restricting the evaluation of the $D$ matrix in (B.4) to only those cells $D_{i,j;k,l}$ for which $S_{i+1,j-1;k+1,l-1} \in \mathcal{R}$. To ensure that each allowed parse belongs to $\mathcal{Y}_{\mathcal{C}}$, we could penalize any base-pairing or aligned position not in $\mathcal{C}$ by $-\infty$. In practice, we instead augment $\mathcal{C}$ to include all aligned matches allowed by $\mathcal{R}$, since this can be done at no increase in computational complexity.

To analyze the new computational complexity of the algorithm, we begin by bounding the size of the $D$ matrix in two different ways. First, for each of the $O(cL)$ base-pairs $(a_i, a_j) \in \mathcal{C}$, there are $O(d)$ aligning partners for $a_i$ and $O(d)$ aligning partners for $a_j$, giving a total size of $O(cd^2 L)$. Alternatively, for each of the $O(dL)$ aligning pairs $(a_i, b_k) \in \mathcal{C}$, there are $O(c)$ base-pairing partners for $a_i$ and $O(c)$ base-pairing partners for $b_k$, giving a total size of $O(c^2 d L)$. Thus, the size of the $D$ matrix is $O(\min(c, d) \cdot cdL)$.

As in Section B.2.2, the space complexity of the algorithm is dominated by the cost of storing the $D$ matrix, and hence is $O(\min(c, d) \cdot cdL)$. Similarly, the time complexity can be estimated as the number of evaluations of the innermost loop in the bifurcation case of (B.3). Since the innermost loop touches each entry of the $D$ matrix at most once for each $i$ and $k$, and since there are $O(dL)$ choices of $(a_i, b_k) \in \mathcal{A}$, it follows that the time complexity of the algorithm is $O(\min(c, d) \cdot c d^2 L^2)$.

Note that in these bounds, we assume an $O(c)$ bound on the number of base-pairing partners per position, and an $O(d)$ bound on the number of aligning partners per position. A weaker condition would be to assume an $O(cL)$ bound on the total number of candidate base-pairing partners for sequences $a$ and $b$, and similarly, an $O(dL)$ bound on the total number of candidate aligned positions; under these conditions, we obtain a worst-case space complexity of $O(\min(c, d)^2 L^2)$ and a worst-case time complexity of $O(\min(c, d)^2 d L^3)$.


B.3 Norm bound

In this section, we derive a bound on the maximum norm of the optimal parameter vector $\mathbf{w}^*$ for (6.4). From standard arguments (see, e.g., Taskar et al. (2003)), the dual optimization problem is

$$
\underset{\alpha \in \Lambda}{\text{maximize}} \quad \sum_{i=1}^{m} \sum_{y' \in \mathcal{Y}^{(i)}_{\mathcal{C}} \cup \{y^{(i)}\}} \alpha_{i,y'} \, \Delta(y^{(i)}, y') - \frac{C}{2} \|\mathbf{w}(\alpha)\|^2
$$

where

$$
\Lambda = \left\{ (\alpha_{i,y'}) : \alpha_{i,y'} \ge 0, \; \sum_{y' \in \mathcal{Y}^{(i)}_{\mathcal{C}} \cup \{y^{(i)}\}} \alpha_{i,y'} = \frac{1}{m} \right\}
$$

$$
\mathbf{w}(\alpha) = \frac{1}{C} \sum_{i=1}^{m} \sum_{y' \in \mathcal{Y}^{(i)}_{\mathcal{C}} \cup \{y^{(i)}\}} \alpha_{i,y'} \bigl( \mathbf{F}(x, y^{(i)}) - \mathbf{F}(x, y') \bigr).
$$

By strong duality, for any solutions $(\mathbf{w}^*, \xi^*)$ and $\alpha^*$ of the primal and dual optimization problems, respectively, the values of the primal and dual objectives must be equal, i.e.,

$$
\frac{C}{2} \|\mathbf{w}^*\|^2 + \frac{1}{m} \sum_{i=1}^{m} \xi^*_i = \sum_{i=1}^{m} \sum_{y' \in \mathcal{Y}^{(i)}_{\mathcal{C}} \cup \{y^{(i)}\}} \alpha^*_{i,y'} \, \Delta(y^{(i)}, y') - \frac{C}{2} \|\mathbf{w}(\alpha^*)\|^2. \tag{B.5}
$$

Now, suppose that $D_i \in \mathbb{R}$ for $i = 1, \ldots, m$ satisfy

$$
D_i \ge \max_{y' \in \mathcal{Y}^{(i)}_{\mathcal{C}} \cup \{y^{(i)}\}} \Delta(y^{(i)}, y'). \tag{B.6}
$$

In the case of the RAF loss function, for example, we can use

$$
D_i = (|a| + |b|) \bigl( \gamma_{\text{FP paired}} + \gamma_{\text{FN paired}} + \gamma_{\text{FP aligned}} + \gamma_{\text{FN aligned}} \bigr).
$$

Then the KKT optimality condition $\mathbf{w}^* = \mathbf{w}(\alpha^*)$, the primal constraint that $\xi^*_i \ge 0$ for $i = 1, \ldots, m$, and (B.5) imply that

$$
\begin{aligned}
C\|\mathbf{w}^*\|^2 &= \sum_{i=1}^{m} \sum_{y' \in \mathcal{Y}^{(i)}_{\mathcal{C}} \cup \{y^{(i)}\}} \alpha^*_{i,y'} \, \Delta(y^{(i)}, y') - \frac{1}{m} \sum_{i=1}^{m} \xi_i \\
&\le \sum_{i=1}^{m} \sum_{y' \in \mathcal{Y}^{(i)}_{\mathcal{C}} \cup \{y^{(i)}\}} \alpha^*_{i,y'} D_i = \frac{1}{m} \sum_{i=1}^{m} D_i,
\end{aligned}
$$

where the last equality uses the dual feasibility constraint $\sum_{y'} \alpha^*_{i,y'} = \frac{1}{m}$. Therefore, $\|\mathbf{w}^*\| \le \sqrt{\frac{1}{m} \sum_{i=1}^{m} D_i / C}$.


Appendix C

Appendix for Chapter 7

C.1 Proof of Proposition 1

Suppose that $(\tau_1, \ldots, \tau_T)$ is an optimal solution such that $\tau_i > 0$ for some $i \ne 1$. Then, consider the alternate solution $(\tau_1 + \tau_i, \tau_2, \ldots, \tau_{i-1}, 0, \tau_{i+1}, \ldots, \tau_T)$. It is easy to verify that this latter solution achieves a strictly better objective value, since the denominators of each fraction in (7.5) either stay the same or increase, while the sum $\tau_{1:t}$ is not affected. Thus, we have a contradiction.

C.2 Proof of Theorem 2

Recalling that Algorithm 2 achieves a regret at most twice the optimal regret for any choice of $\tau_1, \ldots, \tau_T$, we have

$$
\begin{aligned}
R_T &\le 2 \min_{\tau_1} R_T(\tau_1, 0, \ldots, 0) \\
&\le 2 R_T(\tau, 0, \ldots, 0) \\
&= 4\tau R^2 + G^2 \sum_{i=1}^{T} \frac{1}{\lambda i + \tau} \\
&= 4\tau R^2 + \frac{G^2}{\lambda} \sum_{i=1}^{T} \frac{1}{i + \tau/\lambda} \\
&= 4\tau R^2 + \frac{G^2}{\lambda} \sum_{i=1+\tau/\lambda}^{T+\tau/\lambda} \frac{1}{i}.
\end{aligned}
$$

The upper bound follows immediately from the inequality

$$
\sum_{i=a}^{b} \frac{1}{i} \le \frac{1}{a} + \int_{a}^{b} \frac{1}{i} \, di = \frac{1}{a} + \log\left(\frac{b}{a}\right)
$$

for any $a, b > 0$ such that $b - a$ is a positive integer.

C.3 Proof of Corollary 3

Observe that

$$
\begin{aligned}
& \lim_{\lambda \to 0} \left[ 4\tau R^2 + \frac{G^2}{\lambda} \left[ \frac{1}{1 + \tau/\lambda} + \log\left( \frac{T + \tau/\lambda}{1 + \tau/\lambda} \right) \right] \right] & \text{(C.1)} \\
&= 4\tau R^2 + \frac{G^2}{\tau} + \lim_{\lambda \to 0} \frac{G^2}{\lambda} \log\left( \frac{\lambda T + \tau}{\lambda + \tau} \right) & \text{(C.2)} \\
&= 4\tau R^2 + \frac{G^2}{\tau} + \frac{(T-1)G^2}{\tau}. & \text{(C.3)}
\end{aligned}
$$


The last step relies on the fact that $\lim_{\lambda \to 0} \left[ \frac{1}{\lambda} \log \frac{\lambda T + x}{\lambda + x} \right] = \frac{T-1}{x}$; this, in turn, can be proven using the series expansion $\log z = \sum_{n=1}^{\infty} \frac{1}{n} \left( \frac{z-1}{z} \right)^n$, valid for $z > \frac{1}{2}$. The result follows immediately by substituting $\tau = \frac{G\sqrt{T}}{2R}$.
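For concreteness, the limit can also be seen directly from the first-order behavior of the logarithm (a short verification, not part of the original proof):

$$
\frac{1}{\lambda} \log \frac{\lambda T + x}{\lambda + x} = \frac{1}{\lambda} \log\left( 1 + \frac{\lambda (T-1)}{\lambda + x} \right) = \frac{T-1}{\lambda + x} + O(\lambda) \;\xrightarrow{\lambda \to 0}\; \frac{T-1}{x}.
$$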

C.4 Proof of Theorem 3

Following the proof technique used to obtain Theorem 2 from Shalev-Shwartz et al. (2007), observe that for any fixed sequence of (strongly) convex functions $f_1, \ldots, f_T$ and iterates $\mathbf{w}_1, \ldots, \mathbf{w}_T$ chosen by our algorithm, we have the regret bound

$$
\sum_{t=1}^{T} f_t(\mathbf{w}_t) - \sum_{t=1}^{T} f_t(\mathbf{w}^*) \le R_T.
$$

Dividing through by $T$ and taking expectations over $f_1, \ldots, f_T$ and $\mathbf{w}_1, \ldots, \mathbf{w}_T$,

$$
\mathbf{E}\left[ \frac{1}{T} \sum_{t=1}^{T} f_t(\mathbf{w}_t) \right] - \mathbf{E}\left[ \frac{1}{T} \sum_{t=1}^{T} f_t(\mathbf{w}^*) \right] \le \frac{R_T}{T}.
$$

Observing that $f_t$ does not depend on $f_{t+1}, \ldots, f_T$ and $\mathbf{w}_{t+1}, \ldots, \mathbf{w}_T$, we can apply linearity of expectations to the second term to obtain

$$
\frac{1}{T} \sum_{t=1}^{T} \mathbf{E}_{f_1, \ldots, f_t, \mathbf{w}_1, \ldots, \mathbf{w}_t} [f_t(\mathbf{w}^*)] \le \frac{1}{T} \sum_{t=1}^{T} f(\mathbf{w}^*) = f(\mathbf{w}^*).
$$

Similarly, for the first term,

$$
\begin{aligned}
\frac{1}{T} \sum_{t=1}^{T} \mathbf{E}_{f_1, \ldots, f_t, \mathbf{w}_1, \ldots, \mathbf{w}_t} [f_t(\mathbf{w}_t)]
&= \frac{1}{T} \sum_{t=1}^{T} \mathbf{E}_{\mathbf{w}_t} \, \mathbf{E}_{f_1, \ldots, f_t, \mathbf{w}_1, \ldots, \mathbf{w}_{t-1} \mid \mathbf{w}_t} [f_t(\mathbf{w}_t)] \\
&= \frac{1}{T} \sum_{t=1}^{T} \mathbf{E}_{\mathbf{w}_t} [f(\mathbf{w}_t)] = \mathbf{E}_r \, \mathbf{E}_{f_1, \ldots, f_t, \mathbf{w}_1, \ldots, \mathbf{w}_t} [f(\mathbf{w}_r)].
\end{aligned}
$$


The claim follows immediately.

C.5 Proof of Lemma 2

Define $\bar{\mathbf{w}} := \arg\min_{\mathbf{w} : \|\mathbf{w}\| \le R} f(\mathbf{w})$. With probability at least $1 - \delta$, $\epsilon \ge f(\mathbf{w}_r) - f(\bar{\mathbf{w}}) \ge \frac{\lambda}{2} \|\mathbf{w}_r - \bar{\mathbf{w}}\|^2$, where the first inequality is a consequence of Proposition 2 (and the choice of $T$ given in the algorithm), and the second inequality follows since $f$ is $\lambda$-strongly convex. In this case, we have $\|\mathbf{w}_r - \bar{\mathbf{w}}\| \le \sqrt{\frac{2\epsilon}{\lambda}}$. But since there was no increase in $R$, $\|\mathbf{w}_r\| < R - \sqrt{\frac{2\epsilon}{\lambda}}$. From the triangle inequality, $\|\bar{\mathbf{w}}\| < R$, so the constraint $\|\mathbf{w}\| \le R$ is not active in the definition of $\bar{\mathbf{w}}$. Hence $\bar{\mathbf{w}} = \mathbf{w}^*$.

C.6 Proof of Theorem 4

Observe that once $R \ge \frac{2}{\sqrt{\lambda}} > \frac{1}{\sqrt{\lambda}} + \sqrt{\frac{2\epsilon}{\lambda}}$, $R$ will never increase again, since our projections onto $S$ ensure that $\|\mathbf{w}_t\| \le \frac{1}{\sqrt{\lambda}}$ always. If $R$ is initialized to $\frac{1}{\sqrt{\lambda}}$, then no increases will ever occur (Shalev-Shwartz et al., 2007). Otherwise, $R$ is initialized to 1 and is multiplied by $\sqrt{2}$ on each increase, so at most $p := 2 \log_2\left(\frac{2}{\sqrt{\lambda}}\right) = 2 - \log_2 \lambda$ increases can occur. The number of examples processed, then, is the sum over all phases of $k$ times the number of iterations $T$ needed to achieve $\epsilon$-accuracy in each phase. This can be bounded by either

$$
\sum_{i=0}^{p} \frac{16 k (\sqrt{2}^{\,i})^2 G^2}{\delta^2 \epsilon^2} = \frac{16 k G^2}{\delta^2 \epsilon^2} \sum_{i=0}^{p} 2^i \le \frac{16 k G^2 \cdot 2^{p+1}}{\delta^2 \epsilon^2} = \frac{128 k G^2}{\delta^2 \lambda \epsilon^2},
$$

or $O\left( (3 - \log_2 \lambda) \frac{G^2}{\delta \lambda \epsilon} \right)$, using the observations from Section 7.3.3. Finally, of all $p + 1$ phases, the probability of failure in each phase is at most $\frac{\delta}{p+1}$, so the total probability of failure is at most $\delta$ by a union bound.


C.7 Proof of Lemma 3

Expanding $D'_t(\alpha_1, \ldots, \alpha_t) - D'_{t-1}(\alpha_1, \ldots, \alpha_{t-1})$ using the definition of the dual objective function, we obtain

$$
\frac{\left\| \sum_{i=1}^{t-1} (\tau_i \mathbf{w}_i - \alpha_i \mathbf{a}_i) \right\|^2}{2(\lambda(t-1) + \tau_{1:t-1})} - \frac{\left\| \sum_{i=1}^{t} (\tau_i \mathbf{w}_i - \alpha_i \mathbf{a}_i) \right\|^2}{2(\lambda t + \tau_{1:t})} + \frac{\tau_t}{2} \|\mathbf{w}_t\|^2 + \alpha_t b_t.
$$

The first two terms simplify as

$$
\begin{aligned}
& \frac{1}{2} \left( \frac{1}{\lambda(t-1) + \tau_{1:t-1}} - \frac{1}{\lambda t + \tau_{1:t}} \right) \left\| \sum_{i=1}^{t-1} (\tau_i \mathbf{w}_i - \alpha_i \mathbf{a}_i) \right\|^2 \\
& \qquad - \frac{(\tau_t \mathbf{w}_t - \alpha_t \mathbf{a}_t)^T \sum_{i=1}^{t-1} (\tau_i \mathbf{w}_i - \alpha_i \mathbf{a}_i)}{\lambda t + \tau_{1:t}} - \frac{\|\tau_t \mathbf{w}_t - \alpha_t \mathbf{a}_t\|^2}{2(\lambda t + \tau_{1:t})} \\
&= \left( \frac{\lambda(t-1) + \tau_{1:t-1}}{2} \right) \left( 1 - \frac{\lambda(t-1) + \tau_{1:t-1}}{\lambda t + \tau_{1:t}} \right) \|\mathbf{w}_t\|^2 \\
& \qquad - \left( \frac{\lambda(t-1) + \tau_{1:t-1}}{\lambda t + \tau_{1:t}} \right) (\tau_t \mathbf{w}_t - \alpha_t \mathbf{a}_t)^T \mathbf{w}_t - \frac{\|\tau_t \mathbf{w}_t - \alpha_t \mathbf{a}_t\|^2}{2(\lambda t + \tau_{1:t})} \\
&= \left( \frac{\lambda + \tau_t}{2} \right) \left( 1 - \frac{\lambda + \tau_t}{\lambda t + \tau_{1:t}} \right) \|\mathbf{w}_t\|^2 \\
& \qquad - \left( 1 - \frac{\lambda + \tau_t}{\lambda t + \tau_{1:t}} \right) (\tau_t \mathbf{w}_t - \alpha_t \mathbf{a}_t)^T \mathbf{w}_t - \frac{\|\tau_t \mathbf{w}_t - \alpha_t \mathbf{a}_t\|^2}{2(\lambda t + \tau_{1:t})}.
\end{aligned}
$$

Therefore,

$$
\begin{aligned}
& D'_t(\alpha_1, \ldots, \alpha_t) - D'_{t-1}(\alpha_1, \ldots, \alpha_{t-1}) \\
&= \frac{\lambda + \tau_t}{2} \|\mathbf{w}_t\|^2 - \mathbf{w}_t^T (\tau_t \mathbf{w}_t - \alpha_t \mathbf{a}_t) + \frac{\tau_t}{2} \|\mathbf{w}_t\|^2 + \alpha_t b_t \\
& \qquad - \frac{(\lambda + \tau_t)^2}{2(\lambda t + \tau_{1:t})} \|\mathbf{w}_t\|^2 + \frac{\lambda + \tau_t}{\lambda t + \tau_{1:t}} \mathbf{w}_t^T (\tau_t \mathbf{w}_t - \alpha_t \mathbf{a}_t) - \frac{\|\tau_t \mathbf{w}_t - \alpha_t \mathbf{a}_t\|^2}{2(\lambda t + \tau_{1:t})} \\
&= \left[ \frac{\lambda}{2} \|\mathbf{w}_t\|^2 + \alpha_t \mathbf{a}_t^T \mathbf{w}_t + \alpha_t b_t \right] - \frac{\|(\lambda + \tau_t) \mathbf{w}_t - (\tau_t \mathbf{w}_t - \alpha_t \mathbf{a}_t)\|^2}{2(\lambda t + \tau_{1:t})} \\
&= \left[ \frac{\lambda}{2} \|\mathbf{w}_t\|^2 + \mathbf{a}_t^T \mathbf{w}_t + b_t \right] - \frac{\|\lambda \mathbf{w}_t + \mathbf{a}_t\|^2}{2(\lambda t + \tau_{1:t})},
\end{aligned}
$$

where in the last line, we set $\alpha_t = 1$. To complete the argument, it suffices to observe that a Taylor expansion is exact at its point of expansion, and hence $\mathbf{a}_t^T \mathbf{w}_t + b_t = \ell(\mathbf{w}_t)$.


C.8 Proof of Proposition 3

If we define $D'_0(\alpha_0) = 0$, observe that

$$
\begin{aligned}
P'_T(\mathbf{w}^*) \ge D'_T(\alpha_T) &= \sum_{t=1}^{T} \bigl( D'_t(\alpha_t) - D'_{t-1}(\alpha_{t-1}) \bigr) \\
&\ge \sum_{t=1}^{T} \bigl( D'_t([\alpha_{t-1}; 1]) - D'_{t-1}(\alpha_{t-1}) \bigr) \\
&= \sum_{t=1}^{T} \left[ f(\mathbf{w}_t) - \frac{\|\lambda \mathbf{w}_t + \mathbf{a}_t\|^2}{2(\lambda t + \tau_{1:t})} \right],
\end{aligned}
\tag{C.4}
$$

where the first inequality follows from weak duality, the second inequality uses the fact that $\alpha_t$ maximizes $D'_t(\alpha)$ with respect to the dual constraints, and the final step follows from Lemma 3. Observe also that

$$
\begin{aligned}
P'_T(\mathbf{w}^*) &= T \cdot P_T(\mathbf{w}^*) + \sum_{i=1}^{T} \frac{\tau_i}{2} \|\mathbf{w}^* - \mathbf{w}_i\|^2 \\
&\le T \cdot f(\mathbf{w}^*) + \sum_{i=1}^{T} 2 \tau_i R^2.
\end{aligned}
\tag{C.5}
$$

Combining (C.4) and (C.5), we obtain

$$
\sum_{t=1}^{T} f(\mathbf{w}_t) - T f(\mathbf{w}^*) \le \sum_{t=1}^{T} \left[ 2 \tau_t R^2 + \frac{\|\lambda \mathbf{w}_t + \mathbf{a}_t\|^2}{2(\lambda t + \tau_{1:t})} \right]. \tag{C.6}
$$

Dividing through by $T$, using the given definitions of $R$ and $A_t$, and finally observing that $\min_{t \in \{1, \ldots, T\}} f(\mathbf{w}_t) \le \frac{1}{T} \sum_{t=1}^{T} f(\mathbf{w}_t)$, the claim follows immediately.


Bibliography

Abernethy, J., Bartlett, P. L., Rakhlin, A., and Tewari, A. (2008). Optimal strategies and minimax lower bounds for online convex games. In COLT.

Alberts, B., Johnson, A., Lewis, J., Raff, M., Roberts, K., and Walter, P. (2002). Molecular Biology of the Cell, Fourth Edition. Garland.

Altschul, S. F. (1991). Amino acid substitution matrices from an information theoretic perspective. J Mol Biol, 219:555–565.

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402.

Andersen, L., Larsen, J., Hansen, L., and Hintz-Madsen, M. (1997). Adaptive regularization of neural classifiers. In NNSP.

Andronescu, M., Condon, A., Hoos, H. H., Mathews, D. H., and Murphy, K. P. (2007). Efficient parameter estimation for RNA secondary structure prediction. Bioinformatics, 23:19–28.

Anguita, D., Ridella, S., Rivieccio, F., and Zunino, R. (2003). Hyperparameter design criteria for support vector classifiers. Neurocomputing, 55:109–134.

Ban, N., Nissen, P., Hansen, J., Moore, P. B., and Steitz, T. A. (2000). The complete atomic structure of the large ribosomal subunit at 2.4 Å resolution. Science, 289:905–920.

Bartlett, P., Hazan, E., and Rakhlin, A. (2007). Adaptive online gradient descent. In NIPS.

Bartlett, P. L., Collins, M., McAllester, D., and Taskar, B. (2004). Exponentiated gradient algorithms for large margin methods for structured classification. In NIPS.

Bengio, Y. (2000). Gradient-based optimization of hyperparameters. Neural Computation, 12:1889–1900.

Bernal, A., Crammer, K., Hatzigeorgiou, A., and Pereira, F. (2007). Global discriminative learning for higher-accuracy computational gene prediction. PLoS Comput Biol, 3(3):e54.

Bertsekas, D. P., Nedic, A., and Ozdaglar, A. E. (2003). Convex analysis and optimization. Athena Scientific.

Bilenko, M. and Mooney, R. J. (2005). Alignments and string similarity in information integration: A random field approach. In Proc. Dagstuhl Seminar on Machine Learning for the Semantic Web.

Boyd, S. and Vandenberghe, L. (2004). Convex optimization. Cambridge UP.

Brion, P. and Westhof, E. (1997). Hierarchy and dynamics of RNA folding. Annu Rev Biophys Biomol Struct, 26:113–137.

Chapelle, O., Le, Q. V., and Smola, A. J. (2007). Large margin optimization of ranking measures. In NIPS Workshop: Machine Learning for Web Search.

Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46(1–3):131–159.

Chen, D. and Hagan, M. (1999). Optimal use of regularization and cross-validation in neural network modeling. In IJCNN.

Collins, M. (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP.

Collins, M., Globerson, A., Koo, T., Carreras, X., and Bartlett, P. (2007). Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. J Mach Learn Res.

Crick, F. H. C. (1958). On protein synthesis. In Symp Soc Exp Biol XII, pages 139–163.

Culotta, A., Kulp, D., and McCallum, A. (2005). Gene prediction with conditional random fields. Technical report UM-CS-2005-028, University of Massachusetts, Amherst.

Pruitt, K. D., Tatusova, T., and Maglott, D. R. (2003). NCBI Reference Sequence project: update and current status. Nucleic Acids Res, 31(1):34–37.

Dalli, D., Wilm, A., Mainz, I., and Steger, G. (2006). STRAL: progressive alignment of non-coding RNA using base pairing probability vectors in quadratic time. Bioinformatics, 22(13):1593–1599.

Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. (1978). A model of evolutionary change in proteins. In Atlas of protein sequences and structure, volume 5 (Suppl 2), pages 345–352.

DeCaprio, D., Vinson, J. P., Pearson, M. D., Montgomery, P., Doherty, M., and Galagan, J. E. (2007). Conrad: gene prediction using conditional random fields. Genome Res, 17(9):1389–1398.

Dieffenbach, C. W., Lowe, T. M., and Dveksler, G. S. (1993). General concepts for PCR primer design. PCR Methods Appl, 3:30–37.

Ding, Y. and Lawrence, C. (2003). A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res, 31(24):7280–7301.

Do, C. B., Foo, C. S., and Batzoglou, S. (2008). A max-margin model for efficient simultaneous alignment and folding of RNA sequences. Bioinformatics, 24(13):i68–i76.

Do, C. B., Foo, C. S., and Ng, A. Y. (2007). Efficient multiple hyperparameter learning for log-linear models. In NIPS.

Do, C. B., Gross, S. S., and Batzoglou, S. (2006a). CONTRAlign: discriminative training for protein sequence alignment. In RECOMB, pages 160–174.

Do, C. B. and Katoh, K. (2008). Protein multiple sequence alignment. In Thompson, J. D., Ueffing, M., and Schaeffer-Reiss, C., editors, Functional Proteomics. Humana Press.

Do, C. B., Le, Q., and Foo, C. S. (2009). Proximal regularization for online and batch learning. In ICML.

Do, C. B., Mahabhashyam, M. S., Brudno, M., and Batzoglou, S. (2005). PROBCONS: Probabilistic consistency-based multiple sequence alignment. Genome Res, 15(2):330–340.

Do, C. B., Woods, D. A., and Batzoglou, S. (2006b). CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22(14):e90–e98.

Dowell, R. D. and Eddy, S. R. (2004). Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics, 5(71).

Dowell, R. D. and Eddy, S. R. (2006). Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics, 7:400.

Duan, K., Keerthi, S. S., and Poo, A. (2003). Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing, 51(4):41–59.

Durbin, R., Eddy, S. R., Krogh, A., and Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.

Eddy, S. R. (2002). Computational genomics of noncoding RNA genes. Cell, 109:137–140.

Eddy, S. R. and Durbin, R. (1994). RNA sequence analysis using covariance models. Nucleic Acids Res, 22(11):2079–2088.

Edgar, R. C. (2004a). Local homology recognition and distance measures in linear time using compressed amino acid alphabet. Nucleic Acids Res, 32(1):380–385.

Edgar, R. C. (2004b). MUSCLE: Low-complexity multiple sequence alignment with T-Coffee accuracy. In ISMB/ECCB.

Edgar, R. C. (2004c). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, 32(5):1792–1797.

Eigenmann, R. and Nossek, J. A. (1999). Gradient based adaptive regularization. In NNSP, pages 87–94.

Eyrich, V. A., Marti-Renom, M. A., Przybylski, D., Madhusudhan, M. S., Fiser, A., Pazos, F., Valencia, A., Sali, A., and Rost, B. (2001). EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17(12):1242–1243.

Feng, D. F. and Doolittle, R. F. (1987). Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol, 25:351–360.

Finley, T. and Joachims, T. (2008). Training structural SVMs when exact inference is intractable. In ICML '08: Proceedings of the 25th International Conference on Machine Learning, pages 304–311, New York, NY, USA. ACM.

Flannick, J., Novak, A., Do, C. B., Srinivasan, B. S., and Batzoglou, S. (2008). Automatic parameter learning for multiple network alignment. In RECOMB.

Furtig, B., Richter, C., Wohnert, J., and Schwalbe, H. (2003). NMR spectroscopy of RNA. Chembiochem, 4(10):936–962.

Gardner, P. P. and Giegerich, R. (2004). A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics, 5(140).

Gardner, P. P., Wilm, A., and Washietl, S. (2005). A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Res, 33(8):2433–2439.

Garey, M. R. and Johnson, D. S. (1979). Computers and intractability. Freeman.

Getoor, L. and Taskar, B., editors (2007). Introduction to Statistical Relational Learning. MIT Press, Cambridge, MA.

Glasmachers, T. and Igel, C. (2005). Gradient-based adaptation of general Gaussian kernels. Neural Comp., 17(10):2099–2105.

Globerson, A., Koo, T. Y., Carreras, X., and Collins, M. (2007). Exponentiated gradient algorithms for log-linear structured prediction. In ICML, pages 305–312.

Gonnet, G. H., Cohen, M. A., and Benner, S. A. (1992). Exhaustive matching of the entire protein sequence database. Science, 256(5062):1609–1610.

Gorodkin, J., Heyer, L. J., and Stormo, G. D. (1997). Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucleic Acids Res, 25(18):3724–3732.

Gorodkin, J., Stricklin, S. L., and Stormo, G. D. (2001). Discovering common stem-loop motifs in unaligned RNA sequences. Nucleic Acids Res, 29:2135–2144.

Gotoh, O. (1982). An improved algorithm for matching biological sequences. J Mol Biol, 162:705–708.

Goutte, C. and Larsen, J. (1998). Adaptive regularization of neural networks using conjugate gradient. In ICASSP.

Griffiths-Jones, S., Bateman, A., Marshall, M., Khanna, A., and Eddy, S. R. (2003). Rfam: an RNA family database. Nucleic Acids Res, 31(1):439–441.

Griffiths-Jones, S., Moxon, S., Marshall, M., Khanna, A., Eddy, S. R., and Bateman, A. (2005). Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res, 33:D121–D124.

Gross, S. S., Do, C. B., Sirota, M., and Batzoglou, S. (2007). CONTRAST: a discriminative phylogeny-free approach to multiple informant de novo gene prediction. Genome Biology, 8(R269).

Gutell, R. R., Lee, J. C., and Cannone, J. J. (2002). The accuracy of ribosomal RNA comparative structure models. Current Opinion in Structural Biology, 12:301–310.

Harmanci, A. O., Sharma, G., and Mathews, D. H. (2007). Efficient pairwise RNA structure prediction using probabilistic alignment constraints in Dynalign. BMC Bioinformatics, 8(130).

Havgaard, J. H., Lyngsø, R. B., and Gorodkin, J. (2005). The FOLDALIGN web server for pairwise structural RNA alignment and mutual motif search. Nucleic Acids Res, 33 (Web Server Issue):W650–W653.

Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Mach Learn, 69(2-3):169–192.

Henikoff, S. and Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proc Nat Acad Sci USA, 89:10915–10919.

Heringa, J. (2002). Local weighting schemes for protein multiple sequence alignment. Computers and Chemistry, 26:459–477.

Hofacker, I. L., Bernhart, S. H. F., and Stadler, P. F. (2004). Alignment of RNA base pairing probability matrices. Bioinformatics, 20(14):2222–2227.

Hofacker, I. L., Fekete, M., and Stadler, P. F. (2002). Secondary structure prediction for aligned RNA sequences. J Mol Biol, 319:1059–1066.

Hofacker, I. L., Fontana, W., Stadler, P. F., Bonhoeffer, L. S., and Schuster, P. (1994). Fast folding and comparison of RNA secondary structures (The Vienna RNA Package). Monatsh Chem, 125:167–188.

Holmes, I. (2005). Accelerated probabilistic inference of RNA structure evolution. BMC Bioinformatics, 6(73).

Holmes, I. and Durbin, R. (1998). Dynamic programming alignment accuracy. J Comp Biol, 5(3):493–504.

Joachims, T. (2006). Training linear SVMs in linear time. In KDD, pages 217–226.

Joachims, T., Galor, T., and Elber, R. (2005). Learning to align sequences: a maximum-margin approach. In New Algorithms for Macromolecular Simulation.

Jones, D. T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol, 292(2):195–202.

Kabsch, W. (1978). A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallog sect A, 34:827–828.

Kakade, S. and Shalev-Shwartz, S. (2008). Mind the duality gap: logarithmic regret algorithms for online optimization. In NIPS.

Kapoor, A., Qi, Y., Ahn, H., and Picard, R. W. (2006). Hyperparameter and kernel learning for graph based semi-supervised classification. In NIPS, pages 627–634.

Karchin, R., Cline, M., Mandel-Guttfreund, Y., and Karplus, K. (2003). Hidden Markov models that use predicted local structure for fold recognition: alphabets of backbone geometry. Proteins: Structure, Function, and Genetics, 51(4):504–514.

Katoh, K., Kuma, K., Toh, H., and Miyata, T. (2005). MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res, 33:511–518.

Katoh, K., Misawa, K., Kuma, K., and Miyata, T. (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res, 30:3059–3066.

Keerthi, S. S. (2002). Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms. IEEE Transactions on Neural Networks, 13(5):1225–1229.

Keerthi, S. S., Sindhwani, V., and Chapelle, O. (2007). An efficient method for gradient-based adaptation of hyperparameters in SVM models. In NIPS.

Kiryu, H., Tabei, Y., Kin, T., and Asai, K. (2007). Murlet: a practical multiple alignment tool for structural RNA sequences. Bioinformatics, 23(13):1588–1598.

Kiwiel, K. C. (1983). Proximity control in bundle methods for convex nondifferentiable minimization. Math Program, 27:320–341.

Knudsen, B. and Hein, J. (1999). RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics, 15(6):446–454.

Knudsen, B. and Hein, J. (2003). Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res, 31(13):3423–3428.

Kobayashi, K., Kitakoshi, D., and Nakano, R. (2005). Yet faster method to optimize SVR hyperparameters based on minimizing cross-validation error. In IJCNN, volume 2, pages 871–876.

Kobayashi, K. and Nakano, R. (2004). Faster optimization of SVR hyperparameters based on minimizing cross-validation error. In IEEE Conference on Cybernetics and Intelligent Systems.

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI, pages 1137–1145.

Krieger, E., Hooft, R. W. W., Nabuurs, S., and Vriend, G. (2004). PDBFinderII: a database for protein structure analysis and prediction. Submitted.

Kulesza, A. and Pereira, F. (2008). Structured learning with approximate inference. In Platt, J., Koller, D., Singer, Y., and Roweis, S., editors, Advances in Neural Information Processing Systems 20, pages 785–792. MIT Press, Cambridge, MA.

Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: probabilistic models for segmenting and labeling sequence data. In ICML 18, pages 282–289.

Larsen, J., Hansen, L. K., Svarer, C., and Ohlsson, M. (1996a). Design and regularization of neural networks: the optimal use of a validation set. In NNSP.

Larsen, J., Svarer, C., Andersen, L. N., and Hansen, L. K. (1996b). Adaptive regularization in neural network modeling. In Neural Networks: Tricks of the Trade, pages 113–132.

Lemarechal, C., Nemirovskii, A., and Nesterov, Y. (1995). New variants of bundle methods. Math Program, 69:111–147.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady.

Lindgreen, S., Gardner, P. P., and Krogh, A. (2007). MASTR: multiple alignment and structure prediction of non-coding RNAs using simulated annealing. Bioinformatics, 23(24):3304–3311.

Liu, Y., Carbonell, J., Weigele, P., and Gopalakrishnan, V. (2006). Protein fold recognition using segmentation conditional random fields (SCRFs). J Comput Biol, 13(2):394–406.

MacKay, D. J. C. (1992). Bayesian interpolation. Neural Computation, 4(3):415–447.

MacKay, D. J. C. and Takeuchi, R. (1998). Interpolation models with multiple hyperparameters. Statistics and Computing, 8:15–23.

Marti-Renom, M. A., Stuart, A. C., Fiser, A., Sanchez, R., Melo, F., and Sali, A. (2000). Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct, 29:291–325.

Martins, J. R. R. A., Sturdza, P., and Alonso, J. J. (2003). The complex-step derivative approximation. ACM Trans. Math. Softw., 29(3):245–262.

Mathews, D. H., Sabina, J., Zuker, M., and Turner, D. H. (1999). Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol, 288(5):911–940.

Mathews, D. H. and Turner, D. H. (2002). Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J Mol Biol, 317(2):191–203.

Matthews, B. W. (1975). Comparison of predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta., 405:442–451.

Mattick, J. S. (2004). RNA regulation: a new genetics? Nat Rev Genet, 5:316–323.

McCallum, A., Bellare, K., and Pereira, F. (2005). A conditional random field for discriminatively-trained finite-state string edit distance. In Proc. UAI.

McCaskill, J. S. (1990). The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29:1105–1119.

Minka, T. P. (2001). Expectation propagation for approximate Bayesian inference. In UAI, volume 17, pages 362–369.

Mizuguchi, K., Deane, C. M., Blundell, T. L., and Overington, J. P. (1998). HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci, 7:2469–2471.

Morrison, D. A. (2006). Multiple sequence alignment for phylogenetic purposes. Australian Systematic Botany, 19(6):479–539.

Moulton, V. (2005). Tracking down noncoding RNAs. Proc Nat Acad Sci USA, 102(7):2269–2270.

Muller, T. and Vingron, M. (2000). Modeling amino acid replacement. J Comput Biol, 7:761–776.

Murray, I. and Ghahramani, Z. (2004). Bayesian learning in undirected graphical models: approximate MCMC algorithms. In UAI, pages 392–399.

Murzin, A. G., Brenner, S. E., Hubbard, T., and Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 247:536–540.

Neal, R. M. (1996). Bayesian Learning for Neural Networks. Springer.

Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol, 48:443–453.

Nesterov, Y. (2003). Introductory Lectures on Convex Optimization: A Basic Course. Springer.

Ng, A. and Jordan, M. (2002). On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In NIPS 14.

Ng, A. Y. (1997). Preventing overfitting of cross-validation data. In ICML, pages 245–253.

Ng, A. Y. (2004). Feature selection, L1 vs. L2 regularization, and rotational invariance. In ICML.

Nocedal, J. and Wright, S. J. (1999). Numerical Optimization. Springer.

Notredame, C., Higgins, D., and Heringa, J. (2000). T-Coffee: A novel method for multiple sequence alignments. J Mol Biol, 302:205–217.

Nussinov, R. and Jacobson, A. B. (1980). Fast algorithm for predicting the secondary structure of single-stranded RNA. Proc Natl Acad Sci USA, 77(11):6309–6313.

Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B., and Thornton, J. M. (1997). CATH – a hierarchic classification of protein domain structures. Structure, 5(8):1093–1108.

O'Sullivan, O., Suhre, K., Abergel, C., Higgins, D. G., and Notredame, C. (2004). 3DCoffee: Combining protein sequences and structures within multiple sequence alignments. J Mol Biol, 340:385–395.

Papanicolaou, C., Gouy, M., and Ninio, J. (1984). An energy model that predicts the correct folding of both the tRNA and the 5S RNA molecules. Nucleic Acids Res, 12(1 Pt 1):31–44.

Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural Comp, 6(1):147–160.

Perriquet, O., Touzet, H., and Dauchet, M. (2003). Finding the common structure shared by two homologous RNAs. Bioinformatics, 19(1):108–116.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992). Numerical Recipes in C: The Art of Scientific Computing. Cambridge UP, New York, NY, USA.

Prlic, A., Domingues, F. S., and Sippl, M. J. (2000). Structure-derived substitution matrices for alignment of distantly related sequences. Protein Eng, 13:545–550.

Qi, Y., Szummer, M., and Minka, T. P. (2005). Bayesian conditional random fields. In AISTATS.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.

Raghava, G. P. S., Searle, S. M. J., Audley, P. C., Barber, J. D., and Barton, G. J. (2003). OXBench: A benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics, 4(47).

Reeder, J. and Giegerich, R. (2004). Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics, 5.

Reeder, J., Steffen, P., and Giegerich, R. (2005). Effective ambiguity checking in biosequence analysis. BMC Bioinformatics, 6(153).

Reese, J. T. and Pearson, W. R. (2002). Empirical determination of effective gap penalties for sequence comparison. Bioinformatics, 18(11):1500–1507.

Rivas, E. and Eddy, S. R. (1999). A dynamic programming algorithm for RNA structure prediction including pseudoknots. J Mol Biol, 285:2053–2068.

Rivas, E. and Eddy, S. R. (2000). Secondary structure alone is generally not statistically significant for the detection of noncoding RNAs. Bioinformatics, 16(7):583–605.

Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Eng, 12(2):85–94.

Rouillard, J. M., Zuker, M., and Gulari, E. (2003). OligoArray 2.0: design of oligonucleotide probes for DNA microarrays using a thermodynamic approach. Nucleic Acids Res, 31(12):3057–3062.

Ruan, J., Stormo, G. D., and Zhang, W. (2004). An iterated loop matching approach to the prediction of RNA secondary structures with pseudoknots. Bioinformatics, 20(1):58–66.

Sankoff, D. (1985). Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math., 45:810–825.

Sato, K. and Sakakibara, Y. (2005). RNA secondary structural alignment with conditional random fields. Bioinformatics, 21 (Suppl 2):ii237–ii242.

Schramm, H. and Zowe, J. (1992). A version of the bundle idea for minimizing a nonsmooth function: conceptual idea, convergence analysis, numerical results. SIAM J Optim, 2:121–152.

Schulze, U., Hepp, B., Ong, C. S., and Rätsch, G. (2007). PALMA: mRNA to genome alignments using large margin algorithms. Bioinformatics, 23(15):1892–1900.

Seeger, M. (2007). Cross-validation optimization for large scale hierarchical classification kernel methods. In NIPS.

Sha, F. and Pereira, F. (2003). Shallow parsing with conditional random fields. In NAACL, pages 134–141.

Shalev-Shwartz, S., Singer, Y., and Srebro, N. (2007). Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, pages 807–814.

Shi, J., Blundell, T. L., and Mizuguchi, K. (2001). FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J Mol Biol, 310:243–257.

Simossis, V. A. and Heringa, J. (2005). PRALINE: a multiple alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Res, 33 (Web Server Issue):W289–W294.

Simossis, V. A., Kleinjung, J., and Heringa, J. (2005). Homology-extended sequence alignment. Nucleic Acids Res, 33(3):816–824.

Smola, A., Vishwanathan, S. V. N., and Le, Q. (2007). Bundle methods for machine learning. In NIPS.

Sneath, P. H. and Sokal, R. R. (1962). Numerical taxonomy. Nature, 193:855–860.

Stone, E. A. and Sidow, A. (2005). Physicochemical constraint violation by missense substitutions mediates impairment of protein function and disease severity. Genome Res, 15:978–986.

Sundararajan, S. and Keerthi, S. S. (2001). Predictive approaches for choosing hyperparameters in Gaussian processes. Neural Comp., 13(5):1103–1118.

Sutton, C. and McCallum, A. (2007). An introduction to conditional random fields for relational learning. In Getoor, L. and Taskar, B., editors, Introduction to Statistical Relational Learning. MIT Press, Cambridge, MA.

Tabei, Y., Tsuda, K., Kin, T., and Asai, K. (2006). SCARNA: fast and accurate structural alignment of RNA sequences by matching fixed-length stem fragments. Bioinformatics, 22(14):1723–1729.

Taskar, B., Chatalbashev, V., Koller, D., and Guestrin, C. (2005). Learning structured prediction models: A large margin approach. In ICML.

Taskar, B., Guestrin, C., and Koller, D. (2003). Max-margin Markov networks. In NIPS 16.

Taylor, W. R. and Orengo, C. A. (1989). Protein structure alignment. J Mol Biol, 208:1–22.

Teo, C. H., Le, Q., Smola, A. J., and Vishwanathan, S. V. N. (2007). A scalable modular convex solver for regularized risk minimization. In KDD.

The ENCODE Project Consortium (2007). Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447:799–816.

Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994). CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties, and weight matrix choice. Nucleic Acids Res, 22(22):4673–4680.

Thompson, J. D., Koehl, P., Ripp, R., and Poch, O. (2005). BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark. Proteins, 61:127–136.

Thompson, J. D., Plewniak, F., and Poch, O. (1999). A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res, 27(13):2682–2690.

Tinoco, I., Uhlenbeck, O. C., and Levine, M. D. (1971). Estimation of secondary structure in ribonucleic acids. Nature, 230:362–367.

Torarinsson, E., Havgaard, J. H., and Gorodkin, J. (2007). Multiple structural alignment and clustering of RNA sequences. Bioinformatics, 23(8):926–932.


Torarinsson, E., Sawera, M., Havgaard, J. H., Fredholm, M., and Gorodkin, J. (2006). Thousands of corresponding human and mouse genomic regions unalignable in primary sequence contain common RNA structure. Genome Res, 16:885–889.

Touzet, H. and Perriquet, O. (2004). CARNAC: folding families of related RNAs. Nucleic Acids Res, 32 (Web Server Issue):W142–W145.

Turner, D. H., Sugimoto, N., and Freier, S. M. (1988). RNA structure prediction. Ann Rev Biophys Biophys Chem, 17:167–192.

Vapnik, V. N. (1998). Statistical Learning Theory. Wiley-Interscience.

Vingron, M. and Waterman, M. S. (1994). Sequence alignment and penalty choice. Review of concepts, case studies and implications. J Mol Biol, 235(1):1–12.

Vishwanathan, S. V. N., Schraudolph, N. N., Schmidt, M. W., and Murphy, K. P. (2006). Accelerated training of conditional random fields with stochastic gradient methods. In ICML, pages 969–976.

Wainwright, M. J. (2006). Estimating the "wrong" graphical model: Benefits in the computation-limited setting. J. Mach. Learn. Res., 7:1829–1859.

Wallace, I. M., O'Sullivan, O., Higgins, D. G., and Notredame, C. (2006). M-Coffee: combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Res, 34(6):1692–1699.

Walle, I. V., Lasters, I., and Wyns, L. (2005). SABmark – a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics, 21(7):1267–1268.

Wang, X. and Padgett, R. A. (1989). Hydroxyl radical "footprinting" of RNA: application to pre-mRNA splicing complexes. Proc Natl Acad Sci USA, 86:7795–7799.

Welling, M. and Parise, S. (2006). Bayesian random fields: the Bethe-Laplace approximation. In ICML.

Wexler, Y., Zilberstein, C., and Ziv-Ukelson, M. (2007). A study of accessible motifs and RNA folding complexity. J Comput Biol, 14(6):856–872.


Whelan, S. and Goldman, N. (2001). A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol, 18:691–699.

Will, S., Reiche, K., Hofacker, I. L., Stadler, P. F., and Backofen, R. (2007). Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput Biol, 3(4).

Williams, C. K. I. and Barber, D. (1998). Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351.

Wolfinger, M. T., Svrcek-Seiler, W. A., Flamm, C., Hofacker, I. L., and Stadler, P. F. (2004). Efficient computation of RNA folding dynamics. J Phys A: Math Gen, 37:4731–4741.

Xu, X., Ji, Y., and Stormo, G. D. (2007). RNA Sampler: a new sampling based algorithm for common RNA secondary structure prediction and structural alignment. Bioinformatics, 23(15):1883–1891.

Ying, X., Luo, H., Luo, J., and Li, W. (2004). RDfolder: a web server for prediction of RNA secondary structure. Nucleic Acids Res, 32 (Web Server Issue):W150–W153.

Zhang, X. and Lee, W. S. (2007). Hyperparameter learning for graph based semi-supervised learning algorithms. In NIPS.

Zhou, H. and Zhou, Y. (2005). SPEM: improving multiple sequence alignment with sequence profiles and predicted secondary structures. Bioinformatics, 21(18):3615–3621.

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In ICML.

Zuker, M. (2003). Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res, 31(13):3406–3415.