Efficient Probabilistic Logic Programming for Biological Sequence Analysis


Christian Theil Have
Research group PLIS: Programming, Logic and Intelligent Systems
Department of Communication, Business and Information Technologies
Roskilde University

February 6, 2013


OUTLINE

Introduction
- Domain
- Research questions

Background
- Gene finding
- Probabilistic Logic Programming

A peek into my work
- Overview of papers
- The trouble with tabling of structured data
- Constrained HMMs
- Applications: Genome models

Conclusions


INTRODUCTION


BIOLOGICAL SEQUENCE ANALYSIS

- Subfield of bioinformatics
- Analyze biological sequences
  - DNA
  - RNA
  - Proteins
- to understand
  - Features
  - Functions
  - Evolutionary relationships





PROBABILISTIC LOGIC PROGRAMMING

- Declarative programming paradigm
- Ability to express common and complex models used in biological sequence analysis
- Concise expression of complex models
- Separation between logic and control
  - Generic inference algorithms
  - Transformations


MODELS FOR BIOLOGICAL SEQUENCE ANALYSIS

- Reflect relationships between features of sequence data
- Embody constraints (assumptions about data)
- Infer information from data
- Reasoning about uncertainty → probabilities


THE LOST PROJECT

... seeks to improve ease of modeling, accuracy and reliability of sequence analysis by using logic-statistical models ...

Key focus areas:

- The PRISM system
- Prokaryotic gene finding

My Ph.D. project is part of the LoSt project and shares these focus areas.


RESEARCH QUESTIONS

1. To what extent is it possible to use probabilistic logic programming for biological sequence analysis?

2. How can constraints relevant to the domain of biological sequence analysis be combined with probabilistic logic programming?

3. What are the limitations with regard to efficiency and how can these be dealt with?

I believe that these are the central questions that need to be addressed in order to construct useful tools for biological sequence analysis using probabilistic logic programming.




APPROACH

To build and evaluate

- Applications
- Abstractions
- Optimizations

for biological sequence analysis using probabilistic logic programming.


APPROACH

- Applications
  - Deal with relevant biological sequence analysis problems
  - Potentially contribute new knowledge to biology or bioinformatics
  - Direct substantiation with regard to research question 1
- Abstractions
  - Ease modeling
  - Language for incorporating constraints from the domain
  - A higher level of declarativity
  - Focus on the problem rather than implementation (model) details
- Optimizations
  - Deal with limitations of probabilistic logic programming that may hinder its use in biological sequence analysis.
  - Efficient inference is a precondition for practical use.


BACKGROUND

- Prokaryotic gene finding
- Probabilistic logic programming


PROKARYOTIC GENE FINDING

Identify regions of DNA which encode proteins:

A (prokaryotic) gene is a consecutive stretch of DNA which
- is transcribed as part of an RNA
- is translated to a complete protein
- has a length which is a multiple of three (codons)
- starts with a “start” codon
- ends with a “stop” codon


GENES AND OPEN READING FRAMES

The identification of prokaryotic genes may be decomposed into two distinct problems:

1. Identification of ORFs which contain protein coding genes.
2. Identification of the correct start codon within an ORF.

    〈ORF〉 ::= 〈start〉 〈not-stop〉* 〈stop〉
    〈start〉 ::= TTG | CTG | ATT | ATC | ATA | ATG | GTG
    〈stop〉 ::= TAA | TAG | TGA
    〈not-stop〉 ::= AAA | ... | TTT   // all codons except those in 〈stop〉
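The grammar above can be read as a scanning procedure. The following Python sketch is an illustration only, not code from the talk: the function name find_orfs is hypothetical, it scans a single strand, and it reports every substring that matches the grammar (so one stop codon may close several candidate starts, which is exactly the second sub-problem).

```python
START = {"TTG", "CTG", "ATT", "ATC", "ATA", "ATG", "GTG"}
STOP = {"TAA", "TAG", "TGA"}

def find_orfs(dna):
    """Return sorted (begin, end) index pairs, end exclusive, of every
    substring of `dna` that matches <start> <not-stop>* <stop>."""
    orfs = []
    for frame in range(3):
        open_starts = []  # start codons seen since the last stop in this frame
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if codon in STOP:
                # every open start codon forms an ORF ending at this stop
                orfs.extend((begin, pos + 3) for begin in open_starts)
                open_starts = []
            elif codon in START:
                open_starts.append(pos)
    return sorted(orfs)

print(find_orfs("ATGAAATTGCCCTAA"))
```

Both reported candidates share the stop codon TAA and differ only in their start codon (ATG or TTG).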


SIGNALS FOR PROKARYOTIC GENE FINDING

- Open reading frames
  - Length
  - Nucleotide sequence composition
  - Conservation (sequence similarity in other organisms)
- Local context
  - Promoters
  - Ribosomal binding site
  - Termination signal

[Figure: schematic of the local context of a gene. Labels, in order: GB, -35, PB, -10, +1 (tss), SD, ≈ +10, Gene, ≈ +15-20, Terminator.]


READING FRAMES AND OVERLAPPING GENES

- RNA can be transcribed from either strand
- Genes may start in different “reading frames”
- Genes can overlap
  - in the same and in different reading frames
  - on opposite strands
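As a small illustration of the six reading frames (three offsets on each strand), here is a Python sketch. The helper names reverse_complement and six_frames are hypothetical and not part of the talk.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Sequence of the opposite strand, read in its own 5' to 3' direction."""
    return dna.translate(COMPLEMENT)[::-1]

def six_frames(dna):
    """Map (strand, offset) to the list of complete codons read in that frame."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[(strand, offset)] = [
                seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)
            ]
    return frames

for frame, codons in six_frames("ATGAAC").items():
    print(frame, codons)
```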



- Logic programming and Prolog
- Probabilistic logic programming and PRISM


LOGIC PROGRAMMING AND PROLOG

A Prolog program consists of a finite sequence of rules,

    B :- A1, ..., An.

These rules define implications, i.e.,

    B if A1 and ... and An


TERMS, LITERALS AND VARIABLES

Literals can consist of (possibly) structured terms that may include variables.

    number(0).
    number(s(X)) :- number(X).

Here number(0) is a fact, 0 is a constant, s(X) is a term, and X is a variable.


RESOLUTION

Problems are stated as theorems (goals) to be proved, e.g.,

    number(X)

To prove a consequent, we recursively need to prove the antecedents by using rules where these appear as consequents,

    number(0).
    number(s(X)) :- number(X).

Solutions and their derivations:

    X = 0:        number(X) → number(0)
    X = s(0):     number(X) → number(s(X)) → number(s(0))
    X = s(s(0)):  number(X) → number(s(X)) → number(s(s(X))) → number(s(s(0)))
    ...


DERIVATIONS AND EXPLANATION GRAPHS

Consider the following program which adds natural numbers:

    add(0+0,0).
    add(A+s(B),s(C)) :- add(A+B,C).
    add(s(A)+B,s(C)) :- add(A+B,C).

And suppose we call the goal

    add(s(s(0))+s(s(0)),R)

We now have two alternative applicable clauses, resulting in either

    add(s(0)+s(s(0)),C)

or

    add(s(s(0))+s(0),C)

with R = s(C) in both cases.


DERIVATION TREE

    s(s(0))+s(s(0))
      s(0)+s(s(0))
        0+s(s(0))
          0+s(0)
            0+0
        s(0)+s(0)
          0+s(0)
            0+0
          s(0)+0
            0+0
      s(s(0))+s(0)
        s(0)+s(0)
          0+s(0)
            0+0
          s(0)+0
            0+0
        s(s(0))+0
          s(0)+0
            0+0

Exponential!


EXPLANATION GRAPH

Polynomial: O(n ∗ m), but would be O(n + m) if arguments were ordered by size.


TABLING

Idea
- The system maintains a table of calls and their answers.
- When a new call is entered, check if it is stored in the table.
- If so, use the previously found solution.

Consequence:
- Explanation graph representation.
- Significant speed-up of program execution.
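To make the contrast between the derivation tree and the tabled explanation graph concrete, the following Python sketch counts both for the add program, with numerals s(s(...)) represented as plain integers. It is an illustration under that assumption, not PRISM's implementation, and the function names are hypothetical.

```python
def count_derivations(a, b):
    """Number of leaves in the untabled derivation tree for add(a+b, R)."""
    if a == 0 and b == 0:
        return 1
    total = 0
    if a > 0:  # clause add(s(A)+B, s(C)) :- add(A+B, C)
        total += count_derivations(a - 1, b)
    if b > 0:  # clause add(A+s(B), s(C)) :- add(A+B, C)
        total += count_derivations(a, b - 1)
    return total

def distinct_subgoals(a, b):
    """Number of calls a tabled evaluation has to solve: each one only once."""
    table = set()

    def visit(x, y):
        if (x, y) in table:  # call already in the table: reuse its answer
            return
        table.add((x, y))
        if x > 0:
            visit(x - 1, y)
        if y > 0:
            visit(x, y - 1)

    visit(a, b)
    return len(table)

print(count_derivations(2, 2), distinct_subgoals(2, 2))
print(count_derivations(10, 10), distinct_subgoals(10, 10))
```

For 2+2 the tree has 6 leaves against 9 table entries, but at 10+10 the tree has grown to 184756 leaves while the table holds only 121 calls, in line with the O(n ∗ m) bound.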



Probabilistic logic programming is a form of logic programming which deals with uncertainty.

Assign a probability to each possible derivation in a logic program.

Probabilistic inference, e.g.,
- Derive the probability of a goal
- Infer the most probable derivation of a goal
- Learn the affinities for different derivations from data


PRISM

- PRogramming In Statistical Modelling (PRISM) is a framework for probabilistic logic programming
- Developed by collaboration partners of the LoSt project: Yoshitaka Kameya, Taisuke Sato, and Neng-Fa Zhou.
- An extension of Prolog with random variables, called MSWs
- Provides efficient generalized inference algorithms
- PRISM program = probabilistic model


HIDDEN MARKOV MODEL EXAMPLE

Postcard

Greetings from wherever, where I am having a great time. Here is what I have been doing: The first two days, I stayed at the hotel reading a good book. Then, on the third day I decided to go shopping. The next three days I did nothing but lie on the beach. On my last day, I went shopping for some gifts to bring home and wrote you this postcard.

Sincerely, Some friend of yours

Observation sequence: read, read, shop, beach, beach, beach, shop


HIDDEN MARKOV MODEL RUN

Definition
A run of an HMM is a pair consisting of a sequence of states $s^{(0)} s^{(1)} \ldots s^{(n)}$, called a path, and a corresponding sequence of emissions $e^{(1)} \ldots e^{(n)}$, called an observation, such that
- $s^{(0)} = s_0$;
- $\forall i,\ 0 \le i \le n-1:\ p(s^{(i)}; s^{(i+1)}) > 0$ (probability to transit from $s^{(i)}$ to $s^{(i+1)}$);
- $\forall i,\ 0 < i \le n:\ p(s^{(i)}; e^{(i)}) > 0$ (probability to emit $e^{(i)}$ from $s^{(i)}$).

Definition
The probability of such a run is defined as

$$\prod_{i=1}^{n} p(s^{(i-1)}; s^{(i)}) \cdot p(s^{(i)}; e^{(i)})$$


DECODING WITH HIDDEN MARKOV MODELS

Infer the hidden path given the observation sequence.

$$\operatorname{argmax}_{\text{path}} P(\text{path} \mid \text{observation})$$

[Figure omitted; source: Wikipedia]

→ The Viterbi algorithm
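A minimal sketch of the Viterbi algorithm in Python follows. It is a textbook formulation rather than PRISM's generic implementation, it assumes a non-empty observation, and the parameter values are hypothetical.

```python
# Hypothetical parameters (not from the talk).
TRANS = {
    "start": {"sunny": 0.6, "rainy": 0.4},
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
EMIT = {
    "sunny": {"shop": 0.3, "beach": 0.6, "read": 0.1},
    "rainy": {"shop": 0.4, "beach": 0.1, "read": 0.5},
}

def viterbi(observation, states, trans, emit, start="start"):
    """Return (probability, path) of the most probable state path."""
    # best[s] = (probability, path) of the best partial run ending in state s
    best = {s: (trans[start][s] * emit[s][observation[0]], [s]) for s in states}
    for symbol in observation[1:]:
        best = {
            s: max(
                ((p * trans[prev][s] * emit[s][symbol], path + [s])
                 for prev, (p, path) in best.items()),
                key=lambda item: item[0],
            )
            for s in states
        }
    return max(best.values(), key=lambda item: item[0])

postcard = ["read", "read", "shop", "beach", "beach", "beach", "shop"]
print(viterbi(postcard, ["sunny", "rainy"], TRANS, EMIT))
```

Only the best partial run per state is kept at each step, which is what makes decoding linear in the length of the observation.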


EXAMPLE HMM IN PRISM

- values/2 declares the outcomes of random variables
- msw/2 simulates a random variable, stochastically selecting one of the outcomes
- Model in Prolog: specifies the relation between variables

Example HMM in PRISM

    % Outcomes of the transition and emission random variables
    values(trans(_), [sunny,rainy]).
    values(emit(_), [shop,beach,read]).

    % Entry point: generate or recognize an observation list L
    hmm(L) :- run_length(T), hmm(T,start,L).

    hmm(0,_,[]).
    hmm(T,State,[Emit|EmitRest]) :-
        T > 0,
        msw(trans(State),NextState),   % choose the next state
        msw(emit(NextState),Emit),     % choose the emission of that state
        T1 is T-1,
        hmm(T1,NextState,EmitRest).

    run_length(7).


A PEEK INTO MY WORK

- Overview of papers
- A few selected cases:
  - An abstraction: Constrained HMMs (also an optimization)
  - An optimization: Regarding tabling of structured data
  - A couple of applications: Genome models
    - Using constrained probabilistic models for gene finding with overlapping genes
    - Gene finding with a probabilistic model for the genome-wide sequence of reading frames


PAPERS 1

1. Henning Christiansen, Christian Theil Have, Ole Torp Lassen and Matthieu Petit
   Taming the Zoo of Discrete HMM Subspecies & some of their Relatives
   Frontiers in Artificial Intelligence and Applications, 2011

2. Henning Christiansen, Christian Theil Have, Ole Torp Lassen and Matthieu Petit
   Inference with constrained hidden Markov models in PRISM
   Theory and Practice of Logic Programming, 2010

3. Christian Theil Have
   Constraints and Global Optimization for Gene Prediction Overlap Resolution
   Workshop on Constraint Based Methods for Bioinformatics, 2011


PAPERS 2

4. Henning Christiansen, Christian Theil Have, Ole Torp Lassen and Matthieu Petit
   The Viterbi Algorithm expressed in Constraint Handling Rules
   7th International Workshop on Constraint Handling Rules, 2010

5. Christian Theil Have and Henning Christiansen
   Modeling Repeats in DNA Using Probabilistic Extended Regular Expressions
   Frontiers in Artificial Intelligence and Applications, 2011

6. Henning Christiansen, Christian Theil Have, Ole Torp Lassen and Matthieu Petit
   Bayesian Annotation Networks for Complex Sequence Analysis
   Technical Communications of the 27th International Conference on Logic Programming (ICLP’11)


PAPERS 3

7. Henning Christiansen, Christian Theil Have, Ole Torp Lassen and Matthieu Petit
   A declarative pipeline language for big data analysis
   Presented at LOPSTR, 2012

8. Christian Theil Have and Henning Christiansen
   Efficient Tabling of Structured Data Using Indexing and Program Transformation
   Practical Aspects of Declarative Languages, 2012

9. Neng-Fa Zhou and Christian Theil Have
   Efficient tabling of structured data with enhanced hash-consing
   Theory and Practice of Logic Programming, 2012


PAPERS 4

10. Christian Theil Have and Søren Mørk
    A Probabilistic Genome-Wide Gene Reading Frame Sequence Model
    Submitted to PLOS One, 2012

11. Christian Theil Have, Sine Zambach and Henning Christiansen
    Effects of using Coding Potential, Sequence Conservation and mRNA Structure Conservation for Predicting Pyrrolysine Containing Genes
    Submitted to BMC Bioinformatics, 2012


THE TROUBLE WITH TABLING OF STRUCTURED DATA



An innocent looking predicate: last/2

    last([X],X).
    last([_|L],X) :- last(L,X).

- Traverses a list to find the last element.
- Time/space complexity: O(n).
- If we table last/2: n + (n−1) + (n−2) + ... + 1 = O(n²)!

call: last([1,2,3,4,5],X)

call table:

    last([1,2,3,4,5],X).
    last([2,3,4,5],X).
    last([3,4,5],X).
    last([4,5],X).
    last([5],X).
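The quadratic cost comes from storing a full copy of the list argument for every recursive call. The Python sketch below simulates such a naive call table; the function names are hypothetical and this is not how any particular Prolog engine is implemented.

```python
def tabled_calls_of_last(items):
    """List arguments of the calls made by last/2, in call order."""
    calls = []
    suffix = list(items)
    while suffix:
        calls.append(list(suffix))  # a naive table stores a full copy
        suffix = suffix[1:]         # last([_|L],X) recurses on the tail L
    return calls

def cells_stored(items):
    """Total number of list cells copied into the call table."""
    return sum(len(call) for call in tabled_calls_of_last(items))

print(tabled_calls_of_last([1, 2, 3, 4, 5]))
print(cells_stored([1, 2, 3, 4, 5]))
```

For a list of length n the total is n(n+1)/2, so a list of 1000 elements already copies 500500 cells.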


A WORKAROUND IMPLEMENTED IN PROLOG

We describe a workaround giving O(1) time and space complexity for table lookups for programs with arbitrarily large ground structured data as input arguments.

- A term is represented as a set of facts.
- A subterm is referenced by a unique integer serving as an abstract pointer.
- Matching related to tabling is done solely by comparison of such pointers.


AN ABSTRACT DATA TYPE

store_term( +ground-term, pointer)
    The ground-term is any ground term, and the pointer returned is a unique reference (an integer) for that term.

retrieve_term( +pointer, ?functor, ?arg-pointers-list)
    Returns the functor and a list of pointers to representations of the substructures of the term represented by pointer.


ADT EXAMPLE

The following call converts the term f(a,g(b)) into its internal representation and returns a pointer value in the variable P.

    store_term(f(a,g(b)),P).

Implementation with assert, e.g.,

    retrieve_term(100,f,[101,102]).
    retrieve_term(101,a,[]).
    retrieve_term(102,g,[103]).
    retrieve_term(103,b,[]).
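The abstract data type can be mimicked in a few lines of Python. This sketch is an analogy, not the Prolog implementation: the class TermStore is hypothetical, compound terms are written as tuples (functor, arg1, ...), and the first pointer value 100 is chosen only to mirror the example.

```python
import itertools

class TermStore:
    """Represents a ground term as flat facts: pointer -> (functor, arg pointers)."""

    def __init__(self, first_pointer=100):
        self._pointers = itertools.count(first_pointer)
        self._facts = {}

    def store_term(self, term):
        """Store a term and return its pointer (an integer)."""
        pointer = next(self._pointers)
        if isinstance(term, tuple):
            functor, args = term[0], term[1:]
        else:
            functor, args = term, ()
        self._facts[pointer] = (functor, [self.store_term(arg) for arg in args])
        return pointer

    def retrieve_term(self, pointer):
        """Return (functor, list of argument pointers) for the given pointer."""
        return self._facts[pointer]

store = TermStore()
p = store.store_term(("f", "a", ("g", "b")))  # the term f(a,g(b))
print(p, store.retrieve_term(p))
```

Comparing two stored subterms then reduces to comparing two integers, which is what keeps table lookups cheap.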


AN AUTOMATIC PROGRAM TRANSFORMATION

We introduce an automatic transformation:

Structured terms are moved from the head of clauses to calls in the body to retrieve_term/2.


TRANSFORMED HIDDEN MARKOV MODEL

original version

    hmm(_,[]).
    hmm(S,[Ob|Y]) :-
        msw(out(S),Ob),
        msw(tr(S),Next),
        hmm(Next,Y).

transformed version

    hmm(S,ObsPtr) :-
        retrieve_term(ObsPtr,[]).
    hmm(S,ObsPtr) :-
        retrieve_term(ObsPtr,[Ob,Y]),
        msw(out(S),Ob),
        msw(tr(S),Next),
        hmm(Next,Y).


BENCHMARKING RESULTS

[Figure: two plots of running time (seconds) against sequence length (0 to 5000). a) Running time with indexed lookup (y-axis from 0.00 to 0.08 seconds). b) Running time without indexed lookup (y-axis from 0 to 140 seconds).]


THE NEXT STEP

Integration at the Prolog engine implementation level.

Neng-Fa Zhou and Christian Theil Have
Efficient tabling of structured data with enhanced hash-consing
Theory and Practice of Logic Programming, 2012

- Full sharing between tables (call and answer)
- Sharing with structured data in call stack


CONSTRAINED HMMS

Definition
A constrained HMM (CHMM)
- is an HMM
- extended with a set of constraints C, each of which is a mapping from HMM runs into {true, false}.


CONSTRAINED HMMS

Why extend an HMM with side-constraints?

- Convenient to express knowledge in terms of constraints
- Reuse underlying model with different assumptions
- Some constraints are not feasible as model structure (e.g. all_different)
- Fewer paths to consider for any given sequence → decreased running time (under certain conditions)


ALIGNMENT WITH A CONSTRAINED PAIR HMM

In a biological context, we may want to only consider alignments with a limited number of insertions and deletions, given the assumption that the two sequences are closely related.


ALIGNMENT WITH CONSTRAINTS


ADDING CONSTRAINT CHECKING TO THE HMM

HMM with constraint checking

    hmm(T,State,[Emit|EmitRest],StoreIn) :-
        T > 0,
        msw(trans(State),NxtState),
        msw(emit(NxtState),Emit),
        check_constraints([NxtState,Emit],StoreIn,StoreOut),
        T1 is T-1,
        hmm(T1,NxtState,EmitRest,StoreOut).

- Call to check_constraints/3 after each distinct sequence of msw applications
- Side-constraints: The constraints are assumed to be declared elsewhere and not interleaved with model specification
- Extra Store argument in the probabilistic predicate
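The idea of threading a constraint store through the model can be sketched in Python. This is an illustration only: it uses exhaustive search instead of PRISM's tabled Viterbi, the probability values are made up, and the single constraint shown (a cardinality bound on visits to the state rainy) is a hypothetical stand-in for the constraints of the library.

```python
# Hypothetical parameters (not from the talk).
TRANS = {
    "start": {"sunny": 0.6, "rainy": 0.4},
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
EMIT = {
    "sunny": {"shop": 0.3, "beach": 0.6, "read": 0.1},
    "rainy": {"shop": 0.4, "beach": 0.1, "read": 0.5},
}

def check_constraints(state, store, limit):
    """Cardinality constraint: visit "rainy" at most `limit` times.
    Returns the updated store, or None when the constraint is violated."""
    count = store + (state == "rainy")
    return count if count <= limit else None

def best_constrained_path(observation, limit, start="start"):
    """Return (probability, path) of the most probable consistent path."""
    best = (0.0, None)

    def extend(prev, t, prob, path, store):
        nonlocal best
        if t == len(observation):
            if prob > best[0]:
                best = (prob, path)
            return
        for state in EMIT:
            new_store = check_constraints(state, store, limit)
            if new_store is None:
                continue  # inconsistent: prune this branch
            step = TRANS[prev][state] * EMIT[state][observation[t]]
            extend(state, t + 1, prob * step, path + [state], new_store)

    extend(start, 0, 1.0, [], 0)
    return best

print(best_constrained_path(["read", "read", "beach"], limit=1))
print(best_constrained_path(["read", "read", "beach"], limit=3))
```

With the bound set to 1 the best path changes from rainy, rainy, sunny to rainy, sunny, sunny, showing how a side-constraint can select a different decoding without changing the underlying model.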


A LIBRARY OF GLOBAL CONSTRAINTS FOR HIDDEN MARKOV MODELS

Our implementation contains a few well-known global constraints adapted to Hidden Markov Models.

Global constraints
- cardinality
- lock_to_sequence
- all_different
- lock_to_set

In addition, the implementation provides operators which may be used to apply constraints to a limited set of variables.

Constraint operators
- state_specific
- emission_specific
- forall_subseq (sliding window operator)
- for_range (time step range operator)


TABLING ISSUES

Problem: The extra Store argument makes PRISM table multiple goals (for different constraint stores) when it should only store one.

    hmm(T,State,[Emit|EmitRest],Store)

To get rid of the extra argument, check_constraints dynamically maintains it as a stack using assert/retract.

Note: This is not a sound solution for all types of constraints (some need tabling).


IMPACT OF USING A SEPARATE CONSTRAINT STORE

[Figure not recoverable; surviving label: STACK]


GENOME MODELS

- Gene finding in a genomic context
- What are the constraints between adjacent genes in the genome?
  - Extent of (possible) overlap
    - Modeled as hard constraints
  - Gene reading frames, i.e., due to leading strand bias, operons etc.
    - Modeled as (probabilistic) soft constraints


AN APPLICATION OF CONSTRAINED MARKOV MODELS

We wish to incorporate overlapping gene constraints into gene finding.

Divide and conquer two step approach to gene finding:

1. Gene prediction: A gene finder supplies a set of candidate predictions $p_1 \ldots p_n$, called the initial set.

2. Pruning: The initial set is pruned according to certain rules or constraints. We call the pruned set the final set.


PRUNING STEP AS A CONSTRAINT OPTIMIZATION PROBLEM

CSP formulation
We introduce variables $X = x_1 \ldots x_n$ corresponding to each prediction $p_1 \ldots p_n$ in the initial set (sorted by position in genome). All variables have boolean domains, $\forall x_i \in X,\ D(x_i) = \{\text{true}, \text{false}\}$, and $x_i = \text{true} \Leftrightarrow p_i \in$ final set.

- Multiple solutions
- We want the “best” solution
- Optimize for prediction confidence scores
- Constraint Optimization Problem (COP)

COP formulation
Let the scores of $p_1 \ldots p_n$ be $s_1 \ldots s_n$ with $s_i \in \mathbb{R}^+$. Maximize $\sum_{i=1}^{n} s_i$, subject to C.
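For a handful of predictions the COP can be solved by brute force, which makes the formulation easy to inspect. The Python sketch below rests on two assumptions that are not stated on the slide: the objective is read as the total score of the predictions with $x_i$ = true, and the constraint set C is taken to be "no two selected predictions overlap". The function names are hypothetical.

```python
from itertools import product

def overlaps(a, b):
    """True when half-open intervals a = (left, right) and b overlap."""
    return a[0] < b[1] and b[0] < a[1]

def prune(predictions, scores):
    """Return (assignment, total): the best consistent assignment of
    booleans x_1..x_n and the total score of the selected predictions."""
    best_total, best_assignment = -1.0, None
    for assignment in product([True, False], repeat=len(predictions)):
        chosen = [p for p, x in zip(predictions, assignment) if x]
        consistent = all(
            not overlaps(p, q)
            for i, p in enumerate(chosen) for q in chosen[i + 1:]
        )
        total = sum(s for s, x in zip(scores, assignment) if x)
        if consistent and total > best_total:
            best_total, best_assignment = total, assignment
    return best_assignment, best_total

print(prune([(0, 10), (5, 15), (12, 20)], [3.0, 4.0, 2.0]))
```

The search visits all 2^n assignments, so it only serves to clarify the problem statement; it is not a substitute for the Markov chain approach.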


COP IMPLEMENTATION WITH MARKOV CHAIN (1)

We propose to use a (constrained) Markov chain for the COP.

- The Markov chain has a begin state, an end state and two states for each variable $x_i$, corresponding to its boolean domain $D(x_i)$.
- The state corresponding to $x_i = \text{true}$ is denoted $\alpha_i$ and the state corresponding to $x_i = \text{false}$ is denoted $\beta_i$.
- In this model, a path from the begin state to the end state corresponds to a potential solution of the CSP.


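FROM CONFIDENCE SCORES TO TRANSITION PROBABILITIES

- $P(\alpha_1 \mid \text{begin}) = \sigma_1$
- $P(\beta_1 \mid \text{begin}) = 1 - \sigma_1$
- $P(\text{end} \mid \alpha_n) = P(\text{end} \mid \beta_n) = 1$
- $P(\alpha_i \mid \alpha_{i-1}) = P(\alpha_i \mid \beta_{i-1}) = \sigma_i$
- $P(\beta_i \mid \alpha_{i-1}) = P(\beta_i \mid \beta_{i-1}) = 1 - \sigma_i$

$$\sigma_i = 0.5 + \lambda + \frac{(0.5 - \lambda) \times (s_i - \min(s_1 \ldots s_n))}{\max(s_1 \ldots s_n) - \min(s_1 \ldots s_n)}$$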
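The mapping from scores to transition probabilities can be sketched in Python as follows. The sketch assumes that the scores are not all equal (otherwise the denominator is zero) and that λ lies between 0 and 0.5; the function names and the state labels alpha1, beta1, ... are hypothetical.

```python
def sigmas(scores, lam):
    """sigma_i = 0.5 + lam + (0.5 - lam) * (s_i - min) / (max - min)."""
    low, high = min(scores), max(scores)
    return [0.5 + lam + (0.5 - lam) * (s - low) / (high - low) for s in scores]

def transition_probabilities(scores, lam):
    """Return {(from_state, to_state): probability} for the Markov chain."""
    sig = sigmas(scores, lam)
    n = len(sig)
    probs = {}
    for i, sigma in enumerate(sig, start=1):
        sources = ["begin"] if i == 1 else [f"alpha{i - 1}", f"beta{i - 1}"]
        for src in sources:
            probs[(src, f"alpha{i}")] = sigma      # keep prediction i
            probs[(src, f"beta{i}")] = 1 - sigma   # discard prediction i
    probs[(f"alpha{n}", "end")] = 1.0
    probs[(f"beta{n}", "end")] = 1.0
    return probs

print(sigmas([1.0, 3.0, 5.0], lam=0.1))
```

Under these assumptions every σ lies between 0.5 + λ and 1, so keeping a prediction is always at least as probable as discarding it, and higher scores make it more so.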


ENCODING CONSTRAINTS WITH CONSTRAINT HANDLING RULES

Constraints: alpha/2 and beta/2 ≈ visited states.

Example: Genemark inconsistency rules

    alpha(Left1,Right1), alpha(Left2,Right2) <=>
        Left1 =< Left2, Right1 >= Right2 | fail.

    beta(Left1,Right1), alpha(Left2,Right2) <=>
        Left1 =< Left2, Right1 >= Right2 | fail.

- The most probable consistent path is found using PRISM's generic adaptation of the Viterbi algorithm
- Each step adds either an alpha or a beta (active) constraint
- Incremental pruning: For each step we only apply constraints which may be transitively involved in rules with the active constraint


EXPERIMENTAL RESULTS

- Prediction on E. coli using a simplistic codon frequency based gene finder.
- Pruning using our global optimization approach (with all inconsistency rules) versus local heuristic rules [*].

    Method               #predictions  Sensitivity  Specificity  Time (seconds)
    initial set          10799         0.7625       0.2926       na
    Genemark rules       5823          0.7558       0.5379       1.4
    ECOGENE rules        4981          0.7148       0.5947       1.7
    global optimization  5222          0.7201       0.5714       75

- Sensitivity = fraction of known reference genes predicted.
- Specificity = fraction of predicted genes that are correct.

[*] Note that the results for the ECOGENE heuristic may vary depending on execution strategy. For the results above, predictions with lower left position are considered first.
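The two measures, as defined above, are simple set ratios. The Python sketch below uses made-up gene identifiers purely for illustration. Note that specificity in this sense (the fraction of predictions that are correct) is what is more commonly called precision.

```python
def sensitivity(predicted, reference):
    """Fraction of known reference genes that were predicted."""
    return len(predicted & reference) / len(reference)

def specificity(predicted, reference):
    """Fraction of predicted genes that are correct (i.e. precision)."""
    return len(predicted & reference) / len(predicted)

predicted = {"a", "b", "c", "d"}   # hypothetical predictions
reference = {"a", "b", "e"}        # hypothetical reference genes
print(sensitivity(predicted, reference), specificity(predicted, reference))
```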


A MODEL FOR THE GENOME-WIDE SEQUENCE OF READING FRAMES

We wish to incorporate gene reading frame constraints into gene finding.

Divide and conquer two step approach to gene finding (again):

1. Gene prediction: A gene finder supplies a set of candidate predictions $p_1 \ldots p_n$, called the initial set.

2. Pruning: The initial set is pruned according to gene finder confidence scores and the probabilities of adjacent gene reading frames. We call the pruned set the final set.


METHODOLOGY

- Gene predictions are sorted by stop codon position.
- Gene finder scores are discretized into symbolic values.
- A type of Hidden Markov Model which we call a delete-HMM:
  - A state for each of the six possible reading frames and
  - one delete state

[Figure: delete-HMM with frame states F1 to F6 and a delete state]


MODEL

- F1 ... F6 emissions: Finite set of n symbols $\delta_1 \ldots \delta_n$ corresponding to ranges of prediction scores
- Delete state emission: $P(\delta_i \mid \text{state} = \text{delete}) = \frac{FP_{\delta_i}}{FP}$
- Frame state transitions: Relative frequency of "observed" adjacent gene reading frame pairs (normalized)
- Transition to delete: $P(\text{delete}) = 1 - \frac{TP}{TP + FP}$ (tunable)

[Figure: delete-HMM with frame states F1 to F6 and a delete state]
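The delete-state parameters follow directly from counts of true and false positive predictions. The Python sketch below uses made-up counts and a hypothetical function name; it only restates the two formulas for the delete state.

```python
def delete_state_parameters(fp_by_symbol, tp):
    """fp_by_symbol maps each score symbol to its number of false positives;
    tp is the total number of true positive predictions."""
    fp = sum(fp_by_symbol.values())
    # P(symbol | state = delete) = FP_symbol / FP
    emission = {symbol: count / fp for symbol, count in fp_by_symbol.items()}
    # P(delete) = 1 - TP / (TP + FP)
    p_delete = 1 - tp / (tp + fp)
    return emission, p_delete

emission, p_delete = delete_state_parameters({"low": 6, "mid": 3, "high": 1}, tp=30)
print(emission, p_delete)
```

With these counts most false positives fall in the low score range, so the delete state mostly absorbs low-scoring predictions.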


RESULTS

[Figure: ROC curves of sensitivity (TPR, from 0.5 to 1.0) against 1−specificity (FPR, from 0.0 to 0.5). Legend: threshold; frameseq trained on Escherichia; frameseq trained on Salmonella; frameseq trained on Legionella; frameseq trained on Bacillus; frameseq trained on Thermoplasma.]


CONCLUSIONS

1. To what extent is it possible to use probabilistic logic programming for biological sequence analysis?

2. How can constraints relevant to the domain of biological sequence analysis be combined with probabilistic logic programming?

3. What are the limitations with regard to efficiency and how can these be dealt with?


CONCLUSIONS

- Commonly used models for biological sequence analysis can be conveniently expressed using probabilistic logic programming
- Probabilistic logic programming is also a powerful tool for experimenting with new kinds of models
- It can support integration of constraints in a variety of ways
- Efficiency is an issue, but with suitable optimizations it is efficient enough for many interesting problems
- It is not merely a powerful abstraction
  - A valuable and practical tool for biological sequence analysis


THANKS
