Efficient Probabilistic Logic Programming for Biological Sequence Analysis

Slides from Ph.D. defense

INTRODUCTION Background A peek into my work Conclusions

Christian Theil Have
Research group PLIS: Programming, Logic and Intelligent Systems
Department of Communication, Business and Information Technologies
Roskilde University

February 6, 2013

OUTLINE

Introduction
    Domain
    Research questions

Background
    Gene finding
    Probabilistic Logic Programming

A peek into my work
    Overview of papers
    The trouble with tabling of structured data
    Constrained HMMs
    Applications: Genome models

Conclusions

INTRODUCTION

BIOLOGICAL SEQUENCE ANALYSIS

- Subfield of bioinformatics
- Analyze biological sequences
    - DNA
    - RNA
    - Proteins
- to understand
    - Features
    - Functions
    - Evolutionary relationships

PROBABILISTIC LOGIC PROGRAMMING

- Declarative programming paradigm
- Ability to express common and complex models used in biological sequence analysis
- Concise expression of complex models
- Separation between logic and control
    - Generic inference algorithms
    - Transformations

MODELS FOR BIOLOGICAL SEQUENCE ANALYSIS

- Reflect relationships between features of sequence data
- Embody constraints (assumptions about data)
- Infer information from data
- Reasoning about uncertainty → probabilities

THE LOST PROJECT

"... seeks to improve ease of modeling, accuracy and reliability of sequence analysis by using logic-statistical models ..."

Key focus areas:

- The PRISM system
- Prokaryotic gene finding

My Ph.D. project is part of the LoSt project and shares these focus areas.

RESEARCH QUESTIONS

1. To what extent is it possible to use probabilistic logic programming for biological sequence analysis?

2. How can constraints relevant to the domain of biological sequence analysis be combined with probabilistic logic programming?

3. What are the limitations with regard to efficiency and how can these be dealt with?

I believe that these are the central questions that need to be addressed in order to construct useful tools for biological sequence analysis using probabilistic logic programming.

RELATIONS BETWEEN RESEARCH QUESTIONS

1. To what extent is it possible to use probabilistic logic programming for biological sequence analysis?

2. How can constraints relevant to the domain of biological sequence analysis be combined with probabilistic logic programming?

3. What are the limitations with regard to efficiency and how can these be dealt with?

APPROACH

To build and evaluate

- Applications
- Abstractions
- Optimizations

for biological sequence analysis using probabilistic logic programming.

APPROACH

- Applications
    - Deal with relevant biological sequence analysis problems
    - Potentially to contribute new knowledge to biology or bioinformatics
    - Direct substantiation with regard to research question 1
- Abstractions
    - Ease modeling
    - Language for incorporating constraints from the domain
    - A higher level of declarativity
    - Focus on problem rather than implementation (model) details
- Optimizations
    - Deal with limitations of probabilistic logic programming that may hinder its use in biological sequence analysis.
    - Efficient inference is a precondition for practical use.

BACKGROUND

- Prokaryotic gene finding
- Probabilistic logic programming

PROKARYOTIC GENE FINDING

Identify regions of DNA which encode proteins:

A (prokaryotic) gene is a consecutive stretch of DNA which
- is transcribed as part of an RNA,
- is translated to a complete protein,
- has a length which is a multiple of three (codons),
- starts with a “start” codon, and
- ends with a “stop” codon.

GENES AND OPEN READING FRAMES

The identification of prokaryotic genes may be decomposed into two distinct problems:

1. Identification of ORFs which contain protein coding genes.
2. Identification of the correct start codon within an ORF.

〈ORF〉 ::= 〈start〉 〈not-stop〉* 〈stop〉
〈start〉 ::= TTG | CTG | ATT | ATC | ATA | ATG | GTG
〈stop〉 ::= TAA | TAG | TGA
〈not-stop〉 ::= AAA | ... | TTT   // all codons except those in 〈stop〉
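
The ORF grammar above maps fairly directly onto a Prolog definite clause grammar. The following is only a minimal sketch, not code from the slides: the predicate names (codon//1, start/1, stop/1, is_orf/1) and the choice to represent a sequence as a lowercase atom such as atgaaataa are assumptions made for illustration, and the code assumes a Prolog with DCG support such as SWI-Prolog.

```prolog
% A codon is three consecutive nucleotides.
codon([X,Y,Z]) --> [X,Y,Z].

% Start and stop codons as listed in the grammar (lowercase atoms
% are an assumption of this sketch).
start([t,t,g]).  start([c,t,g]).  start([a,t,t]).  start([a,t,c]).
start([a,t,a]).  start([a,t,g]).  start([g,t,g]).

stop([t,a,a]).   stop([t,a,g]).   stop([t,g,a]).

% <ORF> ::= <start> <not-stop>* <stop>
orf --> start_codon, not_stops, stop_codon.

start_codon --> codon(C), { start(C) }.
stop_codon  --> codon(C), { stop(C) }.

% Zero or more codons, none of which is a stop codon.
not_stops --> [].
not_stops --> codon(C), { \+ stop(C) }, not_stops.

% is_orf(+Atom): the whole sequence is exactly one ORF.
is_orf(Atom) :-
    atom_chars(Atom, Nucleotides),
    once(phrase(orf, Nucleotides)).
```

With these definitions, is_orf(atgaaataa) succeeds, while is_orf(atgtaaaaataa) fails because a stop codon occurs before the end of the sequence.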

SIGNALS FOR PROKARYOTIC GENE FINDING

- Open reading frames
    - Length
    - Nucleotide sequence composition
    - Conservation (sequence similarity in other organisms)
- Local context
    - Promoters
    - Ribosomal binding site
    - Termination signal

[Figure: schematic of the local context of a gene. Labels, in order of appearance: GB, -35, PB, -10, +1, tss, SD, ≈ +10, Gene, ≈ +15-20, Terminator.]

READING FRAMES AND OVERLAPPING GENES

- RNA can be transcribed from either strand
- Genes may start in different “reading frames”
- Genes can overlap
    - in the same and in different reading frames
    - on opposite strands

PROBABILISTIC LOGIC PROGRAMMING

- Logic programming and Prolog
- Probabilistic logic programming and PRISM

LOGIC PROGRAMMING AND PROLOG

A Prolog program consists of a finite sequence of rules,

B :- A1, ..., An.

These rules define implications, i.e.,

B if A1 and ... and An

TERMS, LITERALS AND VARIABLES

Literals can consist of (possibly) structured terms that may include variables.

number(0).
number(s(X)) :- number(X).

Here number(0). is a fact, 0 is a constant, s(X) is a term, and X is a variable.

RESOLUTION

Problems are stated as theorems (goals) to be proved, e.g.,

number(X)

To prove a consequent, we recursively need to prove the antecedents by using rules where these appear as consequents,

number(0).
number(s(X)) :-
    number(X).

Solutions
number(X) → X = 0
number(X) → X = s(0)
number(X) → X = s(s(0))
...

Derivation
number(X) → number(s(X)) → number(s(s(X))) → number(s(s(0)))
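
The enumeration of solutions can be tried out directly. The sketch below is an illustration rather than part of the slides: it renames the predicate to the hypothetical nat/1, because number/1 is a built-in type check in many Prolog systems and cannot be redefined there, and it assumes SWI-Prolog, whose library(solution_sequences) provides limit/2 for cutting off an infinite solution sequence.

```prolog
:- use_module(library(solution_sequences)).

% Same program as on the slide, with number/1 renamed to nat/1.
nat(0).
nat(s(X)) :- nat(X).

% first_nats(+N, -Sols): the first N solutions of nat(X),
% in the order in which resolution finds them.
first_nats(N, Sols) :-
    findall(X, limit(N, nat(X)), Sols).
```

The query first_nats(3, S) yields S = [0, s(0), s(s(0))], matching the three solutions listed above.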

DERIVATIONS AND EXPLANATION GRAPHS

Consider the following program which adds natural numbers:

add(0+0,0).
add(A+s(B),s(C)) :-
    add(A+B,C).
add(s(A)+B,s(C)) :-
    add(A+B,C).

And suppose we call the goal,

add(s(s(0))+s(s(0)),R)

We now have two alternative applicable clauses, resulting in either,

add(s(0)+s(s(0)),s(R))
or
add(s(s(0))+s(0),s(R))

DERIVATION TREE

s(s(0))+s(s(0))
    s(0)+s(s(0))
        0+s(s(0))
            0+s(0)
                0+0
        s(0)+s(0)
            0+s(0)
                0+0
            s(0)+0
                0+0
    s(s(0))+s(0)
        s(0)+s(0)
            0+s(0)
                0+0
            s(0)+0
                0+0
        s(s(0))+0
            s(0)+0
                0+0

Exponential!

EXPLANATION GRAPH

Polynomial: O(n * m), but would be O(n + m) if arguments were ordered by size.

TABLING

Idea:
- The system maintains a table of calls and their answers.
- When a new call is entered, check if it is stored in the table.
- If so, use the previously found solution.

Consequence:
- Explanation graph representation.
- Significant speed-up of program execution.
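
The effect can be observed on the add/2 program from the earlier slide. The following is a sketch under the assumption of a Prolog with a `:- table` directive (SWI-Prolog is assumed here); add_t/2, count_derivations/1 and count_answers/1 are hypothetical names introduced only for this comparison. Without tabling, every derivation of the goal is explored separately. With tabling, each distinct call is evaluated once and duplicate answers are merged.

```prolog
:- table add_t/2.

% Untabled version, as on the slide.
add(0+0, 0).
add(A+s(B), s(C)) :- add(A+B, C).
add(s(A)+B, s(C)) :- add(A+B, C).

% Identical clauses, but evaluated with tabling.
add_t(0+0, 0).
add_t(A+s(B), s(C)) :- add_t(A+B, C).
add_t(s(A)+B, s(C)) :- add_t(A+B, C).

% Number of derivations found for 2+2 without tabling.
count_derivations(N) :-
    aggregate_all(count, add(s(s(0))+s(s(0)), _), N).

% Number of answers returned for 2+2 with tabling.
count_answers(N) :-
    aggregate_all(count, add_t(s(s(0))+s(s(0)), _), N).
```

For the goal with arguments 2 and 2, count_derivations(N) gives N = 6 (one per leaf of the derivation tree), whereas count_answers(N) gives N = 1.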

PROBABILISTIC LOGIC PROGRAMMING

Probabilistic logic programming is a form of logic programming which deals with uncertainty.

Assign a probability to each possible derivation in a logic program.

Probabilistic inference, e.g.,
- Derive the probability of a goal
- Infer the most probable derivation of a goal
- Learn the affinities for different derivations from data

PRISM

- PRogramming In Statistical Modelling is a framework for probabilistic logic programming
- Developed by collaboration partners of the LoSt project: Yoshitaka Kameya, Taisuke Sato, and Neng-Fa Zhou.
- An extension of Prolog with random variables, called MSWs
- Provides efficient generalized inference algorithms
- PRISM program = probabilistic model

HIDDEN MARKOV MODEL EXAMPLE

Postcard:
Greetings from wherever, where I am having a great time. Here is what I have been doing: The first two days, I stayed at the hotel reading a good book. Then, on the third day I decided to go shopping. The next three days I did nothing but lie on the beach. On my last day, I went shopping for some gifts to bring home and wrote you this postcard.

Sincerely, Some friend of yours

Observation sequence: read, read, shop, beach, beach, beach, shop

HIDDEN MARKOV MODEL RUN

Definition: A run of an HMM is a pair consisting of a sequence of states $s^{(0)} s^{(1)} \ldots s^{(n)}$, called a path, and a corresponding sequence of emissions $e^{(1)} \ldots e^{(n)}$, called an observation, such that
- $s^{(0)} = s_0$;
- $\forall i,\ 0 \le i \le n-1:\ p(s^{(i)}; s^{(i+1)}) > 0$ (probability to transit from $s^{(i)}$ to $s^{(i+1)}$);
- $\forall i,\ 0 < i \le n:\ p(s^{(i)}; e^{(i)}) > 0$ (probability to emit $e^{(i)}$ from $s^{(i)}$).

Definition: The probability of such a run is defined as

$\prod_{i=1}^{n} p(s^{(i-1)}; s^{(i)}) \cdot p(s^{(i)}; e^{(i)})$
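
The product in the second definition can be computed by walking the path and the observation in step. The sketch below is plain Prolog, not PRISM. The transition and emission probabilities are invented for illustration and do not come from the slides, and the predicate names trans/3, emit/3 and run_probability/3 are assumptions of this example. The initial state s_0 is represented by the atom start.

```prolog
% Hypothetical transition probabilities p(S; S').
trans(start, sunny, 0.5).
trans(start, rainy, 0.5).
trans(sunny, sunny, 0.75).
trans(sunny, rainy, 0.25).
trans(rainy, sunny, 0.5).
trans(rainy, rainy, 0.5).

% Hypothetical emission probabilities p(S; E).
emit(sunny, shop,  0.25).
emit(sunny, beach, 0.5).
emit(sunny, read,  0.25).
emit(rainy, shop,  0.25).
emit(rainy, beach, 0.25).
emit(rainy, read,  0.5).

% run_probability(+Path, +Observation, -P)
% Path lists s(1)..s(n); s(0) is the fixed initial state start.
run_probability(Path, Obs, P) :-
    run_probability(start, Path, Obs, 1.0, P).

run_probability(_, [], [], P, P).
run_probability(Prev, [S|Ss], [E|Es], Acc, P) :-
    trans(Prev, S, PT), PT > 0,      % transition must be possible
    emit(S, E, PE),     PE > 0,      % emission must be possible
    Acc1 is Acc * PT * PE,
    run_probability(S, Ss, Es, Acc1, P).
```

Under these made-up parameters, the path [rainy, rainy, sunny] with observation [read, read, shop] has probability (0.5 * 0.5) * (0.5 * 0.5) * (0.5 * 0.25) = 0.0078125.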

DECODING WITH HIDDEN MARKOV MODELS

Infer the hidden path given the observation sequence.

$\operatorname{argmax}_{\mathit{path}} P(\mathit{path} \mid \mathit{observation})$

[Figure; source: Wikipedia]

→ The Viterbi algorithm

EXAMPLE HMM IN PRISM

- values/2 declares the outcomes of random variables
- msw/2 simulates a random variable, stochastically selecting one of the outcomes
- Model in Prolog: specifies the relation between variables

Example HMM in PRISM

values(trans(_), [sunny,rainy]).
values(emit(_), [shop,beach,read]).

hmm(L) :-
    run_length(T),
    hmm(T,start,L).

hmm(0,_,[]).
hmm(T,State,[Emit|EmitRest]) :-
    T > 0,
    msw(trans(State),NextState),
    msw(emit(NextState),Emit),
    T1 is T-1,
    hmm(T1,NextState,EmitRest).

run_length(7).

A PEEK INTO MY WORK

- Overview of papers
- A few selected cases:
    - An abstraction: Constrained HMMs (also an optimization)
    - An optimization: Regarding tabling of structured data
    - A couple of applications: Genome models
        - Using constrained probabilistic models for gene finding with overlapping genes
        - Gene finding with a probabilistic model for genome-sequence of reading

PAPERS

1. Henning Christiansen, Christian Theil Have, Ole Torp Lassen and Matthieu Petit. Taming the Zoo of Discrete HMM Subspecies & some of their Relatives. Frontiers in Artificial Intelligence and Applications, 2011.

2. Henning Christiansen, Christian Theil Have, Ole Torp Lassen and Matthieu Petit. Inference with constrained hidden Markov models in PRISM. Theory and Practice of Logic Programming, 2010.

3. Christian Theil Have. Constraints and Global Optimization for Gene Prediction Overlap Resolution. Workshop on Constraint Based Methods for Bioinformatics, 2011.

4. Henning Christiansen, Christian Theil Have, Ole Torp Lassen and Matthieu Petit. The Viterbi Algorithm expressed in Constraint Handling Rules. 7th International Workshop on Constraint Handling Rules, 2010.

5. Christian Theil Have and Henning Christiansen. Modeling Repeats in DNA Using Probabilistic Extended Regular Expressions. Frontiers in Artificial Intelligence and Applications, 2011.

6. Henning Christiansen, Christian Theil Have, Ole Torp Lassen and Matthieu Petit. Bayesian Annotation Networks for Complex Sequence Analysis. Technical Communications of the 27th International Conference on Logic Programming (ICLP’11).

7. Henning Christiansen, Christian Theil Have, Ole Torp Lassen and Matthieu Petit. A declarative pipeline language for big data analysis. Presented at LOPSTR, 2012.

8. Christian Theil Have and Henning Christiansen. Efficient Tabling of Structured Data Using Indexing and Program Transformation. Practical Aspects of Declarative Languages, 2012.

9. Neng-Fa Zhou and Christian Theil Have. Efficient tabling of structured data with enhanced hash-consing. Theory and Practice of Logic Programming, 2012.

10. Christian Theil Have and Søren Mørk. A Probabilistic Genome-Wide Gene Reading Frame Sequence Model. Submitted to PLOS One, 2012.

11. Christian Theil Have, Sine Zambach and Henning Christiansen. Effects of using Coding Potential, Sequence Conservation and mRNA Structure Conservation for Predicting Pyrrolysine Containing Genes. Submitted to BMC Bioinformatics, 2012.

THE TROUBLE WITH TABLING OF STRUCTURED DATA

An innocent looking predicate: last/2

last([X],X).
last([_|L],X) :-
    last(L,X).

- Traverses a list to find the last element.
- Time/space complexity: O(n).
- If we table last/2: n + (n−1) + (n−2) + ... + 1 = O(n²)!

call: last([1,2,3,4,5],X)

call table:
last([1,2,3,4,5],X).
last([2,3,4,5],X).
last([3,4,5],X).
last([4,5],X).
last([5],X).
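
The quadratic growth can be made concrete by counting how many list cells end up in the call table when every call stores its own copy of its list argument. This is a minimal sketch of that count under the assumption that each tabled call copies its argument. The predicate table_cost/2 is a hypothetical helper, not part of any tabling system.

```prolog
% table_cost(+List, -Cells)
% Cells is the total number of list elements stored if each call
% last(Suffix, X) made while traversing List is tabled with a full
% copy of Suffix.
table_cost([], 0).
table_cost([X|Xs], Cells) :-
    length([X|Xs], Len),          % size of the copy for this call
    table_cost(Xs, Rest),         % copies made by the recursive calls
    Cells is Len + Rest.
```

For the five-element list above, table_cost([1,2,3,4,5], C) gives C = 15, which is 5 + 4 + 3 + 2 + 1, or n(n+1)/2 in general.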

A WORKAROUND IMPLEMENTED IN PROLOG

We describe a workaround giving O(1) time and space complexity for table lookups for programs with arbitrarily large ground structured data as input arguments.

- A term is represented as a set of facts.
- A subterm is referenced by a unique integer serving as an abstract pointer.
- Matching related to tabling is done solely by comparison of such pointers.

AN ABSTRACT DATA TYPE

store_term(+ground-term, -pointer)
    The ground-term is any ground term, and the pointer returned is a unique reference (an integer) for that term.

retrieve_term(+pointer, ?functor, ?arg-pointers-list)
    Returns the functor and a list of pointers to representations of the substructures of the term represented by pointer.

ADT EXAMPLE

The following call converts the term f(a,g(b)) into its internal representation and returns a pointer value in the variable P.

store_term(f(a,g(b)),P).

Implementation with assert, e.g.,

retrieve_term(100,f,[101,102]).
retrieve_term(101,a,[]).
retrieve_term(102,g,[103]).
retrieve_term(103,b,[]).
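
One way such an assert-based store could be written is sketched below. It is an illustration of the idea, not the implementation from the papers: the counter next_pointer/1, the helpers fresh_pointer/1 and store_args/2, the starting value 100 and the inverse rebuild_term/2 are all assumptions of this example. It also makes no attempt to share identical subterms.

```prolog
:- dynamic retrieve_term/3.
:- dynamic next_pointer/1.

% Pointer counter; the start value 100 only mirrors the example above.
next_pointer(100).

fresh_pointer(P) :-
    retract(next_pointer(P)),
    P1 is P + 1,
    assertz(next_pointer(P1)).

% store_term(+GroundTerm, -Pointer)
% Asserts one retrieve_term/3 fact per subterm.
store_term(Term, Pointer) :-
    ground(Term),
    fresh_pointer(Pointer),
    Term =.. [Functor|Args],
    store_args(Args, ArgPointers),
    assertz(retrieve_term(Pointer, Functor, ArgPointers)).

store_args([], []).
store_args([A|As], [P|Ps]) :-
    store_term(A, P),
    store_args(As, Ps).

% rebuild_term(+Pointer, -Term): inverse of store_term/2.
rebuild_term(Pointer, Term) :-
    retrieve_term(Pointer, Functor, ArgPointers),
    maplist(rebuild_term, ArgPointers, Args),
    Term =.. [Functor|Args].
```

In a fresh session, store_term(f(a,g(b)),P) binds P to 100 and asserts the four facts shown above. Two calls for the same term get different pointers in this sketch, so it does not yet give the sharing that the pointer comparison relies on.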

AN AUTOMATIC PROGRAM TRANSFORMATION

We introduce an automatic transformation:

Structured terms are moved from the head of clauses to calls in the body to retrieve_term/2.

TRANSFORMED HIDDEN MARKOV MODEL

original version

hmm(_,[]).
hmm(S,[Ob|Y]) :-
    msw(out(S),Ob),
    msw(tr(S),Next),
    hmm(Next,Y).

transformed version

hmm(S,ObsPtr) :-
    retrieve_term(ObsPtr,[]).
hmm(S,ObsPtr) :-
    retrieve_term(ObsPtr,[Ob,Y]),
    msw(out(S),Ob),
    msw(tr(S),Next),
    hmm(Next,Y).

BENCHMARKING RESULTS

[Figure: two plots of running time (seconds) against sequence length (0 to 5000). a) Running time with indexed lookup (vertical axis from 0.00 to 0.08 seconds). b) Running time without indexed lookup (vertical axis from 0 to 140 seconds).]

THE NEXT STEP

Integration at the Prolog engine implementation level.

Neng-Fa Zhou and Christian Theil Have. Efficient tabling of structured data with enhanced hash-consing. Theory and Practice of Logic Programming, 2012.

- Full sharing between tables (call and answer)
- Sharing with structured data in call stack

CONSTRAINED HMMS

Definition: A constrained HMM (CHMM)
- is an HMM
- extended with a set of constraints C, each of which is a mapping from HMM runs into {true, false}.

CONSTRAINED HMMS

Why extend an HMM with side-constraints?

- Convenient to express knowledge in terms of constraints
- Reuse underlying model with different assumptions
- Some constraints are not feasible as model structure (e.g. all_different)
- Fewer paths to consider for any given sequence → decreased running time (under certain conditions)

ALIGNMENT WITH A CONSTRAINED PAIR HMM

In a biological context, we may want to only consider alignments with a limited number of insertions and deletions given the assumption that the two sequences are closely related.

ALIGNMENT WITH CONSTRAINTS

Page 73: Efficient Probabilistic Logic Programming for Biological Sequence Analysis

INTRODUCTION Background A peek into my work Conclusions

ADDING CONSTRAINT CHECKING TO THE HMMHMM with constraint checking

hmm(T,State,[Emit|EmitRest],StoreIn) :-T > 0,msw(trans(State),NxtState),msw(emit(NxtState),Emit),check_constraints([NxtState,Emit],StoreIn,StoreOut),T1 is T-1,hmm(T1,NxtState,EmitRest,StoreOut).

I Call to check_constraints/3 after each distinctsequence of msw applications

I Side-constaints: The constraints are assumed to bedeclared elsewhere and not interleaved with modelspecification

I Extra Store argument in the probabilistic predicate
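
The store-threading pattern can be sketched outside PRISM as follows. This is a Python stand-in under stated assumptions: `check_cardinality` is a hypothetical analogue of check_constraints/3 for a single cardinality constraint, and the store is just a counter. The point is only that the check runs after every step and that a failed check prunes the whole subtree below the current prefix.

```python
def check_cardinality(step, store_in, *, state="b", limit=2):
    """Hypothetical analogue of check_constraints/3 for one cardinality constraint.

    `step` is (next_state, emission); `store_in` counts earlier visits to `state`.
    Returns the updated store, or None when the constraint is violated.
    """
    next_state, _emission = step
    store_out = store_in + (1 if next_state == state else 0)
    return store_out if store_out <= limit else None

def consistent_paths(t, state, store, states=("a", "b")):
    """Enumerate state paths of length t, checking the constraint after each step.

    `state` mirrors the State argument of hmm/4; this toy has no transition
    probabilities, so it is carried along but not otherwise used.
    """
    if t == 0:
        yield ()
        return
    for nxt in states:
        store_out = check_cardinality((nxt, None), store)
        if store_out is None:
            continue  # prune: no extension of this prefix can become consistent
        for rest in consistent_paths(t - 1, nxt, store_out, states):
            yield (nxt,) + rest
```

With the default limit of two visits to state `b`, `list(consistent_paths(3, "a", 0))` contains every length-3 path except `("b", "b", "b")`.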

Page 74: Efficient Probabilistic Logic Programming for Biological Sequence Analysis

A LIBRARY OF GLOBAL CONSTRAINTS FOR HIDDEN MARKOV MODELS

Our implementation contains a few well-known global constraints adapted to Hidden Markov Models.

Global constraints
cardinality      lock_to_sequence
all_different    lock_to_set

In addition, the implementation provides operators which may be used to apply constraints to a limited set of variables.

Constraint operators
state_specific
emission_specific
forall_subseq (sliding window operator)
for_range (time step range operator)

Page 75: Efficient Probabilistic Logic Programming for Biological Sequence Analysis

TABLING ISSUES

Problem: The extra Store argument makes PRISM table multiple goals (for different constraint stores) when it should only store one.

hmm(T,State,[Emit|EmitRest],Store)

To get rid of the extra argument, check_constraints dynamically maintains it as a stack using assert/retract.

Note: This is not a sound solution for all types of constraints (some need tabling).
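
The effect of the extra argument on table size can be illustrated with ordinary memoization. The sketch below is Python, not PRISM, and uses `functools.lru_cache` as a stand-in for tabling; the store is assumed to be the full history of visited states, which is a worst case chosen for illustration. Both functions compute the same value, but the second one is tabled once per distinct store.

```python
from functools import lru_cache

STATES = ("a", "b")

@lru_cache(maxsize=None)
def count_plain(t, state):
    """Number of state paths of length t; tabled on (t, state) only."""
    if t == 0:
        return 1
    return sum(count_plain(t - 1, nxt) for nxt in STATES)

@lru_cache(maxsize=None)
def count_with_store(t, state, store):
    """Same recursion, but the visited-state history rides along as an argument."""
    if t == 0:
        return 1
    return sum(count_with_store(t - 1, nxt, store + (nxt,)) for nxt in STATES)

count_plain(10, "a")
count_with_store(10, "a", ())

# Number of distinct tabled calls in each version.
plain_entries = count_plain.cache_info().currsize
store_entries = count_with_store.cache_info().currsize
```

After the two calls, `plain_entries` grows linearly with the sequence length while `store_entries` grows exponentially, which is the motivation for moving the store out of the tabled goal.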

Page 76: Efficient Probabilistic Logic Programming for Biological Sequence Analysis

IMPACT OF USING A SEPARATE CONSTRAINT STORE STACK

Page 77: Efficient Probabilistic Logic Programming for Biological Sequence Analysis

GENOME MODELS

I Gene finding in a genomic context
I What are the constraints between adjacent genes in the genome?
  I Extent of (possible) overlap
    I Modeled as hard constraints
  I Gene reading frames, i.e., due to leading strand bias, operons etc.
    I Modeled as (probabilistic) soft constraints

Page 78: Efficient Probabilistic Logic Programming for Biological Sequence Analysis

AN APPLICATION OF CONSTRAINED MARKOV MODELS

We wish to incorporate overlapping gene constraints into gene finding.

Divide and conquer two step approach to gene finding:

1. Gene prediction: A gene finder supplies a set of candidate predictions p1 . . . pn, called the initial set.

2. Pruning: The initial set is pruned according to certain rules or constraints. We call the pruned set the final set.

Page 79: Efficient Probabilistic Logic Programming for Biological Sequence Analysis

PRUNING STEP AS A CONSTRAINT OPTIMIZATION PROBLEM

CSP formulation
We introduce variables X = x1 . . . xn corresponding to the predictions p1 . . . pn in the initial set (sorted by position in genome). All variables have boolean domains, ∀xi ∈ X, D(xi) = {true, false}, and xi = true ⇔ pi ∈ final set.

I Multiple solutions
I We want the “best” solution
I Optimize for prediction confidence scores
I Constraint Optimization Problem (COP)

COP formulation
Let the scores of p1 . . . pn be s1 . . . sn and si ∈ R+.
Maximize $\sum_{i=1}^{n} s_i$, subject to C.
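
A brute-force version of the pruning step can be sketched as below. It is a Python illustration, not the method used in the thesis, and it makes two assumptions that the slide does not state: the objective is read as the sum of the scores of the predictions that end up in the final set, and the constraint set C is taken to be a single containment rule (no kept prediction may fully contain another). Predictions are hypothetical `(left, right)` position pairs.

```python
from itertools import product

def consistent(kept):
    """Illustrative constraint set C: no kept prediction fully contains another."""
    for i, (l1, r1) in enumerate(kept):
        for j, (l2, r2) in enumerate(kept):
            if i != j and l1 <= l2 and r1 >= r2:
                return False
    return True

def solve_cop(predictions, scores):
    """Exhaustive search over all boolean assignments x_1..x_n.

    Returns the best consistent assignment and its total score.
    Exponential in n, so only usable for very small initial sets.
    """
    best_assignment, best_score = None, float("-inf")
    for assignment in product((True, False), repeat=len(predictions)):
        kept = [p for p, x in zip(predictions, assignment) if x]
        if not consistent(kept):
            continue
        # Assumed objective: only predictions with x_i = true contribute.
        total = sum(s for s, x in zip(scores, assignment) if x)
        if total > best_score:
            best_assignment, best_score = assignment, total
    return best_assignment, best_score
```

For example, with predictions `[(0, 100), (10, 50), (60, 90), (120, 200)]` and scores `[5.0, 3.0, 3.5, 2.0]`, the long first prediction is dropped because the three predictions it would exclude or coexist with score higher together.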

Page 80: Efficient Probabilistic Logic Programming for Biological Sequence Analysis

COP IMPLEMENTATION WITH MARKOV CHAIN (1)

We propose to use a (constrained) Markov chain for the COP.

I The Markov chain has a begin state, an end state and two states for each variable xi corresponding to its boolean domain D(xi).

I The state corresponding to D(xi) = true is denoted αi and the state corresponding to D(xi) = false is denoted βi.

I In this model, a path from the begin state to the end state corresponds to a potential solution of the CSP.

Page 81: Efficient Probabilistic Logic Programming for Biological Sequence Analysis

FROM CONFIDENCE SCORES TO TRANSITION PROBABILITIES

I P(α1|begin) = σ1
I P(β1|begin) = 1 − σ1
I P(end|αn) = P(end|βn) = 1
I P(αi|αi−1) = P(αi|βi−1) = σi
I P(βi|αi−1) = P(βi|βi−1) = 1 − σi

$$\sigma_i = 0.5 + \lambda + \frac{(0.5 - \lambda) \times (s_i - \min(s_1 \ldots s_n))}{\max(s_1 \ldots s_n) - \min(s_1 \ldots s_n)}$$
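
The rescaling from scores to σ values is a linear map onto the interval from 0.5 + λ to 1, so the lowest-scoring prediction still has a better than even chance of being kept. A minimal Python sketch of the formula follows; the function name, the default λ of 0.1, and the guard for identical scores are assumptions added for illustration.

```python
def transition_probabilities(scores, lam=0.1):
    """Rescale confidence scores s_i linearly into sigma_i in [0.5 + lam, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # The formula divides by max - min, so identical scores are undefined.
        raise ValueError("scores must not all be equal (the range would be zero)")
    return [0.5 + lam + (0.5 - lam) * (s - lo) / (hi - lo) for s in scores]
```

For scores `[1.0, 2.0, 3.0]` and λ = 0.1 this gives σ values of approximately 0.6, 0.8 and 1.0, which are then used as P(αi|·), with 1 − σi as P(βi|·).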

Page 82: Efficient Probabilistic Logic Programming for Biological Sequence Analysis

ENCODING CONSTRAINTS WITH CONSTRAINT HANDLING RULES

Constraints: alpha/2 and beta/2 ≈ visited states.

Example: Genemark inconsistency rules

alpha(Left1,Right1), alpha(Left2,Right2) <=>
    Left1 =< Left2, Right1 >= Right2 | fail.

beta(Left1,Right1), alpha(Left2,Right2) <=>
    Left1 =< Left2, Right1 >= Right2 | fail.

I The most probable consistent path is found using PRISM's generic adaptation of the Viterbi algorithm

I Each step adds either an alpha or a beta (active) constraint

I Incremental pruning: For each step we only apply constraints which may be transitively involved in rules with the active constraint
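
The search for the most probable consistent path can be illustrated with the following Python sketch. It is not PRISM's Viterbi adaptation: a plain depth-first search with a probability bound stands in for it, only the alpha/alpha containment rule is encoded, and the function names and `(left, right)` prediction format are assumptions. What it does share with the approach above is that each step adds one alpha or beta choice and that an inconsistent prefix is abandoned immediately.

```python
def violates(kept, new):
    """Containment rule: a kept prediction may not fully contain another kept one."""
    l2, r2 = new
    return any((l1 <= l2 and r1 >= r2) or (l2 <= l1 and r2 >= r1) for l1, r1 in kept)

def best_consistent_path(predictions, sigmas):
    """Depth-first search over alpha/beta choices, pruning inconsistent prefixes.

    Returns (probability, path); path is None if no path has positive probability.
    """
    best = (0.0, None)

    def extend(i, kept, path, prob):
        nonlocal best
        if prob <= best[0]:
            return  # probabilities only shrink, so this prefix cannot win
        if i == len(predictions):
            best = (prob, tuple(path))
            return
        # alpha_i: keep prediction i, if it is consistent with what is kept so far
        if not violates(kept, predictions[i]):
            extend(i + 1, kept + [predictions[i]], path + ["alpha"], prob * sigmas[i])
        # beta_i: drop prediction i
        extend(i + 1, kept, path + ["beta"], prob * (1 - sigmas[i]))

    extend(0, [], [], 1.0)
    return best
```

With predictions `[(0, 100), (10, 50), (120, 200)]` and σ values `[0.9, 0.8, 0.7]`, the second prediction lies inside the first, so the best consistent path keeps the first and third and drops the second.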

Page 83: Efficient Probabilistic Logic Programming for Biological Sequence Analysis

EXPERIMENTAL RESULTS

I Prediction on E. coli using a simplistic codon frequency based gene finder.

I Pruning using our global optimization approach (with all inconsistency rules) versus local heuristic rules².

Method                #predictions  Sensitivity  Specificity  Time (seconds)
initial set           10799         0.7625       0.2926       na
Genemark rules        5823          0.7558       0.5379       1.4
ECOGENE rules         4981          0.7148       0.5947       1.7
global optimization   5222          0.7201       0.5714       75

I Sensitivity = fraction of known reference genes predicted.

I Specificity = fraction of predicted genes that are correct.

² Note that the results for the ECOGENE heuristic may vary depending on execution strategy; in the results above, predictions with lower left position are considered first.
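
The two measures as defined above can be computed from sets of gene identifiers. The Python sketch below assumes that a prediction counts as correct only when it matches a reference gene exactly; the slides do not say how matches were determined, so that criterion and the function name are illustrative.

```python
def sensitivity_specificity(predicted, reference):
    """Both arguments are sets of gene identifiers, e.g. (left, right, strand) tuples.

    Sensitivity: fraction of reference genes that were predicted.
    Specificity (as defined on the slide): fraction of predictions that are correct.
    """
    true_positives = len(predicted & reference)
    sensitivity = true_positives / len(reference)
    specificity = true_positives / len(predicted)
    return sensitivity, specificity
```

With four predictions of which three match a reference set of five genes, the function returns a sensitivity of 0.6 and a specificity of 0.75.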

Page 84: Efficient Probabilistic Logic Programming for Biological Sequence Analysis

A MODEL FOR THE GENOME-WIDE SEQUENCE OF READING FRAMES

We wish to incorporate gene reading frame constraints into gene finding.

Divide and conquer two step approach to gene finding (again):

1. Gene prediction: A gene finder supplies a set of candidate predictions p1 . . . pn, called the initial set.

2. Pruning: The initial set is pruned according to gene finder confidence scores and the probabilities of adjacent gene reading frames. We call the pruned set the final set.

Page 85: Efficient Probabilistic Logic Programming for Biological Sequence Analysis

METHODOLOGY

I Gene predictions are sorted by stop codon position.

I Gene finder scores are discretized into symbolic values.

I A type of Hidden Markov Model which we call a delete-HMM:
  I A state for each of the six possible reading frames and
  I one delete state

[Figure: delete-HMM state diagram with frame states F1 to F6 and a delete state]

Page 86: Efficient Probabilistic Logic Programming for Biological Sequence Analysis

MODEL

I F1 . . . F6 emissions: Finite set of n symbols δ1 . . . δn corresponding to ranges of prediction scores

I Delete state emission: $P(\delta_i \mid state = delete) = \frac{FP_{\delta_i}}{FP}$

I Frame state transitions: Relative frequency of "observed" adjacent gene reading frame pairs (normalized)

I Transition to delete: $P(delete) = 1 - \frac{TP}{TP+FP}$ (tunable)

[Figure: delete-HMM state diagram with frame states F1 to F6 and a delete state]
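
The delete-state parameters can be estimated from counts, as in the Python sketch below. It assumes that FP with a symbol subscript denotes the number of false positive predictions whose discretized score is that symbol, and that TP and FP without subscript are the totals; the function name, the dictionary layout and the symbol names are hypothetical.

```python
def delete_state_parameters(tp_counts, fp_counts):
    """Estimate delete-state emissions and the transition probability into delete.

    tp_counts / fp_counts map each score symbol to the number of true / false
    positive predictions carrying that symbol (assumed to come from training data).
    """
    tp, fp = sum(tp_counts.values()), sum(fp_counts.values())
    # P(symbol | state = delete) = FP_symbol / FP
    emission = {symbol: count / fp for symbol, count in fp_counts.items()}
    # P(delete) = 1 - TP / (TP + FP); the slides note this value is tunable
    p_delete = 1 - tp / (tp + fp)
    return emission, p_delete
```

For instance, `delete_state_parameters({"low": 10, "high": 30}, {"low": 45, "high": 15})` gives delete emissions of 0.75 for `low` and 0.25 for `high`, and a transition probability into the delete state of about 0.6.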

Page 87: Efficient Probabilistic Logic Programming for Biological Sequence Analysis

RESULTS

[Figure: ROC plot of sensitivity (TPR) against 1−specificity (FPR), x-axis 0.0 to 0.5, y-axis 0.5 to 1.0. Curves: threshold; frameseq trained on Escherichia, Salmonella, Legionella, Bacillus and Thermoplasma]

Page 88: Efficient Probabilistic Logic Programming for Biological Sequence Analysis

CONCLUSIONS

1. To what extent is it possible to use probabilistic logic programming for biological sequence analysis?

2. How can constraints relevant to the domain of biological sequence analysis be combined with probabilistic logic programming?

3. What are the limitations with regard to efficiency and how can these be dealt with?

Page 89: Efficient Probabilistic Logic Programming for Biological Sequence Analysis

CONCLUSIONS

I Commonly used models for biological analysis can be conveniently expressed using probabilistic logic programming

I Probabilistic logic programming is also a powerful tool for experimenting with new kinds of models

I It can support integration of constraints in a variety of ways

I Efficiency is an issue, but with suitable optimizations it is efficient enough for many interesting problems

I It is not merely a powerful abstraction
I A valuable and practical tool for biological sequence analysis

Page 90: Efficient Probabilistic Logic Programming for Biological Sequence Analysis

THANKS