Slides from Ph.D. defense
INTRODUCTION Background A peek into my work Conclusions
Efficient Probabilistic Logic Programming
for Biological Sequence Analysis
Christian Theil Have
Research group PLIS: Programming, Logic and Intelligent Systems
Department of Communication, Business and Information Technologies
Roskilde University
February 6, 2013
OUTLINE
- Introduction
  - Domain
  - Research questions
- Background
  - Gene finding
  - Probabilistic Logic Programming
- A peek into my work
  - Overview of papers
  - The trouble with tabling of structured data
  - Constrained HMMs
  - Applications: Genome models
- Conclusions
INTRODUCTION
BIOLOGICAL SEQUENCE ANALYSIS
- Subfield of bioinformatics
- Analyze biological sequences
  - DNA
  - RNA
  - Proteins
- to understand
  - Features
  - Functions
  - Evolutionary relationships
PROBABILISTIC LOGIC PROGRAMMING
- Declarative programming paradigm
- Ability to express common and complex models used in biological sequence analysis
- Concise expression of complex models
- Separation between logic and control
- Generic inference algorithms
- Transformations
MODELS FOR BIOLOGICAL SEQUENCE ANALYSIS
- Reflect relationships between features of sequence data
- Embody constraints – assumptions about data
- Infer information from data
- Reasoning about uncertainty → probabilities
THE LOST PROJECT
... seeks to improve ease of modeling, accuracy and reliability of sequence analysis by using logic-statistical models ...
Key focus areas:
- The PRISM system
- Prokaryotic gene finding
My Ph.D. project is part of the LoSt project and shares these focus areas.
RESEARCH QUESTIONS
1. To what extent is it possible to use probabilistic logic programming for biological sequence analysis?
2. How can constraints relevant to the domain of biological sequence analysis be combined with probabilistic logic programming?
3. What are the limitations with regard to efficiency and how can these be dealt with?
I believe that these are the central questions that need to be addressed in order to construct useful tools for biological sequence analysis using probabilistic logic programming.
RELATIONS BETWEEN RESEARCH QUESTIONS
1. To what extent is it possible to use probabilistic logic programming for biological sequence analysis?
2. How can constraints relevant to the domain of biological sequence analysis be combined with probabilistic logic programming?
3. What are the limitations with regard to efficiency and how can these be dealt with?
APPROACH
To build and evaluate
- Applications
- Abstractions
- Optimizations
for biological sequence analysis using probabilistic logic programming.
APPROACH
- Applications
  - Deal with relevant biological sequence analysis problems
  - Potentially contribute new knowledge to biology or bioinformatics
  - Direct substantiation with regard to research question 1
- Abstractions
  - Ease modeling
  - Language for incorporating constraints from the domain
  - A higher level of declarativity
  - Focus on problem rather than implementation (model) details
- Optimizations
  - Deal with limitations of probabilistic logic programming that may hinder its use in biological sequence analysis.
  - Efficient inference is a precondition for practical use.
BACKGROUND
- Prokaryotic gene finding
- Probabilistic logic programming
PROKARYOTIC GENE FINDING
Identify regions of DNA which encode proteins:
A (prokaryotic) gene is a consecutive stretch of DNA which
- is transcribed as part of an RNA,
- is translated to a complete protein,
- has a length which is a multiple of three (codons),
- starts with a "start" codon, and
- ends with a "stop" codon.
GENES AND OPEN READING FRAMES
The identification of prokaryotic genes may be decomposed into two distinct problems:
1. Identification of ORFs which contain protein coding genes.
2. Identification of the correct start codon within an ORF.
〈ORF〉 ::= 〈start〉 〈not-stop〉* 〈stop〉
〈start〉 ::= TTG | CTG | ATT | ATC | ATA | ATG | GTG
〈stop〉 ::= TAA | TAG | TGA
〈not-stop〉 ::= AAA | ... | TTT // all codons except those in 〈stop〉
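The grammar above can be read as a scanning procedure. The following is a minimal Python sketch, not code from the thesis: the function name `find_orfs` and the choice to report only the longest ORF per stop codon are assumptions made for illustration, and only the given strand is scanned.

```python
# Codon sets copied from the grammar above.
START = {"TTG", "CTG", "ATT", "ATC", "ATA", "ATG", "GTG"}
STOP = {"TAA", "TAG", "TGA"}

def find_orfs(dna):
    """Return sorted (start, end) index pairs (end exclusive) of ORFs.

    Assumption: for each stop codon, only the longest ORF is reported,
    i.e. the one beginning at the leftmost in-frame start codon.
    """
    orfs = []
    for frame in range(3):
        start = None  # leftmost start codon since the last in-frame stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if codon in STOP:
                if start is not None:
                    orfs.append((start, i + 3))
                start = None
            elif codon in START and start is None:
                start = i
    return sorted(orfs)
```

For example, `find_orfs("CCATGAAATAGCC")` reports the single ORF `ATG AAA TAG` at indices 2 to 11. Choosing the correct start codon within such an ORF is the second problem listed above and is not addressed by this sketch.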
SIGNALS FOR PROKARYOTIC GENE FINDING
- Open reading frames
  - Length
  - Nucleotide sequence composition
  - Conservation (sequence similarity in other organisms)
- Local context
  - Promoters
  - Ribosomal binding site
  - Termination signal
[Figure: local context of a prokaryotic gene. Labels, left to right: GB, -35, PB, -10, +1 (tss), SD, ≈ +10, Gene, ≈ +15-20, Terminator.]
READING FRAMES AND OVERLAPPING GENES
- RNA can be transcribed from either strand
- Genes may start in different "reading frames"
- Genes can overlap
  - in the same and in different reading frames
  - on opposite strands
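To make the notion of reading frames concrete, here is a small Python sketch that splits a sequence into codons for all six frames (three offsets on each strand). The frame labels `+1` to `-3` and the function names are illustrative assumptions, not notation from the slides.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(dna):
    """Sequence of the opposite strand, read in its own 5' to 3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))

def six_frames(dna):
    """Map frame labels (+1..+3, -1..-3) to the list of complete codons."""
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            codons = [strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3)]
            frames[f"{sign}{offset + 1}"] = codons
    return frames
```

Two genes overlap in different frames when their codons cover the same nucleotides under different labels.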
PROBABILISTIC LOGIC PROGRAMMING
- Logic programming and Prolog
- Probabilistic logic programming and PRISM
LOGIC PROGRAMMING AND PROLOG
A Prolog program consists of a finite sequence of rules,
B :- A1, ..., An.
These rules define implications, i.e.,
B if A1 and ... and An
TERMS, LITERALS AND VARIABLES
Literals can consist of (possibly) structured terms that may include variables.
number(0).
number(s(X)) :- number(X).
Here number(0). is a fact, 0 is a constant, s(X) is a structured term, and X is a variable.
RESOLUTION
Problems are stated as theorems (goals) to be proved, e.g.,
number(X)
To prove a consequent, we recursively need to prove the antecedents by using rules where these appear as consequents,
number(0).
number(s(X)) :-
    number(X).
Solutions
number(X) → X = 0
number(X) → X = s(0)
number(X) → X = s(s(0))
...
Derivation
number(X) → number(s(X)) → number(s(s(X))) → number(s(s(0)))
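The order in which solutions appear can be imitated with a recursive generator. This Python sketch is only an analogy: it mirrors the two clauses of number/1 and the order of answers, not Prolog's unification machinery, and the `depth_limit` parameter is an assumption added so the enumeration terminates.

```python
def number(depth_limit):
    """Yield Peano numerals as strings, in the order Prolog would find them."""
    # Clause 1: number(0).
    yield "0"
    # Clause 2: number(s(X)) :- number(X).
    if depth_limit > 0:
        for x in number(depth_limit - 1):
            yield f"s({x})"
```

Calling `list(number(2))` gives the three solutions listed on the slide.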
DERIVATIONS AND EXPLANATION GRAPHS
Consider the following program which adds natural numbers:
add(0+0,0).
add(A+s(B),s(C)) :-
    add(A+B,C).
add(s(A)+B,s(C)) :-
    add(A+B,C).
And suppose we call the goal
add(s(s(0))+s(s(0)),R)
We now have two alternative applicable clauses (the second and the third), resulting in either
add(s(s(0))+s(0),C)
or
add(s(0)+s(s(0)),C)
with R = s(C).
DERIVATION TREE
s(s(0))+s(s(0))
  s(0)+s(s(0))
    0+s(s(0))
      0+s(0)
        0+0
    s(0)+s(0)
      0+s(0)
        0+0
      s(0)+0
        0+0
  s(s(0))+s(0)
    s(0)+s(0)
      0+s(0)
        0+0
      s(0)+0
        0+0
    s(s(0))+0
      s(0)+0
        0+0
Exponential!
EXPLANATION GRAPH
Polynomial: O(n · m), but would be O(n + m) if arguments were ordered by size.
TABLING
Idea
- The system maintains a table of calls and their answers.
- When a new call is entered, check if it is stored in the table.
- If so, use the previously found solution.
Consequence:
- Explanation graph representation.
- Significant speed-up of program execution.
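The effect of tabling on the add/2 example can be sketched with memoization in Python. This is an analogy under stated assumptions: plain integers stand in for the Peano numerals, `functools.lru_cache` plays the role of the call table, and the names `tree_nodes`, `tabled` and `calls` are made up for the illustration.

```python
from functools import lru_cache

def tree_nodes(a, b):
    """Number of nodes in the derivation tree for add(a+b, R) without tabling."""
    nodes = 1
    if b > 0:
        nodes += tree_nodes(a, b - 1)  # clause add(A+s(B), s(C))
    if a > 0:
        nodes += tree_nodes(a - 1, b)  # clause add(s(A)+B, s(C))
    return nodes

calls = []  # records every subgoal that is actually evaluated

@lru_cache(maxsize=None)
def tabled(a, b):
    """Same recursion, but each distinct subgoal is evaluated only once."""
    calls.append((a, b))
    if b > 0:
        tabled(a, b - 1)
    if a > 0:
        tabled(a - 1, b)
    return a + b
```

For the goal with both arguments equal to two, the untabled tree has 19 nodes, while the tabled version evaluates only the 9 distinct subgoals, in line with the O(n · m) bound for the explanation graph.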
PROBABILISTIC LOGIC PROGRAMMING
Probabilistic logic programming is a form of logic programming which deals with uncertainty.
Assign a probability to each possible derivation in a logic program.
Probabilistic inference, e.g.,
- Derive the probability of a goal
- Infer the most probable derivation of a goal
- Learn the affinities for different derivations from data
PRISM
- PRogramming In Statistical Modelling is a framework for probabilistic logic programming
- Developed by collaboration partners of the Lost project: Yoshitaka Kameya, Taisuke Sato, and Neng-Fa Zhou.
- An extension of Prolog with random variables, called MSWs
- Provides efficient generalized inference algorithms
- PRISM program = probabilistic model
HIDDEN MARKOV MODEL EXAMPLE
Postcard
Greetings from wherever, where I am having a great time. Here is what I have been doing: The first two days, I stayed at the hotel reading a good book. Then, on the third day I decided to go shopping. The next three days I did nothing but lie on the beach. On my last day, I went shopping for some gifts to bring home and wrote you this postcard.
Sincerely, Some friend of yours
Observation sequence
HIDDEN MARKOV MODEL RUN
Definition
A run of an HMM is a pair consisting of a sequence of states $s^{(0)} s^{(1)} \ldots s^{(n)}$, called a path, and a corresponding sequence of emissions $e^{(1)} \ldots e^{(n)}$, called an observation, such that
- $s^{(0)} = s_0$;
- $\forall i,\ 0 \le i \le n-1:\ p(s^{(i)}; s^{(i+1)}) > 0$ (probability to transit from $s^{(i)}$ to $s^{(i+1)}$);
- $\forall i,\ 0 < i \le n:\ p(s^{(i)}; e^{(i)}) > 0$ (probability to emit $e^{(i)}$ from $s^{(i)}$).
Definition
The probability of such a run is defined as
$\prod_{i=1}^{n} p(s^{(i-1)}; s^{(i)}) \cdot p(s^{(i)}; e^{(i)})$
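The product in the definition translates directly into a loop. The Python sketch below assumes transition and emission probabilities are stored in nested dictionaries; the numbers in `TRANS` and `EMIT` are invented for the weather example and do not come from the slides.

```python
def run_probability(path, observation, trans, emit):
    """Probability of a run: path = [s0, s1, ..., sn], observation = [e1, ..., en]."""
    if len(path) != len(observation) + 1:
        raise ValueError("path must contain one more element than observation")
    prob = 1.0
    for i in range(1, len(path)):
        # p(s(i-1); s(i)) * p(s(i); e(i))
        prob *= trans[path[i - 1]][path[i]] * emit[path[i]][observation[i - 1]]
    return prob

# Hypothetical parameters, chosen only for illustration.
TRANS = {"start": {"sunny": 0.5, "rainy": 0.5},
         "sunny": {"sunny": 0.8, "rainy": 0.2},
         "rainy": {"sunny": 0.4, "rainy": 0.6}}
EMIT = {"sunny": {"shop": 0.3, "beach": 0.6, "read": 0.1},
        "rainy": {"shop": 0.4, "beach": 0.1, "read": 0.5}}
```

With these numbers, the run with path start, rainy, sunny and observation read, beach has probability 0.5 · 0.5 · 0.4 · 0.6 = 0.06.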
DECODING WITH HIDDEN MARKOV MODELS
Infer the hidden path given the observation sequence.
$\arg\max_{path} P(path \mid observation)$
[Figure omitted; source: Wikipedia]
→ The Viterbi algorithm
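A compact dictionary-based version of the Viterbi algorithm is sketched below in Python. It is a textbook formulation, not PRISM's generic implementation; the parameter values are invented for the weather example, and a non-empty observation is assumed.

```python
def viterbi(observation, trans, emit, start="start"):
    """Return (probability, path) of the most probable state path.

    Assumes a non-empty observation; missing dictionary entries count as 0.
    """
    states = list(emit)
    # best[s] = (probability, path) of the best run ending in state s
    best = {s: (trans[start].get(s, 0.0) * emit[s].get(observation[0], 0.0), [s])
            for s in states}
    for symbol in observation[1:]:
        new_best = {}
        for s in states:
            prob, path = max(
                ((best[r][0] * trans[r].get(s, 0.0), best[r][1]) for r in states),
                key=lambda pair: pair[0])
            new_best[s] = (prob * emit[s].get(symbol, 0.0), path + [s])
        best = new_best
    return max(best.values(), key=lambda pair: pair[0])

# Hypothetical parameters, chosen only for illustration.
TRANS = {"start": {"sunny": 0.5, "rainy": 0.5},
         "sunny": {"sunny": 0.8, "rainy": 0.2},
         "rainy": {"sunny": 0.4, "rainy": 0.6}}
EMIT = {"sunny": {"shop": 0.3, "beach": 0.6, "read": 0.1},
        "rainy": {"shop": 0.4, "beach": 0.1, "read": 0.5}}
```

For the observation read, beach these parameters give the path rainy, sunny with probability 0.06. Because the joint probability of path and observation is proportional to P(path | observation), maximizing it yields the same path.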
EXAMPLE HMM IN PRISM
values/2: declares the outcomes of random variables
msw/2: simulates a random variable, stochastically selecting one of the outcomes
Model in Prolog: specifies the relation between variables
Example HMM in PRISM
% Outcomes of the transition and emission random variables.
values(trans(_), [sunny,rainy]).
values(emit(_), [shop,beach,read]).

% Generate (or explain) an observation sequence L of fixed length.
hmm(L) :-
    run_length(T),
    hmm(T,start,L).

hmm(0,_,[]).
hmm(T,State,[Emit|EmitRest]) :-
    T > 0,
    msw(trans(State),NextState),
    msw(emit(NextState),Emit),
    T1 is T-1,
    hmm(T1,NextState,EmitRest).

run_length(7).
A PEEK INTO MY WORK
- Overview of papers
- A few selected cases:
  - An abstraction: Constrained HMMs (also an optimization)
  - An optimization: Regarding tabling of structured data
  - A couple of applications: Genome models
    - Using constrained probabilistic models for gene finding with overlapping genes
    - Gene finding with a probabilistic model for the genome-sequence of reading frames
PAPERS 1
1. Henning Christiansen, Christian Theil Have, Ole Torp Lassen and Matthieu Petit
   Taming the Zoo of Discrete HMM Subspecies & some of their Relatives
   Frontiers in Artificial Intelligence and Applications, 2011
2. Henning Christiansen, Christian Theil Have, Ole Torp Lassen and Matthieu Petit
   Inference with constrained hidden Markov models in PRISM
   Theory and Practice of Logic Programming, 2010
3. Christian Theil Have
   Constraints and Global Optimization for Gene Prediction Overlap Resolution
   Workshop on Constraint Based Methods for Bioinformatics, 2011
PAPERS 2
4. Henning Christiansen, Christian Theil Have, Ole Torp Lassen and Matthieu Petit
   The Viterbi Algorithm expressed in Constraint Handling Rules
   7th International Workshop on Constraint Handling Rules, 2010
5. Christian Theil Have and Henning Christiansen
   Modeling Repeats in DNA Using Probabilistic Extended Regular Expressions
   Frontiers in Artificial Intelligence and Applications, 2011
6. Henning Christiansen, Christian Theil Have, Ole Torp Lassen and Matthieu Petit
   Bayesian Annotation Networks for Complex Sequence Analysis
   Technical Communications of the 27th International Conference on Logic Programming (ICLP'11)
PAPERS 3
7. Henning Christiansen, Christian Theil Have, Ole Torp Lassen and Matthieu Petit
   A declarative pipeline language for big data analysis
   Presented at LOPSTR, 2012
8. Christian Theil Have and Henning Christiansen
   Efficient Tabling of Structured Data Using Indexing and Program Transformation
   Practical Aspects of Declarative Languages, 2012
9. Neng-Fa Zhou and Christian Theil Have
   Efficient tabling of structured data with enhanced hash-consing
   Theory and Practice of Logic Programming, 2012
PAPERS 4
10. Christian Theil Have and Søren Mørk
    A Probabilistic Genome-Wide Gene Reading Frame Sequence Model
    Submitted to PLOS One, 2012
11. Christian Theil Have, Sine Zambach and Henning Christiansen
    Effects of using Coding Potential, Sequence Conservation and mRNA Structure Conservation for Predicting Pyrrolysine Containing Genes
    Submitted to BMC Bioinformatics, 2012
THE TROUBLE WITH TABLING OF STRUCTURED DATA
An innocent looking predicate: last/2
last([X],X).
last([_|L],X) :-
    last(L,X).
- Traverses a list to find the last element.
- Time/space complexity: O(n).
- If we table last/2: n + (n−1) + (n−2) + ... + 1 ≈ O(n²)!
call: last([1,2,3,4,5],X)
last([1,2,3,4,5],X) last([1,2,3,4],X) last([1,2,3],X) last([1,2],X) last([1],X)
call table
last([1,2,3,4,5],X).
last([1,2,3,4],X).
last([1,2,3],X).
last([1,2],X).
last([1],X).
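The quadratic cost comes from storing a full copy of every list argument in the call table. A short Python sketch of that arithmetic, with function names made up for the illustration:

```python
def tabled_call_sizes(n):
    """Lengths of the list arguments stored in the call table for a list of length n."""
    return [n - k for k in range(n)]  # n, n-1, ..., 1

def table_cells(n):
    """Total number of list cells copied into the table: n + (n-1) + ... + 1."""
    return sum(tabled_call_sizes(n))
```

For the five-element list on the slide this gives 5 + 4 + 3 + 2 + 1 = 15 cells, and in general n(n+1)/2.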
A WORKAROUND IMPLEMENTED IN PROLOG
We describe a workaround giving O(1) time and space complexity for table lookups for programs with arbitrarily large ground structured data as input arguments.
- A term is represented as a set of facts.
- A subterm is referenced by a unique integer serving as an abstract pointer.
- Matching related to tabling is done solely by comparison of such pointers.
AN ABSTRACT DATA TYPE
store_term(+ground-term, pointer)
    The ground-term is any ground term, and the pointer returned is a unique reference (an integer) for that term.
retrieve_term(+pointer, ?functor, ?arg-pointers-list)
    Returns the functor and a list of pointers to representations of the substructures of the term represented by pointer.
ADT EXAMPLE
The following call converts the term f(a,g(b)) into its internal representation and returns a pointer value in the variable P.
store_term(f(a,g(b)),P).
Implementation with assert, e.g.,
retrieve_term(100,f,[101,102]).
retrieve_term(101,a,[]).
retrieve_term(102,g,[103]).
retrieve_term(103,b,[]).
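The abstract data type can be mimicked in Python to show how the facts above arise. This is a sketch under assumptions, not the Prolog implementation: terms are encoded as tuples `(functor, arg1, ...)`, atoms as strings, a dictionary stands in for the asserted facts, and identical subterms are not shared.

```python
class TermStore:
    """Stores ground terms as pointer-indexed facts, one fact per subterm."""

    def __init__(self, first_pointer=100):
        self._next = first_pointer
        self._facts = {}  # pointer -> (functor, [argument pointers])

    def store_term(self, term):
        """Store a term and return its integer pointer (pre-order numbering)."""
        pointer = self._next
        self._next += 1
        if isinstance(term, tuple):
            functor, args = term[0], term[1:]
        else:
            functor, args = term, ()
        arg_pointers = [self.store_term(arg) for arg in args]
        self._facts[pointer] = (functor, arg_pointers)
        return pointer

    def retrieve_term(self, pointer):
        """Return (functor, argument pointers) for the term behind a pointer."""
        return self._facts[pointer]
```

Storing `("f", "a", ("g", "b"))` in a fresh store returns pointer 100 and produces the same four facts as in the example. A tabled predicate then only needs to compare integers, which takes constant time regardless of the size of the term.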
AN AUTOMATIC PROGRAM TRANSFORMATION
We introduce an automatic transformation:
Structured terms are moved from the head of clauses to calls in the body to retrieve_term/2.
TRANSFORMED HIDDEN MARKOV MODEL
original version
hmm(_,[]).
hmm(S,[Ob|Y]) :-
    msw(out(S),Ob),
    msw(tr(S),Next),
    hmm(Next,Y).
transformed version
hmm(S,ObsPtr) :-
    retrieve_term(ObsPtr,[]).
hmm(S,ObsPtr) :-
    retrieve_term(ObsPtr,[Ob,Y]),
    msw(out(S),Ob),
    msw(tr(S),Next),
    hmm(Next,Y).
BENCHMARKING RESULTS
[Figure: two plots of running time (seconds) against sequence length (0 to 5000). a) Running time with indexed lookup (y-axis from 0.00 to 0.08 seconds). b) Running time without indexed lookup (y-axis from 0 to 140 seconds).]
THE NEXT STEP
Integration at the Prolog engine implementation level.
Neng-Fa Zhou and Christian Theil Have
Efficient tabling of structured data with enhanced hash-consing
Theory and Practice of Logic Programming, 2012
- Full sharing between tables (call and answer)
- Sharing with structured data in call stack
CONSTRAINED HMMS
Definition
A constrained HMM (CHMM)
- is an HMM
- extended with a set of constraints C, each of which is a mapping from HMM runs into {true, false}.
CONSTRAINED HMMS
Why extend an HMM with side-constraints?
- Convenient to express knowledge in terms of constraints
- Reuse underlying model with different assumptions
- Some constraints are not feasible as model structure (e.g. all_different)
- Fewer paths to consider for any given sequence → decreased running time (under certain conditions)
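The definition of a constrained HMM can be illustrated by decoding over only those runs that satisfy every constraint. The Python sketch below enumerates all paths by brute force, which is exponential and serves only to show the semantics; it is not the PRISM-based inference described in the talk. The parameters and the constraint `at_most_one_sunny` are hypothetical.

```python
from itertools import product

def constrained_decode(observation, trans, emit, constraints, start="start"):
    """Most probable path among the runs accepted by every constraint."""
    best_prob, best_path = 0.0, None
    for path in product(list(emit), repeat=len(observation)):
        # Each constraint maps a run (path, observation) to True or False.
        if not all(c(path, observation) for c in constraints):
            continue
        prob, prev = 1.0, start
        for state, symbol in zip(path, observation):
            prob *= trans[prev].get(state, 0.0) * emit[state].get(symbol, 0.0)
            prev = state
        if prob > best_prob:
            best_prob, best_path = prob, list(path)
    return best_prob, best_path

def at_most_one_sunny(path, observation):
    """Hypothetical cardinality-style constraint on the visited states."""
    return path.count("sunny") <= 1

# Hypothetical parameters, chosen only for illustration.
TRANS = {"start": {"sunny": 0.5, "rainy": 0.5},
         "sunny": {"sunny": 0.8, "rainy": 0.2},
         "rainy": {"sunny": 0.4, "rainy": 0.6}}
EMIT = {"sunny": {"shop": 0.3, "beach": 0.6, "read": 0.1},
        "rainy": {"shop": 0.4, "beach": 0.1, "read": 0.5}}
```

Without constraints, the observation beach, beach decodes to sunny, sunny; with `at_most_one_sunny` the best remaining path is rainy, sunny. The underlying model is unchanged, only the set of admissible runs shrinks.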
ALIGNMENT WITH A CONSTRAINED PAIR HMM
In a biological context, we may want to only consider alignments with a limited number of insertions and deletions given the assumption that the two sequences are closely related.
ALIGNMENT WITH CONSTRAINTS
ADDING CONSTRAINT CHECKING TO THE HMM
HMM with constraint checking
hmm(T,State,[Emit|EmitRest],StoreIn) :-
    T > 0,
    msw(trans(State),NxtState),
    msw(emit(NxtState),Emit),
    check_constraints([NxtState,Emit],StoreIn,StoreOut),
    T1 is T-1,
    hmm(T1,NxtState,EmitRest,StoreOut).
- Call to check_constraints/3 after each distinct sequence of msw applications
- Side-constraints: The constraints are assumed to be declared elsewhere and not interleaved with model specification
- Extra Store argument in the probabilistic predicate
A LIBRARY OF GLOBAL CONSTRAINTS FOR HIDDEN MARKOV MODELS
Our implementation contains a few well-known global constraints adapted to Hidden Markov Models.
Global constraints
- cardinality
- lock_to_sequence
- all_different
- lock_to_set
In addition, the implementation provides operators which may be used to apply constraints to a limited set of variables.
Constraint operators
- state_specific
- emission_specific
- forall_subseq (sliding window operator)
- for_range (time step range operator)
TABLING ISSUES
Problem: The extra Store argument makes PRISM table multiple goals (for different constraint stores) when it should only store one.
hmm(T,State,[Emit|EmitRest],Store)
To get rid of the extra argument, check_constraints dynamically maintains it as a stack using assert/retract.
Note: This is not a sound solution for all types of constraints (some need tabling).
IMPACT OF USING A SEPARATE CONSTRAINT STORE STACK
GENOME MODELS
- Gene finding in a genomic context
- What are the constraints between adjacent genes in the genome?
  - Extent of (possible) overlap
    - Modeled as hard constraints
  - Gene reading frames, i.e., due to leading strand bias, operons etc.
    - Modeled as (probabilistic) soft constraints
AN APPLICATION OF CONSTRAINED MARKOV MODELS
We wish to incorporate overlapping gene constraints into gene finding.
Divide and conquer two step approach to gene finding:
1. Gene prediction: A gene finder supplies a set of candidate predictions p1 ... pn, called the initial set.
2. Pruning: The initial set is pruned according to certain rules or constraints. We call the pruned set the final set.
PRUNING STEP AS A CONSTRAINT OPTIMIZATION PROBLEM
CSP formulation
We introduce variables $X = x_1 \ldots x_n$ corresponding to each prediction $p_1 \ldots p_n$ in the initial set (sorted by position in the genome). All variables have boolean domains: $\forall x_i \in X,\ D(x_i) = \{\mathrm{true}, \mathrm{false}\}$, and $x_i = \mathrm{true} \Leftrightarrow p_i \in$ final set.
- Multiple solutions
- We want the "best" solution
- Optimize for prediction confidence scores
- Constraint Optimization Problem (COP)
COP formulation
Let the scores of $p_1 \ldots p_n$ be $s_1 \ldots s_n$ with $s_i \in \mathbb{R}^+$.
Maximize $\sum_{i=1}^{n} s_i$, subject to C.
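A brute-force Python sketch of the COP follows. It assumes that the sum is taken over the predictions kept in the final set (those with x_i = true), and it uses a made-up constraint, `no_adjacent_pair`, as a stand-in for C. Enumerating all 2^n assignments is only feasible for tiny inputs and is not the method used in the talk.

```python
from itertools import product

def solve_cop(scores, consistent):
    """Maximise the summed score of the kept predictions subject to `consistent`."""
    best_total, best_assignment = float("-inf"), None
    for assignment in product([True, False], repeat=len(scores)):
        if not consistent(assignment):
            continue
        total = sum(s for s, keep in zip(scores, assignment) if keep)
        if total > best_total:
            best_total, best_assignment = total, assignment
    return best_total, best_assignment

def no_adjacent_pair(assignment):
    """Hypothetical stand-in for C: neighbouring predictions cannot both be kept."""
    return not any(a and b for a, b in zip(assignment, assignment[1:]))
```

With scores 3, 5 and 4, keeping the first and third prediction (total 7) beats keeping only the second (total 5).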
COP IMPLEMENTATION WITH MARKOV CHAIN (1)
We propose to use a (constrained) Markov chain for the COP.
- The Markov chain has a begin state, an end state and two states for each variable xi corresponding to its boolean domain D(xi).
- The state corresponding to D(xi) = true is denoted αi and the state corresponding to D(xi) = false is denoted βi.
- In this model, a path from the begin state to the end state corresponds to a potential solution of the CSP.
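FROM CONFIDENCE SCORES TO TRANSITION PROBABILITIES
- $P(\alpha_1 \mid begin) = \sigma_1$
- $P(\beta_1 \mid begin) = 1 - \sigma_1$
- $P(end \mid \alpha_n) = P(end \mid \beta_n) = 1$
- $P(\alpha_i \mid \alpha_{i-1}) = P(\alpha_i \mid \beta_{i-1}) = \sigma_i$
- $P(\beta_i \mid \alpha_{i-1}) = P(\beta_i \mid \beta_{i-1}) = 1 - \sigma_i$
$\sigma_i = 0.5 + \lambda + \frac{(0.5 - \lambda) \times (s_i - \min(s_1 \ldots s_n))}{\max(s_1 \ldots s_n) - \min(s_1 \ldots s_n)}$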
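The rescaling of scores into transition probabilities is easy to check numerically. The Python sketch below implements the formula for σ as written; the function name and the error raised when all scores are equal (which would make the denominator zero) are assumptions added for the illustration.

```python
def transition_probabilities(scores, lam):
    """sigma_i = 0.5 + lam + (0.5 - lam) * (s_i - min) / (max - min)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        raise ValueError("scores must not all be equal")
    return [0.5 + lam + (0.5 - lam) * (s - lo) / (hi - lo) for s in scores]
```

The lowest score maps to 0.5 + λ and the highest to 1, so every prediction is more likely to be kept than dropped, and higher scores make keeping it more likely.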
ENCODING CONSTRAINTS WITH CONSTRAINT HANDLING RULES
Constraints: alpha/2 and beta/2 ≈ visited states.
Example: Genemark inconsistency rules
alpha(Left1,Right1), alpha(Left2,Right2) <=>
    Left1 =< Left2, Right1 >= Right2 | fail.
beta(Left1,Right1), alpha(Left2,Right2) <=>
    Left1 =< Left2, Right1 >= Right2 | fail.
- The most probable consistent path is found using PRISM's generic adaptation of the Viterbi algorithm
- Each step adds either an alpha or a beta (active) constraint
- Incremental pruning: For each step we only apply constraints which may be transitively involved in rules with the active constraint
EXPERIMENTAL RESULTS
- Prediction on E. coli using a simplistic codon frequency based gene finder.
- Pruning using our global optimization approach (with all inconsistency rules) versus local heuristic rules [2].

Method                #predictions   Sensitivity   Specificity   Time (seconds)
initial set           10799          0.7625        0.2926        na
Genemark rules        5823           0.7558        0.5379        1.4
ECOGENE rules         4981           0.7148        0.5947        1.7
global optimization   5222           0.7201        0.5714        75

- Sensitivity = fraction of known reference genes predicted.
- Specificity = fraction of predicted genes that are correct.
[2] Note that the results for the ECOGENE heuristic may vary depending on execution strategy; in the case of the above results, predictions with lower left position are considered first.
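The two measures as defined above can be computed from sets of gene identifiers. In this Python sketch the identifiers are arbitrary hashable values; how a predicted gene is matched to a reference gene (for example by strand and stop codon position) is an assumption left to the caller.

```python
def sensitivity_specificity(predicted, reference):
    """Return (sensitivity, specificity) for two non-empty sets of gene identifiers."""
    correct = len(predicted & reference)
    sensitivity = correct / len(reference)  # fraction of reference genes predicted
    specificity = correct / len(predicted)  # fraction of predictions that are correct
    return sensitivity, specificity
```

Pruning removes predictions, so it can raise specificity while lowering sensitivity, which is the trade-off visible in the table.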
A MODEL FOR THE GENOME-WIDE SEQUENCE OF READING FRAMES
We wish to incorporate gene reading frame constraints into gene finding.
Divide and conquer two step approach to gene finding (again):
1. Gene prediction: A gene finder supplies a set of candidate predictions p1 ... pn, called the initial set.
2. Pruning: The initial set is pruned according to gene finder confidence scores and the probabilities of adjacent gene reading frames. We call the pruned set the final set.
METHODOLOGY
- Gene predictions are sorted by stop codon position.
- Gene finder scores are discretized into symbolic values.
- A type of Hidden Markov Model which we call a delete-HMM:
  - A state for each of the six possible reading frames and
  - one delete state
[Figure: state diagram with the frame states F1 to F6 and the delete state.]
MODEL
- F1 ... F6 emissions: Finite set of n symbols $\delta_1 \ldots \delta_n$ corresponding to ranges of prediction scores
- Delete state emission: $P(\delta_i \mid state = delete) = \frac{FP_{\delta_i}}{FP}$
- Frame state transitions: Relative frequency of "observed" adjacent gene reading frame pairs (normalized)
- Transition to delete: $P(delete) = 1 - \frac{TP}{TP + FP}$ (tunable)
[Figure: state diagram with the frame states F1 to F6 and the delete state.]
RESULTS
[Figure: ROC plot of sensitivity (TPR, 0.5 to 1.0) against 1−specificity (FPR, 0.0 to 0.5). Curves: threshold; frameseq trained on Escherichia; frameseq trained on Salmonella; frameseq trained on Legionella; frameseq trained on Bacillus; frameseq trained on Thermoplasma.]
CONCLUSIONS
1. To what extent is it possible to use probabilistic logic programming for biological sequence analysis?
2. How can constraints relevant to the domain of biological sequence analysis be combined with probabilistic logic programming?
3. What are the limitations with regard to efficiency and how can these be dealt with?
CONCLUSIONS
- Commonly used models for biological analysis can be conveniently expressed using probabilistic logic programming
- Probabilistic logic programming is also a powerful tool for experimenting with new kinds of models
- It can support integration of constraints in a variety of ways
- Efficiency is an issue, but with suitable optimizations it is efficient enough for many interesting problems
- It is not merely a powerful abstraction
  - A valuable and practical tool for biological sequence analysis
THANKS