To recap last episode…
  Hidden Markov Models (HMMs)
  Protein Family Characterization
  Profile HMMs for protein family characterization
  How profile HMMs can do homology search
...picking up where we left off
Profile HMMs were good to start with
Today’s goal: Introduce HMMs as general tools in bioinformatics
I will use the problem of Gene Finding as an example of an “ideal” HMM problem domain
Learning Objectives
When I’m done you should know:
1. When is an HMM a good fit for a problem space?
2. What materials are needed before work can begin with an HMM?
3. What are the advantages and disadvantages of using HMMs?
4. What are the general objectives and challenges in the gene finding task?
Outline
  HMMs as Statistical Models
  The Gene Finding task at a glance
  Good problems for HMMs
  HMM Advantages
  HMM Disadvantages
  Gene Finding Examples
Statistical Models
Definition: Any mathematical construct that attempts to parameterize a random process
Example: A normal distribution
  Assumptions
  Parameters
  Estimation
  Usage
HMMs are just a little more complicated…
HMM Assumptions
  Observations are ordered
  Random process can be represented by a stochastic finite state machine with emitting states.
HMM Parameters
  Using weather example
    Modeling daily weather for a year
    Ra Ra Su Su Su Ra..
  Lots of parameters
    One for each table entry
  Represented in two tables.
    One for emissions
    One for transitions
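The two tables can be sketched in Python as follows. This is a minimal illustration, not the lecture's actual model: the hidden state names `Wet` and `Dry` and every probability are assumptions made up for the example; only the observed symbols `Ra` and `Su` come from the slide.

```python
# Hypothetical two-state weather HMM; all numbers are invented for illustration.
states = ["Wet", "Dry"]   # hidden states (assumed, not from the slides)
symbols = ["Ra", "Su"]    # observed symbols from the slide: rain, sun

# Transition table: transitions[s][t] = P(next state is t | current state is s)
transitions = {
    "Wet": {"Wet": 0.7, "Dry": 0.3},
    "Dry": {"Wet": 0.2, "Dry": 0.8},
}

# Emission table: emissions[s][o] = P(observing o | state is s)
emissions = {
    "Wet": {"Ra": 0.9, "Su": 0.1},
    "Dry": {"Ra": 0.2, "Su": 0.8},
}

def is_row_stochastic(table, tol=1e-9):
    """Return True if every row of the table sums to 1 (within tol)."""
    return all(abs(sum(row.values()) - 1.0) <= tol for row in table.values())

def count_table_entries(table):
    """Number of parameters in a table: one per entry."""
    return sum(len(row) for row in table.values())

n_entries = count_table_entries(transitions) + count_table_entries(emissions)
print(n_entries)  # 8 entries for 2 states and 2 symbols
```

Because each row must sum to 1, the last entry of a row is fixed by the others, so the number of free parameters is smaller than the number of table entries.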
HMM Estimation
Called training, it falls under machine learning
Feed an architecture (given in advance) a set of observation sequences
The training process iteratively alters the model's parameters to fit the training set
The trained model will assign the training sequences high probability
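One common way to do this kind of iterative fitting is Baum-Welch re-estimation; the slides do not name an algorithm, so treat that choice as an assumption. The sketch below trains on a single short sequence rather than a set, uses no numerical scaling (fine only for short sequences), and starts from invented parameters. The first six observations follow the weather slide (Ra=0, Su=1); the rest are made up.

```python
def forward(obs, pi, A, B):
    """alpha[t][i] = P(obs[0..t], state i at time t)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha

def backward(obs, A, B):
    """beta[t][i] = P(obs[t+1..] | state i at time t)."""
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta

def baum_welch_step(obs, pi, A, B):
    """One re-estimation step; also returns P(obs) under the OLD parameters."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    likelihood = sum(alpha[-1])
    # gamma[t][i] = P(state i at time t | obs)
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    # xi[t][i][j] = P(state i at t and state j at t+1 | obs)
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B, likelihood

# Hypothetical training sequence and starting parameters (Ra=0, Su=1).
obs = [0, 0, 1, 1, 1, 0, 0, 1, 1, 1]
pi = [0.5, 0.5]
A = [[0.6, 0.4], [0.3, 0.7]]
B = [[0.7, 0.3], [0.4, 0.6]]

history = []
for _ in range(10):
    pi, A, B, likelihood = baum_welch_step(obs, pi, A, B)
    history.append(likelihood)
```

Each entry of `history` is the probability the model assigned to the training sequence before that update; the values should never decrease, which is the sense in which the trained model comes to assign the training data high probability.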
HMM Usage
Two major tasks
  Evaluate the probability of an observation sequence given the model (Forward)
  Find the most likely path through the model for a given observation sequence (Viterbi)
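Both tasks can be sketched on the weather example. This is an illustrative sketch only: the hidden states `Wet` and `Dry`, the start distribution and all probabilities are assumptions, not values from the lecture.

```python
# Hypothetical weather HMM; states and numbers are invented for illustration.
states = ["Wet", "Dry"]
start = {"Wet": 0.5, "Dry": 0.5}
trans = {"Wet": {"Wet": 0.7, "Dry": 0.3}, "Dry": {"Wet": 0.2, "Dry": 0.8}}
emit = {"Wet": {"Ra": 0.9, "Su": 0.1}, "Dry": {"Ra": 0.2, "Su": 0.8}}

def forward_probability(obs):
    """P(obs | model), summed over all state paths (Forward algorithm)."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {t: sum(alpha[s] * trans[s][t] for s in states) * emit[t][o]
                 for t in states}
    return sum(alpha.values())

def viterbi_path(obs):
    """Return (probability, path) of the single most likely state path (Viterbi)."""
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for t in states:
            # Best way to arrive in state t from any previous state.
            prob, path = max(((best[s][0] * trans[s][t], best[s][1]) for s in states),
                             key=lambda pair: pair[0])
            new_best[t] = (prob * emit[t][o], path + [t])
        best = new_best
    return max(best.values(), key=lambda pair: pair[0])

observations = ["Ra", "Ra", "Su", "Su", "Su", "Ra"]
total = forward_probability(observations)
best_prob, best_path = viterbi_path(observations)
print(best_path)  # ['Wet', 'Wet', 'Dry', 'Dry', 'Dry', 'Wet']
```

Forward answers the scoring question (how probable is this sequence?), while Viterbi answers the labeling question (which state produced each observation?). The Viterbi probability can never exceed the Forward probability, since it covers one path out of all of them.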
Gene Finding (An Ideal HMM Domain)
Our Objective:
  To find the coding and non-coding regions of an unlabeled string of DNA nucleotides
Our Motivation:
  Assist in the annotation of genomic data produced by genome sequencing methods
  Gain insight into the mechanisms involved in transcription, splicing and other processes
Gene Finding Terminology
A string of DNA nucleotides containing a gene will have separate regions (lines):
  Introns – non-coding regions within a gene
  Exons – coding regions
Separated by functional sites (boxes)
  Start and stop codons
  Splice sites – acceptors and donors
Gene Finding Challenges
Need the correct reading frame
  Introns can interrupt an exon in mid-codon
There is no hard and fast rule for identifying donor and acceptor splice sites
  Signals are very weak
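The reading-frame problem can be made concrete with a toy sequence. The sketch below is an assumption-laden illustration: the exon and intron strings are invented, and the helper `codons_in_frame` is a hypothetical name, not part of any gene finder.

```python
def codons_in_frame(dna, frame):
    """Split dna into complete codons, starting at offset frame (0, 1 or 2)."""
    return [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]

# Invented toy gene: the intron interrupts the second codon (GC|T) mid-codon.
exon1 = "ATGGC"
intron = "GTAAGTTTTCAG"
exon2 = "TTAA"

genomic = exon1 + intron + exon2   # what the gene finder actually sees
spliced = exon1 + exon2            # what is translated after splicing

print(codons_in_frame(spliced, 0))  # ['ATG', 'GCT', 'TAA']
print(codons_in_frame(genomic, 0))  # ['ATG', 'GCG', 'TAA', 'GTT', 'TTC', 'AGT', 'TAA']
print(codons_in_frame(spliced, 1))  # ['TGG', 'CTT']
```

Reading straight through the unspliced sequence changes the second codon and hits a stop codon inside the intron, and shifting the frame by one base produces entirely different codons. A gene finder therefore has to track both where the introns are and which codon position each exon resumes at.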
What makes a good HMM problem space?
Characteristics:
Classification problems
  There are two main types of output from an HMM:
    Scoring of sequences (Protein family modeling)
    Labeling of observations within a sequence (Gene Finding)
HMM Problem Characteristics Continued
The observations in a sequence should have a clear and meaningful order
  Unordered observations will not map easily to states
It’s beneficial, but not necessary, for the observations to follow some sort of grammar
  Makes it easier to design an architecture
    Gene Finding
    Protein Family Modeling
HMM Requirements
So you’ve decided you want to build an HMM, here’s what you need:
An architecture
  Probably the hardest part
  Should be biologically sound & easy to interpret
A well-defined success measure
  Necessary for any form of machine learning
HMM Requirements Continued
Training data
  Labeled or unlabeled – it depends
    You do not always need a labeled training set to do observation labeling, but it helps
  Amount of training data needed is:
    Directly proportional to the number of free parameters in the model
    Inversely proportional to the size of the training sequences
Why HMMs might be a good fit for Gene Finding
Classification: Classifying observations within a sequence
Order: A DNA sequence is a set of ordered observations
Grammar / Architecture: Our grammatical structure (and the beginnings of our architecture) is right here:
  [figure not preserved]
Success measure: # of complete exons correctly labeled
Training data: Available from various genome annotation projects
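A success measure like "number of complete exons correctly labeled" can be computed from two label strings. This is a minimal sketch under assumed conventions: the one-letter labels (`E` exon, `I` intron, `N` intergenic) and both function names are hypothetical, and an exon counts as correct only if both of its boundaries match exactly.

```python
from itertools import groupby

def exon_intervals(labels, exon_label="E"):
    """Return the set of half-open (start, end) intervals of exon_label runs."""
    intervals, pos = set(), 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label == exon_label:
            intervals.add((pos, pos + length))
        pos += length
    return intervals

def complete_exons_correct(true_labels, predicted_labels):
    """Count predicted exons whose start and end both match a true exon."""
    return len(exon_intervals(true_labels) & exon_intervals(predicted_labels))

# Invented example: the second predicted exon starts one base too early.
true_labels      = "NNEEEEIIIEEENN"
predicted_labels = "NNEEEEIIEEEENN"
print(complete_exons_correct(true_labels, predicted_labels))  # 1
```

This measure is strict: the prediction above labels 13 of 14 positions correctly, yet only one of the two exons counts. That strictness is why the choice of success measure matters when tuning the training process.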
HMM Advantages
Statistical Grounding
  Statisticians are comfortable with the theory behind hidden Markov models
  Freedom to manipulate the training and verification processes
  Mathematical / theoretical analysis of the results and processes
  HMMs are still very powerful modeling tools – far more powerful than many statistical methods
HMM Advantages continued
Modularity
  HMMs can be combined into larger HMMs
Transparency of the Model
  Assuming an architecture with a good design
  People can read the model and make sense of it
  The model itself can help increase understanding
HMM Advantages continued
Incorporation of Prior Knowledge
Incorporate prior knowledge into the architecture
Initialize the model close to something believed to be correct
Use prior knowledge to constrain training process
How does Gene Finding make use of HMM advantages?
Statistics:
  Many systems alter the training process to better suit their success measure
Modularity:
  Almost all systems use a combination of models, each individually trained for each gene region
Prior Knowledge:
  A fair amount of prior biological knowledge is built into each architecture
HMM Disadvantages
Markov Chains
  States are supposed to be independent
    P(y) must be independent of P(x), and vice versa
  This usually isn’t true
    Can get around it when relationships are local
    Not good for RNA folding problems
P(x) … P(y)
HMM Disadvantages continued
Standard Machine Learning Problems
  Watch out for local maxima
    Model may not converge to a truly optimal parameter set for a given training set
  Avoid over-fitting
    You’re only as good as your training set
    More training is not always good
HMM Disadvantages continued
Speed!!!
Almost everything one does in an HMM involves: “enumerating all possible paths through the model”
There are efficient ways to do this
Still slow in comparison to other methods
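To see why naive enumeration is a problem and what an efficient alternative looks like, the sketch below computes the same sequence probability twice: once by literally enumerating every path, and once with the Forward recursion. The model (states `Wet` and `Dry` and all numbers) is an invented example, not taken from the lecture.

```python
from itertools import product

# Hypothetical weather HMM; states and numbers are invented for illustration.
states = ["Wet", "Dry"]
start = {"Wet": 0.5, "Dry": 0.5}
trans = {"Wet": {"Wet": 0.7, "Dry": 0.3}, "Dry": {"Wet": 0.2, "Dry": 0.8}}
emit = {"Wet": {"Ra": 0.9, "Su": 0.1}, "Dry": {"Ra": 0.2, "Su": 0.8}}

def path_probability(path, obs):
    """Joint probability of one specific state path and the observations."""
    p = start[path[0]] * emit[path[0]][obs[0]]
    for prev, cur, o in zip(path, path[1:], obs[1:]):
        p *= trans[prev][cur] * emit[cur][o]
    return p

def brute_force_probability(obs):
    """Sum over every possible path: len(states) ** len(obs) terms."""
    return sum(path_probability(path, obs)
               for path in product(states, repeat=len(obs)))

def forward_probability(obs):
    """Same quantity via dynamic programming: about len(obs) * len(states)**2 steps."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {t: sum(alpha[s] * trans[s][t] for s in states) * emit[t][o]
                 for t in states}
    return sum(alpha.values())

observations = ["Ra", "Ra", "Su", "Su", "Su", "Ra"]
n_paths = len(states) ** len(observations)   # 64 paths for 6 observations
slow = brute_force_probability(observations)
fast = forward_probability(observations)
```

Both functions return the same number up to floating-point rounding, but the path count doubles with every extra observation for this two-state model, while the Forward recursion grows only linearly with sequence length. Even so, the quadratic cost in the number of states adds up for large gene-finding models and long genomic sequences.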
HMM Gene Finders: VEIL
A straight HMM Gene Finder
Takes advantage of grammatical structure and modular design
Uses many states that can only emit one symbol to get around state independence
HMM Gene Finders: HMMGene
Uses an extended HMM called a CHMM
  CHMM = HMM with classes
Takes full advantage of being able to modify the statistical algorithms
Uses high-order states
Trains everything at once
HMM Gene Finders: Genie
Uses a generalized HMM (GHMM)
  Edges in model are complete HMMs
  States can be any arbitrary program
  States are actually neural networks specially designed for signal finding
Conclusions
HMMs have problems where they excel, and problems where they do not
You should consider using one if:
  Problem can be phrased as classification
  Observations are ordered
  The observations follow some sort of grammatical structure (optional)
Conclusions
Advantages: Statistics, Modularity, Transparency, Prior Knowledge
Disadvantages: State independence, Over-fitting, Local maxima, Speed
Some final words…
Lots of problems can be phrased as classification problems
  Homology search, sequence alignment
If an HMM does not fit, there are all sorts of other methods to try with ML/AI:
  Neural Networks, Decision Trees, Probabilistic Reasoning and Support Vector Machines have all been applied to Bioinformatics