Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence
LO Leung Yau
7th May, 2009
Outline
Biological Background
Objective
Current Approaches
  Various Models
  Problem: Insufficient Data
Proposed Approach
  Predict TFBS from protein sequence
  Predict from protein sequence whether it is TF
  Gene sequence → Protein sequence
Short Term Subtasks
  Better Motif Model
  Better tool to calculate P-value of Patterns
Biological Background – Cell
Basic unit of organisms
  Prokaryotic
  Eukaryotic
A bag of chemicals
Metabolism controlled by various enzymes
Correct working needs suitable amounts of various proteins
Picture taken from http://en.wikipedia.org/wiki/Cell_(biology)
Biological Background – Protein
Polymer of 20 types of amino acids
Folds into 3D structure
Shape determines the function
Many types
  Transcription Factors
  Enzymes
  Structural Proteins
  …
Pictures taken from http://en.wikipedia.org/wiki/Protein and http://en.wikipedia.org/wiki/Amino_acid
Biological Background – DNA & RNA
DNA
  Double stranded
  Adenine, Cytosine, Guanine, Thymine
  A-T, G-C
  Those parts coding for proteins are called genes
RNA
  Single stranded
  Adenine, Cytosine, Guanine, Uracil
Picture taken from http://en.wikipedia.org/wiki/Gene
Biological Background – DNA → RNA → Protein
Picture taken from http://en.wikipedia.org/wiki/Gene
Biological Background – DNA → RNA → Protein
Transcriptional Regulatory Network is the complex interaction between genes, transcription factors (TF) and transcription factor binding sites (TFBS).
Figure labels: Genes, Promoter regions, Binding sites, Transcription Factors, Other functions
Complex Interactions between Genes, TFs and TFBSs
Importance of Inferring Transcriptional Regulatory Network
Revealing the working of a cell and life
Related to many diseases
  Genetic disorders
Understanding them will help us
  Understand the diseases
  Design drugs to cure the diseases
  Engineering genetics
Objective
To infer transcriptional regulatory network (gene network) from genetic
and experimental data, utilizing different data sources as/when
appropriate
Current Approaches
Main Data Source
  Gene Expression Microarray Data
Models
  Parts Lists
  Topology Models
  Control Logic Models
  Dynamic Models
Problem
  Insufficient Data
Gene Expression Microarray Data
High throughput
Measures RNA level
Relies on A-T, G-C pairing
Can monitor expression of many genes
Picture taken from http://en.wikipedia.org/wiki/DNA_microarray_experiment
Gene Expression Microarray Data
Picture taken from http://en.wikipedia.org/wiki/DNA_microarray
Various Models of Transcriptional Regulatory Network (Gene Network)
Different levels of detail
  Parts Lists
  Topology Models
  Control Logic Models
  Dynamic Models
    Boolean Network
    Petri Nets
    Difference and Differential Equations
    Finite State Linear Model (FSLM)
    Stochastic Networks
[86, 87, 88]
Parts List
The basic components of the gene network that we model
Including
  Genes
  Transcription Factors
  Promoters
  Transcription Factor Binding Sites
  …
Topology Models – Example
Control Logic Models
Dynamic Models
Describe and simulate the dynamic changes in the state of the system
Predict the network’s response to various environmental changes and stimuli
  Boolean Network
  Petri Nets
  Difference and Differential Equations
  Hybrid: Finite State Linear Model (FSLM)
  Stochastic Networks
Boolean Network
[42, 93, 1, 55]
Boolean Network – Fission Yeast Example
10 genes, 2^10 = 1024 states
[22]
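To make the state-space idea concrete, here is a minimal sketch of a synchronous Boolean network. The two-gene mutual-repression rules are a hypothetical toy example, not the fission yeast model from [22], and the function names are illustrative only. With n genes there are 2^n states, which is where the 1024 states for 10 genes come from.

```python
from itertools import product

def step(state, rules):
    """Apply every update rule synchronously to get the next state."""
    return tuple(int(rule(state)) for rule in rules)

def find_attractors(rules):
    """Enumerate all 2^n states and collect the cycles they fall into."""
    n = len(rules)
    attractors = set()
    for start in product((0, 1), repeat=n):
        seen = set()
        state = start
        # Follow the trajectory until a state repeats.
        while state not in seen:
            seen.add(state)
            state = step(state, rules)
        # 'state' now lies on a cycle; walk around it once.
        cycle = [state]
        nxt = step(state, rules)
        while nxt != state:
            cycle.append(nxt)
            nxt = step(nxt, rules)
        attractors.add(frozenset(cycle))
    return attractors

# Hypothetical toy network: gene A represses gene B and vice versa.
toy_rules = [
    lambda s: not s[1],  # A is on when B is off
    lambda s: not s[0],  # B is on when A is off
]

for attractor in find_attractors(toy_rules):
    print(sorted(attractor))
```

The toy network has two fixed points (one gene on, the other off) and one two-state cycle, which illustrates how attractors are read off as the stable behaviours of the modelled system.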
Petri Nets - Example
[79, 34, 67, 92]
Difference and Differential Equations
Continuous concentration of various molecules
For difference equations, time is discrete
For differential equations, time is continuous
In general, they have the form
Differential equation: $g_i'(t) = F_i(g_1(t), \ldots, g_n(t), t)$, or in vector form $g'(t) = F(g(t), t)$
Difference equation: $g_i(t + \Delta t) = F_i(g_1(t), \ldots, g_n(t), t)$, or in vector form $g(t + \Delta t) = F(g(t), t)$
[15, 24, 96]
Difference and Differential Equations
Usually, the interactions are assumed to be linear
The model needs many parameters
Interpretation: a coefficient >> 0 for gene n in the equation for gene 1 means gene n activates gene 1
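The linear case can be sketched as follows. This assumes the model g'(t) = W g(t) with a weight matrix W (a hypothetical name, since the slide's own symbol did not survive) and turns it into a difference equation with a simple Euler step. An n-gene model needs n × n weights, which is why so many parameters are required.

```python
def simulate_linear(weights, g0, dt, steps):
    """Iterate g(t + dt) = g(t) + dt * W g(t) and return the whole trajectory."""
    g = list(g0)
    trajectory = [g]
    for _ in range(steps):
        g = [
            gi + dt * sum(w * gj for w, gj in zip(row, g))
            for gi, row in zip(g, weights)
        ]
        trajectory.append(g)
    return trajectory

# Hypothetical two-gene example: weights[0][1] > 0, so gene 2 activates gene 1.
weights = [
    [0.0, 1.0],
    [0.0, 0.0],
]
trajectory = simulate_linear(weights, g0=[0.0, 1.0], dt=0.1, steps=10)
print(trajectory[-1])
```

Gene 2 stays constant while the level of gene 1 rises steadily, which is the "activation" reading of a large positive coefficient.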
Finite State Linear Model (FSLM)
[91, 2, 66]
Stochastic Networks
In the real world, stochastic effects may play an important role
Some stochastic models have been proposed
  Noisy Networks
  Probabilistic Boolean Networks
Simulating a stochastic model is more computationally expensive
Depending on the purpose, stochastic models may not be necessary
Problem – Insufficient Data
In microarray data
  Many genes
  Small number of conditions/time points
This leads to an unreliable estimated model
[17, 53]
Current Directions to Solve Insufficiency Problem
Analysis Techniques for Small Sample Size
  Regularization
  Akaike Information Criterion (AIC)
  Bayesian Information Criterion (BIC)
  Minimum Description Length (MDL)
  …
  New model
Integrate Multiple Microarray Data
  Heterogeneous sources
  Different experiment settings
[21, 77, 54, 62, 104, 72, 84] [60, 107, 48, 8, 38]
Proposed Approach – Use Sequence
Figure labels: Genes, Promoter regions, Binding sites, Transcription Factors, Other functions
There is a lot of information in the genome sequence. We should try to use it!
Proposed Approach – Core Components
1. Protein → Binding Sites?
2. Protein → Transcription Factor?
3. DNA → RNA → Protein
The interaction between genes can therefore be inferred.
Proposed Approach – Core Components
Figure: network of TFs and genes; labels: Extra!, Missed!, Microarray Data
Our approach gives initial network!
Can be used together with other approaches
Component 1: Protein Sequence → Binding Sites
Need to predict
  Binding domains of a protein
  The DNA segment bound by the domain
  The pattern bound by the protein
Need to search for occurrence of the pattern
  Better motif model is helpful
Protein: ……………..LYDVAEYAGVSYQTVSRVV …………….
DNA: ……………..gaaggGGTCAAGGTGACCgg……………
Picture taken from http://en.wikipedia.org/wiki/DNA-binding_domain
Component 2: Protein Sequence → Transcription Factor?
Need to distinguish between
  Transcription factors, and
  Other proteins
Characteristic motifs in binding domains are helpful features
……………..LYDVAEYAGVSYQTVSRVV ……………. → Transcription Factor or Other Proteins?
Component 3: DNA → RNA → Protein Sequence
DNA → pre-mRNA: trivial, only T → U
Pre-mRNA → mRNA: alternative splicing!
mRNA → Protein sequence: genetic code of amino acids is known and quite universal
Picture taken from http://en.wikipedia.org/wiki/Alternative_splicing
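The two easy steps of this component can be sketched in a few lines. This is a simplified illustration: it assumes the input is the coding strand of an already spliced gene, so the hard step (alternative splicing) is skipped entirely, and it uses the standard genetic code. The function and constant names are illustrative.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon).replace("T", "U"): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(dna):
    """Coding-strand DNA to RNA: only T becomes U."""
    return dna.upper().replace("T", "U")

def translate(mrna):
    """Read codons from position 0 until a stop codon ('*') or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("atggcctaa")
print(mrna, translate(mrna))
```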
Proposed Plan and Phases
Started!
Preparatory
Main Classifiers
Initial Network Construction & Testing Stage
Will start soon
Short Term Subtasks
Q-gram Indexed Approximate String Matching Tool (already done)
Exploring Different Motif Models
  Motifs with gaps
Develop an Improved Tool to Search Significant Patterns and Calculate p-value
  Deterministic Finite Automata (DFA)
  Finite Markov Chain Imbedding (FMCI)
  Pattern Markov Chain (PMC)
Q-gram Indexed Approximate String Matching Tool
IDEA: quickly discard parts of the target which CANNOT contain a match
  A kind of pruning
  Pruning is a successful strategy in many problems
Figure: the pattern against the target (text/DB/…) sequence; filtered-out regions are skipped without fully sensitive checking
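The pruning idea can be sketched with a q-gram filter. This is not the tool described in the slides: it assumes mismatches only (Hamming distance), whereas the slides do not state the error model, and all names are illustrative. The filter relies on the q-gram lemma: a window of length m with at most k mismatches still shares at least (m - q + 1) - k·q aligned q-grams with the pattern, so windows with fewer shared q-grams cannot match and are never checked in full.

```python
from collections import defaultdict

def build_qgram_index(text, q):
    """Map every q-gram of the text to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def find_approx(text, pattern, k, q=3):
    """Return start positions where pattern matches text with <= k mismatches."""
    m = len(pattern)
    n_starts = len(text) - m + 1
    if n_starts <= 0:
        return []
    threshold = (m - q + 1) - k * q
    if threshold <= 0:
        # The filter cannot exclude anything; fall back to checking every window.
        candidates = range(n_starts)
    else:
        index = build_qgram_index(text, q)
        votes = defaultdict(int)
        for j in range(m - q + 1):
            for p in index.get(pattern[j:j + q], ()):
                start = p - j  # window start implied by this shared q-gram
                if 0 <= start < n_starts:
                    votes[start] += 1
        candidates = sorted(s for s, v in votes.items() if v >= threshold)
    # Fully sensitive verification only on the surviving candidates.
    return [s for s in candidates if hamming(text[s:s + m], pattern) <= k]

text = "GAAGGGGTCAAGGTGACCGG"
print(find_approx(text, "GGTCATGGTGACC", k=1))
```

The text here is the DNA fragment shown for Component 1, upper-cased; the query differs from it in one position and is still found.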
Exploring Different Motif Model
Popular Motif Model
  Position Weight Matrix (PWM)
Assumptions
  Fixed-length, contiguous
  Independence of nucleotides
Easily handles wildcards
  But difficult to handle gaps
Has been successful on some datasets
  But performs poorly on the Tompa (2005) dataset
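For readers unfamiliar with PWMs, the following is a minimal sketch of how one is built and used to scan a sequence. The example binding sites are made up for illustration, and the uniform background and pseudocount of 1 are assumptions. Both PWM assumptions are visible in the code: every site has the same length with no gaps, and the score is a plain sum over positions, which treats nucleotides as independent.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds matrix from equal-length, gap-free example sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in "ACGT"})
    return pwm

def score(pwm, window):
    """Sum of per-position log-odds (positions are treated as independent)."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(pwm)
    return max((score(pwm, seq[i:i + width]), i) for i in range(len(seq) - width + 1))

# Hypothetical aligned sites, for illustration only.
sites = ["TGACC", "TGACC", "TGACT", "TGATC"]
pwm = build_pwm(sites)
print(best_hit(pwm, "AAGGTGACCGG"))
```

A gapped motif would need sites of varying length, which this fixed-width matrix cannot represent.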
Exploring Different Motif Model
Aim:
  To explore if motifs with gaps fit the data
  To explore different notions of “over-represented”
Approach: de novo motif discovery on existing dataset
  Assuming different models
  Assuming different notions of “over-represented”
Exploring Different Motif Model Models Tested
Model Wildcard? Gaps?
Exact
No Gap
No IUPAC
General
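Exploring Different Motif Model - Notions of “over-represented”
Count score: the average of the scores $s_1, s_2, s_3, s_4$, i.e. $\frac{s_1 + s_2 + s_3 + s_4}{4}$
P-value: the pattern occurs X times; $P(> X \text{ times in background})$ under the background model
Estimated probability: from the counts $c_1, c_2, c_3, c_4$, $P(\mathrm{TFBS} \mid c_1, c_2, \ldots, c_4) = \frac{P(\mathrm{TFBS})\, P(c_1, c_2, \ldots, c_4 \mid \mathrm{TFBS})}{P(c_1, c_2, \ldots, c_4)}$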
Preliminary Results – Max F-measure
Precision
= TP/(TP+FP)
Recall
= TP/(TP+FN)
F-Measure
= 2pr/(p+r)
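The three measures can be computed directly from the counts of true positives, false positives and false negatives. The helper below is a small sketch with an illustrative name; returning 0 when a denominator is zero is a convention chosen here, not something stated in the slides.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from TP, FP and FN counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Made-up counts, for illustration only.
print(precision_recall_f(tp=8, fp=2, fn=8))
```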
Preliminary Results – Tompa
Preliminary Results
Performance is dataset-dependent
Motif models with gaps are worth exploring further
P-value and estimated probability are worth exploring
Develop an Improved Tool to Search Significant Pattern and Calculate p-value
Given
  A pattern (possibly as general as a regular expression)
  A sequence
  A model of the sequence
Want
  The number of occurrences of the pattern
  The distribution of the number of occurrences
  In particular, the p-value
Develop an Improved Tool to Search Significant Pattern and Calculate p-value
Usual Sequence Models
M00: i.i.d., letters equally likely
M0: General i.i.d.
M1: First Order Markov Chain
Mk: Kth Order Markov Chain
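As an illustration of these sequence models, the sketch below estimates a Kth order Markov chain from a sequence by counting which letter follows each length-k context. The function name is illustrative and plain maximum-likelihood counting (no smoothing) is an assumption; with k = 0 the single empty context gives the general i.i.d. model M0.

```python
from collections import Counter, defaultdict

def fit_markov(seq, k):
    """Estimate P(next letter | previous k letters) from observed counts."""
    counts = defaultdict(Counter)
    for i in range(k, len(seq)):
        counts[seq[i - k:i]][seq[i]] += 1
    return {
        context: {letter: c / sum(nexts.values()) for letter, c in nexts.items()}
        for context, nexts in counts.items()
    }

print(fit_markov("AACA", 0))  # M0: letter frequencies
print(fit_markov("AACA", 1))  # M1: first order transition probabilities
```

The number of contexts grows as 4^k for DNA, so higher orders need much more sequence to estimate reliably.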
Finite Markov Chain Imbedding and Pattern Markov Chain
Figure: C+1 copies of the DFA for the pattern, one for each occurrence count (0 times, 1 time, 2 times, …, c times), combined with the background Markov model; the transition matrix is built from blocks P and Q
Question: does the pattern occur C times?
A large Markov Chain, which can be used to calculate the desired probability easily
[27, 58, 28, 76]
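The construction can be sketched in a simplified setting. The code below assumes a plain string pattern and the i.i.d. background M0, rather than the regular expressions and Kth order Markov backgrounds the slides aim for, and its names are illustrative. It tracks the joint distribution over (occurrence count, DFA state), which corresponds to stacking copies of the pattern DFA, one per count, with the top count absorbing everything at or above it.

```python
def kmp_automaton(pattern, alphabet):
    """DFA whose state is the length of the longest pattern prefix just read."""
    m = len(pattern)
    fail = [0] * (m + 1)
    k = 0
    for i in range(1, m):
        while k and pattern[i] != pattern[k]:
            k = fail[k]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i + 1] = k
    delta = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for ch in alphabet:
            k = state
            while k and (k == m or pattern[k] != ch):
                k = fail[k]
            if k < m and pattern[k] == ch:
                k += 1
            delta[state][ch] = k
    return delta

def occurrence_distribution(pattern, n, probs, max_count):
    """P(count = 0), ..., P(count = max_count - 1), P(count >= max_count)
    for overlapping occurrences in a random i.i.d. sequence of length n."""
    m = len(pattern)
    delta = kmp_automaton(pattern, list(probs))
    # dist[c][s]: probability of count c (capped at max_count) and DFA state s.
    dist = [[0.0] * (m + 1) for _ in range(max_count + 1)]
    dist[0][0] = 1.0
    for _ in range(n):
        new = [[0.0] * (m + 1) for _ in range(max_count + 1)]
        for c in range(max_count + 1):
            for s in range(m + 1):
                p = dist[c][s]
                if p == 0.0:
                    continue
                for ch, p_ch in probs.items():
                    t = delta[s][ch]
                    # Reaching the accepting state means one more occurrence.
                    c2 = min(c + (t == m), max_count)
                    new[c2][t] += p * p_ch
        dist = new
    return [sum(row) for row in dist]

uniform = {b: 0.25 for b in "ACGT"}
dist = occurrence_distribution("GGTCA", n=1000, probs=uniform, max_count=3)
print("P(at least 3 occurrences) =", dist[-1])
```

The last entry is the p-value for observing at least max_count occurrences. Extending this to a Markov background would additionally require the chain state to remember the last k letters.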
Existing Tool: SPatt
SPatt (Nuel 2008) allows
  Arbitrary finite alphabet
  Wildcard “.”, which matches any character
  Wildcard “[abc]”, which matches any of a, b, c
  Gaps “.(a-b)”, which means a to b wildcards
  Alternative “p1 | p2”, which means p1 or p2
But currently
  Only 1st Order Markov Model
  No full regular expression
Want to Improve
  Any Kth order Markov Model
  Allow Regular Expression
[76]
Summary
Work Done
  Developed an approximate string matching tool
  Collected TRANSFAC data
  Started exploring different motif models
Thanks for your attention!
Q & A