Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence


LO Leung Yau

7th May, 2009

Outline

Biological Background
Objective
Current Approaches
  Various Models
  Problem: Insufficient Data
Proposed Approach
  Predict TFBS from protein sequence
  Predict from protein sequence whether it is a TF
  Gene sequence → Protein sequence
Short Term Subtasks
  Better Motif Model
  Better tool to calculate P-value of Patterns

Biological Background – Cell

Basic unit of organisms
  Prokaryotic
  Eukaryotic
A bag of chemicals
Metabolism controlled by various enzymes
Correct working needs suitable amounts of various proteins

Picture taken from http://en.wikipedia.org/wiki/Cell_(biology)

Biological Background – Protein

Polymer of 20 types of amino acids
Folds into a 3D structure
Shape determines the function
Many types
  Transcription Factors
  Enzymes
  Structural Proteins
  …

Pictures taken from http://en.wikipedia.org/wiki/Protein and http://en.wikipedia.org/wiki/Amino_acid

Biological Background – DNA & RNA

DNA
  Double stranded
  Adenine, Cytosine, Guanine, Thymine
  A-T, G-C
  Those parts coding for proteins are called genes
RNA
  Single stranded
  Adenine, Cytosine, Guanine, Uracil

Picture taken from http://en.wikipedia.org/wiki/Gene

Biological Background – DNA → RNA → Protein

Picture taken from http://en.wikipedia.org/wiki/Gene

Biological Background – DNA → RNA → Protein

Transcriptional Regulatory Network is the complex interaction between genes, transcription factors (TF) and transcription factor binding sites (TFBS).

[Figure labels: Genes, Promoter regions, Transcription Factors, Binding sites, Other functions]

Complex Interactions between Genes, TFs and TFBSs


Importance of Inferring Transcriptional Regulatory Network

Revealing the working of a cell and life
Related to many diseases
  Genetic disorders
Understanding them will help us
  Understand the diseases
  Design drugs to cure the diseases
  Genetic engineering

Objective

To infer the transcriptional regulatory network (gene network) from genetic and experimental data, utilizing different data sources as and when appropriate


Current Approaches

Main Data Source
  Gene Expression Microarray Data
Models
  Parts Lists
  Topology Models
  Control Logic Models
  Dynamic Models
Problem
  Insufficient Data

Gene Expression Microarray Data

High throughput
Measures RNA level
Relies on A-T, G-C pairing
Can monitor expression of many genes

Picture taken from http://en.wikipedia.org/wiki/DNA_microarray_experiment

Gene Expression Microarray Data

Picture taken from http://en.wikipedia.org/wiki/DNA_microarray

Various Models of Transcriptional Regulatory Network (Gene Network)

Different levels of detail
  Parts Lists
  Topology Models
  Control Logic Models
  Dynamic Models
    Boolean Network
    Petri Nets
    Difference and Differential Equations
    Finite State Linear Model (FSLM)
    Stochastic Networks

[86, 87, 88]

Parts List

The basic components of the gene network that we model, including
  Genes
  Transcription Factors
  Promoters
  Transcription Factor Binding Sites
  …

Topology Models – Example

Control Logic Models

Dynamic Models

Describe and simulate the dynamic changes in the state of the system
Predict the network's response to various environmental changes and stimuli
  Boolean Network
  Petri Nets
  Difference and Differential Equations
  Hybrid: Finite State Linear Model (FSLM)
  Stochastic Networks

Boolean Network

[42, 93, 1, 55]

Boolean Network – Fission Yeast Example

10 genes, 2^10 = 1024 states

[22]
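
A Boolean network with n genes has 2^n possible states, which is where 1024 states for 10 genes comes from. The following is a minimal sketch of a synchronous Boolean update. The three genes A, B, C and their rules are made up for illustration and are not the fission yeast model of [22].

```python
from itertools import product

# Hypothetical update rules: each gene's next value is a Boolean
# function of the current state (a dict mapping gene name -> bool).
RULES = {
    "A": lambda s: not s["C"],         # C represses A
    "B": lambda s: s["A"],             # A activates B
    "C": lambda s: s["A"] and s["B"],  # A and B together activate C
}

def step(state):
    """Synchronous update: every gene switches at the same time."""
    return {gene: bool(rule(state)) for gene, rule in RULES.items()}

def all_states(genes):
    """Enumerate the 2**n states of an n-gene network."""
    return [dict(zip(genes, bits))
            for bits in product([False, True], repeat=len(genes))]

def trajectory(state, max_steps=20):
    """Follow the dynamics until a state repeats (an attractor is reached)."""
    seen = []
    while state not in seen and len(seen) < max_steps:
        seen.append(state)
        state = step(state)
    return seen
```

Enumerating all states quickly becomes infeasible as n grows, which is one reason the level of detail of the model matters.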

Petri Nets - Example

[79, 34, 67, 92]

Difference and Differential Equations

Continuous concentration of various molecules
For difference equations, time is discrete
For differential equations, time is continuous
In general, they have the form

$g(t + \Delta t) = F(g(t), t)$, i.e. $g_i(t + \Delta t) = F_i(g_1(t), \ldots, g_n(t), t)$

$g'(t) = F(g(t), t)$, i.e. $g_i'(t) = F_i(g_1(t), \ldots, g_n(t), t)$

[15, 24, 96]

Difference and Differential Equations

Usually, the interactions are assumed to be linear
The model needs many parameters
Interpretation: a positive coefficient (> 0) of gene n in the equation for gene 1 means gene n activates gene 1
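
A minimal sketch of a linear difference-equation model follows. It assumes the specific update g_i(t + dt) = g_i(t) + dt * sum_j w[i][j] * g_j(t); this form, the function name and the weight values are illustrative assumptions, not taken from the slides. With n genes the weight matrix alone has n * n entries, which illustrates why such models need many parameters.

```python
def simulate_linear(weights, g0, dt, steps):
    """Iterate g_i(t + dt) = g_i(t) + dt * sum_j weights[i][j] * g_j(t).

    weights[i][j] > 0 means gene j activates gene i,
    weights[i][j] < 0 means gene j represses gene i.
    """
    n = len(g0)
    g = list(g0)
    history = [g]
    for _ in range(steps):
        g = [g[i] + dt * sum(weights[i][j] * g[j] for j in range(n))
             for i in range(n)]
        history.append(g)
    return history

# Hypothetical 2-gene network: gene 2 activates gene 1,
# and gene 1 represses gene 2.
weights = [[0.0, 0.5],
           [-0.5, 0.0]]
history = simulate_linear(weights, g0=[1.0, 1.0], dt=0.1, steps=3)
```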

Finite State Linear Model (FSLM)

[91, 2, 66]

Stochastic Networks

In the real world, stochastic effects may play an important role

Some stochastic models have been proposed
  Noisy Networks
  Probabilistic Boolean Networks

Simulating a stochastic model is more computationally expensive

Depending on the purpose, stochastic models may not be necessary


Problem – Insufficient Data

In microarray data
  Many genes
  Small number of conditions/time points
This leads to an unreliable estimated model

[17, 53]

Current Directions to Solve Insufficiency Problem

Analysis Techniques for Small Sample Size
  Regularization
  Akaike Information Criterion (AIC)
  Bayesian Information Criterion (BIC)
  Minimum Description Length (MDL)
  …
  New model
Integrate Multiple Microarray Data
  Heterogeneous sources
  Different experiment settings

[21, 77, 54, 62, 104, 72, 84] [60, 107, 48, 8, 38]
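
As a reminder of how two of the criteria listed above penalize model size, here is a small sketch using the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of parameters, n the number of samples and L the maximized likelihood. The function names are chosen here for illustration.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion: lower is better."""
    return n_params * math.log(n_samples) - 2 * log_likelihood
```

With few conditions or time points, a network model with many parameters must improve the log-likelihood substantially to be preferred over a smaller one.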


Proposed Approach – Use Sequence

[Figure labels: Genes, Promoter regions, Transcription Factors, Binding sites, Other functions]

There is a lot of information in the genome sequence. We should try to use it!

Proposed Approach – Core Components

[Figure: DNA → RNA → Protein, annotated with the questions "Binding Sites?" and "Transcription Factor?"; the three components are numbered 1, 2 and 3]

The interaction between genes can therefore be inferred.

Proposed Approach – Core Components

[Figure: network linking TFs and genes; labels: Extra!, Missed!, Microarray Data]

Our approach gives an initial network!
It can be used together with other approaches.

Component 1: Protein Sequence → Binding Sites

Need to predict
  Binding domains of a protein
  The DNA segment bound by the domain
  The pattern bound by the protein
Need to search for occurrences of the pattern
  A better motif model is helpful

Protein: ……………..LYDVAEYAGVSYQTVSRVV …………….
DNA: ……………..gaaggGGTCAAGGTGACCgg……………

Picture taken from http://en.wikipedia.org/wiki/DNA-binding_domain

Component 2: Protein Sequence → Transcription Factor?

Need to distinguish between
  Transcription factors, and
  Other proteins
Characteristic motifs in binding domains are helpful features

[Figure: ……………..LYDVAEYAGVSYQTVSRVV ……………. classified as Transcription Factor or Other Proteins]

Component 3: DNA → RNA → Protein Sequence

DNA → pre-mRNA
  Trivial, only T → U
Pre-mRNA → mRNA
  Alternative splicing!
mRNA → Protein sequence
  Genetic code of amino acids is known and quite universal

Picture taken from http://en.wikipedia.org/wiki/Alternative_splicing
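
The two easy steps of this component can be sketched as follows, using the standard genetic code. The sketch deliberately skips the hard step, splicing: it assumes the input is already the spliced coding sequence, read from its first base. The function names are chosen here for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order;
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def transcribe(coding_strand):
    """DNA coding strand -> RNA: only T is replaced by U."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna):
    """Translate an mRNA coding sequence until the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, `translate(transcribe("ATGGCCTAA"))` reads the codons AUG, GCC and UAA and stops at the last one.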

Proposed Plan and Phases

Started!

Preparatory

Main Classifiers

Initial Network Construction & Testing Stage

Will start soon


Short Term Subtasks

Q-gram Indexed Approximate String Matching Tool (already done)
Exploring Different Motif Models
  Motifs with gaps
Develop an Improved Tool to Search Significant Patterns and Calculate p-value
  Deterministic Finite Automata (DFA)
  Finite Markov Chain Imbedding (FMCI)
  Pattern Markov Chain (PMC)

Q-gram Indexed Approximate String Matching Tool

IDEA: quickly discard parts of the target which CANNOT contain a match
  A kind of pruning
  Pruning is a successful strategy in many problems

[Figure: a pattern compared against a target (Text/DB/…) sequence; filtered-out regions are not given the fully sensitive check]
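
The filtering idea can be sketched for the simplest case, matching with at most k substitutions (Hamming distance). This is an illustrative sketch, not the tool described in the talk; the function names and the restriction to substitutions are assumptions. A pattern of length m contains m - q + 1 q-grams and each mismatch destroys at most q of them, so a window that shares fewer than (m - q + 1) - k*q aligned q-grams with the pattern cannot be a match and is discarded without full checking.

```python
from collections import defaultdict

def build_qgram_index(text, q):
    """Map every q-gram of the text to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index

def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def qgram_search(text, pattern, k, q):
    """Start positions where pattern matches text with <= k mismatches."""
    m = len(pattern)
    n_windows = len(text) - m + 1
    if n_windows <= 0:
        return []
    threshold = (m - q + 1) - k * q
    if threshold <= 0:
        # The filter cannot exclude anything: verify every window.
        candidates = range(n_windows)
    else:
        index = build_qgram_index(text, q)
        votes = defaultdict(int)
        for j in range(m - q + 1):
            for p in index.get(pattern[j:j + q], ()):
                start = p - j  # window implied by this shared q-gram
                if 0 <= start < n_windows:
                    votes[start] += 1
        candidates = sorted(s for s, v in votes.items() if v >= threshold)
    # Fully sensitive check only on the surviving candidates.
    return [s for s in candidates if hamming(text[s:s + m], pattern) <= k]
```

Larger q makes the index more selective but lowers the threshold, so the choice of q trades filtering power against the number of errors that can be tolerated.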


Exploring Different Motif Model

Popular Motif Model
  Position Weight Matrix (PWM)
Assumptions
  Fixed-length, contiguous
  Independence of nucleotides
Easily handles wildcards
  But difficult to handle gaps
Has been successful in some datasets
  But performs poorly on the Tompa (2005) dataset
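
A minimal sketch of a PWM makes the two assumptions visible: the matrix has a fixed width, and the score of a window is a plain sum over positions (independence). The log-odds form, the pseudocount, the uniform background and the example sites are illustrative assumptions, not details from the talk.

```python
import math

def build_pwm(sites, alphabet="ACGT", pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length aligned sites."""
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total)
                            / background)
            for base in alphabet
        })
    return pwm

def score(pwm, window):
    """Independence assumption: per-position scores simply add up."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Start position of the highest-scoring fixed-length window."""
    width = len(pwm)
    return max(range(len(sequence) - width + 1),
               key=lambda i: score(pwm, sequence[i:i + width]))
```

A wildcard position is simply a column with nearly uniform weights, whereas a variable-length gap would change the width of the window, which this representation cannot express.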

Exploring Different Motif Model

Aim:
  To explore if motifs with gaps fit the data
  To explore different notions of “over-represented”
Approach: de novo motif discovery on an existing dataset
  Assuming different models
  Assuming different notions of “over-represented”

Exploring Different Motif Model

Models Tested

Model      Wildcard?  Gaps?
Exact      No         No
No Gap     Yes        No
No IUPAC   No         Yes
General    Yes        Yes

Exploring Different Motif Model - Notions of “over-represented”

Count score: the scores s1, s2, s3, s4 are averaged, (s1+s2+s3+s4)/4
P-value: for a pattern occurring X times, P(> X times in background) under the background model
Estimated probability: from the counts c1, c2, c3, c4,
  P(TFBS | c1,c2,…,c4) = P(TFBS) P(c1,c2,…,c4 | TFBS) / P(c1,c2,…,c4)
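
The count score and the estimated probability can be sketched in a few lines. The sketch assumes the evidence term P(c1,…,c4) is expanded over two classes, TFBS and background; this expansion and all function and parameter names are assumptions made for illustration.

```python
def count_score(scores):
    """Average of the individual scores, e.g. (s1+s2+s3+s4)/4."""
    return sum(scores) / len(scores)

def posterior_tfbs(prior, lik_given_tfbs, lik_given_background):
    """Bayes' rule: P(TFBS | counts) = P(TFBS) P(counts | TFBS) / P(counts).

    Assumption: P(counts) is obtained by total probability over the
    two classes "TFBS" and "background".
    """
    evidence = (prior * lik_given_tfbs
                + (1.0 - prior) * lik_given_background)
    return prior * lik_given_tfbs / evidence
```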

Preliminary Results – Max F-measure

Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
F-Measure = 2pr/(p+r)
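
The three measures above can be computed directly from the counts of true positives, false positives and false negatives. Returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from TP, FP and FN counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```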

Preliminary Results – Tompa

Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
F-Measure = 2pr/(p+r)


Preliminary Results

Performance is data-set dependent
Motif models with gaps are worth exploring more
P-value and Est-Prob are worth exploring


Develop an Improved Tool to Search Significant Pattern and Calculate p-value

Given
  A pattern (possibly as general as a regular expression)
  A sequence
  A model of the sequence
Want
  The occurrence number of the pattern
  The distribution of the occurrence number
  In particular, the p-value

Develop an Improved Tool to Search Significant Pattern and Calculate p-value

Usual Sequence Models

M00: i.i.d., letters equally likely

M0: General i.i.d.

M1: First Order Markov Chain

Mk: Kth Order Markov Chain
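
A minimal sketch of estimating the M0 and M1 models from a sequence by simple counting. The function names are illustrative, and no smoothing is applied, so transitions never seen in the training sequence have no entry and raise a KeyError when scored.

```python
import math
from collections import Counter

def fit_m0(seq):
    """M0: i.i.d. letter probabilities estimated by relative frequency."""
    return {a: c / len(seq) for a, c in Counter(seq).items()}

def fit_m1(seq):
    """M1: first-order transition probabilities P(b | a)."""
    pair_counts = Counter(zip(seq, seq[1:]))
    from_counts = Counter(seq[:-1])
    return {(a, b): c / from_counts[a] for (a, b), c in pair_counts.items()}

def log_prob_m1(seq, initial, transitions):
    """Log-probability of seq under a first-order Markov chain."""
    log_p = math.log(initial[seq[0]])
    for a, b in zip(seq, seq[1:]):
        log_p += math.log(transitions[(a, b)])
    return log_p
```

M00 is the special case of M0 in which every letter has probability 1/|alphabet|; an Mk model conditions on the previous k letters instead of one.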

Finite Markov Chain Imbedding and Pattern Markov Chain

[Figure: C+1 copies of the DFA for the pattern, one per occurrence count (0 times, 1 time, 2 times, …, c times), combined with the background Markov model; transition blocks labelled P and Q]

A large Markov chain, which can be used to calculate the desired probability easily

[27, 58, 28, 76]
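
The construction can be sketched for the simplest setting: a single word as the pattern and an i.i.d. (M0) background. This is an illustrative sketch under those assumptions, not the tool proposed in the talk. The state of the large chain is the pair (occurrence count, DFA state), which corresponds to stacking copies of the DFA, one per count, with the last copy absorbing all counts at or above the cap.

```python
def build_dfa(pattern, alphabet):
    """DFA whose state is the length of the longest pattern prefix that
    is a suffix of the text read so far (overlaps are handled)."""
    m = len(pattern)
    dfa = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for ch in alphabet:
            s = pattern[:state] + ch
            k = min(m, len(s))
            while k > 0 and pattern[:k] != s[-k:]:
                k -= 1
            dfa[state][ch] = k
    return dfa

def occurrence_distribution(pattern, letter_probs, n, max_count):
    """Return [P(N=0), ..., P(N=max_count-1), P(N>=max_count)] for the
    number N of (possibly overlapping) occurrences in n i.i.d. letters."""
    alphabet = sorted(letter_probs)
    dfa = build_dfa(pattern, alphabet)
    m = len(pattern)
    # dist[c][s]: probability of c occurrences so far with the DFA in state s
    dist = [[0.0] * (m + 1) for _ in range(max_count + 1)]
    dist[0][0] = 1.0
    for _ in range(n):
        new = [[0.0] * (m + 1) for _ in range(max_count + 1)]
        for c in range(max_count + 1):
            for s in range(m + 1):
                p = dist[c][s]
                if p == 0.0:
                    continue
                for ch in alphabet:
                    nxt = dfa[s][ch]
                    # Entering the accepting state moves to the next copy.
                    c2 = min(max_count, c + 1) if nxt == m else c
                    new[c2][nxt] += p * letter_probs[ch]
        dist = new
    return [sum(layer) for layer in dist]
```

The last entry of the returned list is the p-value for observing at least `max_count` occurrences. Extending the sketch to a Kth order Markov background would mean adding the last K letters to the state.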

Existing Tool: SPatt

SPatt (Nuel 2008) allows
  Arbitrary finite alphabet
  Wildcard “.”, which matches any character
  Wildcard “[abc]”, which matches any of a, b, c
  Gaps “.(a-b)”, which means a to b wildcards
  Alternative “p1 | p2”, which means p1 or p2
But currently
  Only 1st order Markov model
  No full regular expression
Want to improve
  Any Kth order Markov model
  Allow regular expressions

[76]

Summary

Work Done
  Developed an approximate string matching tool
  Collected TRANSFAC data
  Started exploring different motif models

Thanks for your attention!

Q & A
