Transcription factor binding motifs (part I) 10/17/07

Page 1: Transcription factor binding motifs (part I) 10/17/07

Transcription factor binding motifs (part I)

10/17/07

Page 2: Transcription factor binding motifs (part I) 10/17/07

Steps of gene transcription

[Figure labels: activator, TATA, TFIID, Pol II]

The term “transcription factor” (TF) usually means an activator or repressor.

Page 3: Transcription factor binding motifs (part I) 10/17/07

Understand Regulation

• Which TFs are involved in the regulation?

• Does a TF enhance / repress gene expression?

• Which genes are regulated by this TF?

• Are there binding partners / competitors for the TF?

• Why does disease arise when a TF goes wrong?

Page 5: Transcription factor binding motifs (part I) 10/17/07

Sequence specificity of TF binding

Page 6: Transcription factor binding motifs (part I) 10/17/07

Motif representation

• Consensus: GCGAA

• PWM

Alignment matrix

Page 7: Transcription factor binding motifs (part I) 10/17/07

Motif representation

• Consensus: GCGAA

• PWM

frequency matrix

Page 8: Transcription factor binding motifs (part I) 10/17/07

Motif representation

• Consensus: GCGAA

• PWM

• Logo
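The three representations above can be derived from one set of aligned binding sites. The following is a minimal sketch, assuming a small hypothetical list of aligned sites; the function names and the uniform-background information content (the column height in a logo, without small-sample correction) are illustrative choices, not taken from the slides.

```python
import math

ALPHABET = "ACGT"

def alignment_matrix(sites):
    """Count each base at each position of the aligned sites."""
    w = len(sites[0])
    return [{b: sum(s[i] == b for s in sites) for b in ALPHABET} for i in range(w)]

def frequency_matrix(counts, pseudo=0.0):
    """Turn counts into per-position frequencies (optional pseudocount)."""
    freqs = []
    for col in counts:
        total = sum(col.values()) + pseudo * len(ALPHABET)
        freqs.append({b: (col[b] + pseudo) / total for b in ALPHABET})
    return freqs

def consensus(freqs):
    """Most frequent base at each position."""
    return "".join(max(ALPHABET, key=lambda b: col[b]) for col in freqs)

def information_content(freqs):
    """Bits per position against a uniform background (logo column height)."""
    return [2.0 + sum(p * math.log2(p) for p in col.values() if p > 0) for col in freqs]

# Hypothetical aligned sites, for illustration only.
sites = ["GCGAA", "GCGAT", "GCCAA", "GAGAA"]
counts = alignment_matrix(sites)
freqs = frequency_matrix(counts)
print(consensus(freqs))
print(information_content(freqs))
```

A fully conserved column carries 2 bits, while a column with mixed bases carries less, which is what the letter heights in a logo convey.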

Page 9: Transcription factor binding motifs (part I) 10/17/07

Objectives of motif finding

• Known motif mapping
  – Given a known motif, find all the matches over a query sequence.

• De novo motif discovery
  – Both motif patterns and match positions are unknown.
  – Much harder.

Page 10: Transcription factor binding motifs (part I) 10/17/07

Known Motif Mapping

• The matching score for a new sequence x is given by

  $$S = \log_2 \frac{\Pr(x \mid \text{motif model})}{\Pr(x \mid \text{background model})} = \log_2 \frac{\Pr(x \mid \theta_m)}{\Pr(x \mid \theta_0)}$$

  where $\theta_m = (p_{ij})$, $i = 1, \ldots, w$; $j = A, C, G, T$, are the entries in the frequency matrix, so that $\Pr(x \mid \theta_m) = \prod_{i=1}^{w} p_{i, x_i}$;

  $\theta_0$ is the background model: $p_0(A), \ldots, p_0(T)$, or it can be a third-order Markov model (see next slide).

• Calculate the matching score for all genomic sequences. Motif sites correspond to the highest scores.
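As a rough illustration of known motif mapping, the sketch below scores every window of a query sequence with the log2 ratio of motif-model to background-model probability. The frequency matrix values, the uniform zeroth-order background, and the function names are assumptions made for this example only.

```python
import math

# Hypothetical frequency matrix (one dict per motif position); the values
# are illustrative and simply favour the consensus GCGAA.
PWM = [
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
# Assumed zeroth-order background: all bases equally likely.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def log_odds(segment, pwm, background):
    """log2 of Pr(segment | motif model) / Pr(segment | background model)."""
    return sum(math.log2(col[b] / background[b]) for col, b in zip(pwm, segment))

def scan(sequence, pwm, background):
    """Score every window of motif width; returns (start, score) pairs."""
    w = len(pwm)
    return [(i, log_odds(sequence[i:i + w], pwm, background))
            for i in range(len(sequence) - w + 1)]

def best_site(sequence, pwm, background):
    """Window with the highest matching score."""
    return max(scan(sequence, pwm, background), key=lambda hit: hit[1])

start, score = best_site("TTGCGAATT", PWM, BACKGROUND)
print(start, round(score, 3))
```

In practice a score threshold, rather than only the single best window, would usually decide which positions are reported as sites.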

Page 11: Transcription factor binding motifs (part I) 10/17/07

Third-order Markov model

• The probability of generating a new base is dependent on the previous three bases.

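3rd-order Markov dependency, e.g. for the sequence ATGTATTC:

$$P(\text{ATGTATTC}) = P(\text{ATG}) \, P(T \mid \text{ATG}) \, P(A \mid \text{TGT}) \, P(T \mid \text{GTA}) \, P(T \mid \text{TAT}) \, P(C \mid \text{ATT})$$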
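A third-order background model can be estimated by counting which base follows each 3-mer context. The sketch below is one simple way to do that, assuming a pseudocount of 1 for unseen transitions; the function names and the training sequence are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(sequences, order=3, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(base, context):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob

def conditional_log2_prob(seq, prob, order=3):
    """log2 P(seq[order:] | seq[:order]) under the Markov model."""
    return sum(math.log2(prob(seq[i], seq[i - order:i]))
               for i in range(order, len(seq)))

# Toy training data, for illustration only.
prob = train_markov(["ATGTATTC"])
print(prob("T", "ATG"))
print(conditional_log2_prob("ATGTATTC", prob))
```

The returned log-probability can stand in for the background term of the matching score, with the first three bases handled by a lower-order model.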

Page 12: Transcription factor binding motifs (part I) 10/17/07

De novo motif discovery

• Statistical approach
  – Identify sequence patterns that occur more frequently than expected at random.
  – Target regions:
    • Promoter regions of co-regulated genes
    • Promoter regions of differentially expressed genes
    • Experimentally identified TF binding sites
  – Very common

• Biophysical approach
  – Calculate protein-DNA binding affinities from first principles.
  – See Roider et al. 2006 for an example.

Page 13: Transcription factor binding motifs (part I) 10/17/07

Methods

• PWM modeling
  – MEME, GMS, AlignACE, BioProspector

• Word enumeration
  – YMF, MDscan

• Use negative control
  – REDUCE, MotifRegressor

• Comparative genomics
  – MCS, CompareProspector, PhyloCon

• ChIP-chip (will discuss later)

Page 14: Transcription factor binding motifs (part I) 10/17/07

The challenges

no motif sites

Page 15: Transcription factor binding motifs (part I) 10/17/07

The challenges

multiple motif sites

Page 16: Transcription factor binding motifs (part I) 10/17/07

The challenges

variable relative positions

Page 17: Transcription factor binding motifs (part I) 10/17/07

The challenges

variable sequence pattern

ATCCG

ATTCG

Page 18: Transcription factor binding motifs (part I) 10/17/07

MEME (Bailey and Elkan 1994)

• Input
  – A set of sequences: $Y = \{Y_i\}$
  – For a fixed length w, partition Y into overlapping w-mers: $X = \{X_i\}$
  – An alphabet: $A = \{a_j\} = \{A, C, G, T\}$

• Mixture model
  – Motif model $\theta_m = (p_{ij})$, $i = 1, \ldots, w$; $j = A, C, G, T$
  – Background model $\theta_0$: 0th- or 3rd-order Markov
  – $X_i \sim \lambda \theta_m + (1 - \lambda) \theta_0$

Page 19: Transcription factor binding motifs (part I) 10/17/07

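Log-likelihood

• Missing data: $Z = \{Z_i\}$, where $Z_i = 1$ if $X_i$ is a motif site and $Z_i = 0$ otherwise

• The log-likelihood is

  $$\log L = \sum_i \left[ Z_i \log\big(\lambda \Pr(X_i \mid \theta_m)\big) + (1 - Z_i) \log\big((1 - \lambda) \Pr(X_i \mid \theta_0)\big) \right]$$

  where $\theta_m$ is the motif model, $\theta_0$ the background model, and $\lambda$ the mixing proportion.

• Select $\theta_m$ and $\lambda$ to maximize the log-likelihood, but how?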

Page 20: Transcription factor binding motifs (part I) 10/17/07

Expectation-Maximization (EM)

• Iteratively update hidden states and parameter values. Commonly used in bioinformatics research.

• E-step:
  – Under the current estimates of the parameters ($\theta_m$, $\lambda$) and the observed data, evaluate the expected value of the log-likelihood over the values of the missing data Z.

Page 21: Transcription factor binding motifs (part I) 10/17/07

Expectation Maximization (EM)

• M-step:
  – Update the parameters so that the expected log-likelihood is maximized.

For

For

Iterate the E- and M-steps until convergence.
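The E- and M-steps can be sketched for a two-component mixture over w-mers. This is a simplified illustration in the spirit of MEME, not its actual implementation: it assumes a fixed zeroth-order background estimated from base frequencies, initialization from a seed w-mer, a pseudocount of 0.5, and a fixed number of iterations; all names are hypothetical.

```python
import math

ALPHABET = "ACGT"

def wmers(sequences, w):
    """All overlapping w-mers of all sequences."""
    return [s[i:i + w] for s in sequences for i in range(len(s) - w + 1)]

def em_motif(sequences, w, seed, n_iter=50, lam=0.1, pseudo=0.5):
    X = wmers(sequences, w)
    bases = "".join(sequences)
    # Assumed background: zeroth-order base frequencies, kept fixed.
    bg = {b: (bases.count(b) + 1) / (len(bases) + 4) for b in ALPHABET}
    # Initial guess for the motif model: favour the seed w-mer.
    theta = [{b: (0.7 if b == c else 0.1) for b in ALPHABET} for c in seed]
    z = []
    for _ in range(n_iter):
        # E-step: expected value of Z_i (probability that X_i is a motif site).
        z = []
        for x in X:
            pm = lam * math.prod(theta[i][b] for i, b in enumerate(x))
            pb = (1 - lam) * math.prod(bg[b] for b in x)
            z.append(pm / (pm + pb))
        # M-step: re-estimate the mixing proportion and the frequency matrix.
        lam = sum(z) / len(z)
        theta = []
        for i in range(w):
            col = {b: pseudo for b in ALPHABET}
            for x, zi in zip(X, z):
                col[x[i]] += zi
            total = sum(col.values())
            theta.append({b: v / total for b, v in col.items()})
    return theta, lam, z

# Toy sequences, each containing GCGAA, for illustration only.
seqs = ["TTGCGAATCA", "ACGCGAATGT", "CATGCGAACT", "GCGAATTACC"]
theta, lam, z = em_motif(seqs, w=5, seed="GCGAA")
print("".join(max(ALPHABET, key=lambda b: col[b]) for col in theta))
```

Because the result depends on the starting point, a run like this would normally be repeated from several seeds and the best-scoring motif kept.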

Page 22: Transcription factor binding motifs (part I) 10/17/07

Issue with EM algorithm

• Can get trapped in a local optimum (a local maximum of the likelihood)

• Results depend on the initial guess

• Often need to do multiple runs starting from different initial guesses, then pick the best one.

Page 23: Transcription factor binding motifs (part I) 10/17/07

Gibbs sampling

• Gibbs sampling is an algorithm to generate a sequence of samples from the joint probability distribution of two or more random variables

• Gibbs sampling is applicable when the joint distribution is not known explicitly, but the conditional distribution of each variable is known.

• The sequence of samples comprises a Markov Chain.

• As the iteration number goes to infinity, the asymptotic distribution approaches the underlying joint distribution.
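To make the idea concrete outside the motif setting, the sketch below draws from a standard bivariate normal with correlation rho using only the two conditional distributions, each of which is a univariate normal. The target distribution, burn-in length, and function names are assumptions chosen for illustration.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Sample (x, y) from a standard bivariate normal with correlation rho,
    using only the conditionals x|y ~ N(rho*y, 1-rho^2) and y|x ~ N(rho*x, 1-rho^2)."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)
    x = y = 0.0
    samples = []
    for t in range(burn_in + n_samples):
        x = rng.gauss(rho * y, sd)   # update x given the current y
        y = rng.gauss(rho * x, sd)   # update y given the new x
        if t >= burn_in:
            samples.append((x, y))
    return samples

def sample_correlation(samples):
    """Pearson correlation of the sampled pairs."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    cov = sum((x - mx) * (y - my) for x, y in samples) / n
    vx = sum((x - mx) ** 2 for x, _ in samples) / n
    vy = sum((y - my) ** 2 for _, y in samples) / n
    return cov / math.sqrt(vx * vy)

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
print(round(sample_correlation(samples), 2))
```

Successive samples are correlated because each depends on the previous state, so the chain needs enough iterations before its empirical distribution resembles the joint distribution.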

Page 24: Transcription factor binding motifs (part I) 10/17/07

Key differences between EM and Gibbs sampling

EM                                    Gibbs sampling
Maximum likelihood                    Posterior
Deterministic                         Stochastic
Frequentist                           Bayesian
Initialize seed for the parameters    Initialize prior for the parameters

Page 25: Transcription factor binding motifs (part I) 10/17/07

Gibbs Motif Sampler

(Lawrence et al. 1993; Liu et al. 1995)

Assume each sequence contains one motif site, but the site positions and the motif frequency matrix are unknown.

Page 26: Transcription factor binding motifs (part I) 10/17/07

Gibbs Motif Sampler

• Take out one sequence with its sites from the current motif

Page 27: Transcription factor binding motifs (part I) 10/17/07

Gibbs Motif Sampler

• Score each possible segment of this sequence

[Figure: Segment Scores of Sequence 1. x-axis: Starting Position of Segment (1 to 20); y-axis: Segment Score (0 to 30). Labeled point: Segment (2-7): 3.]

Page 28: Transcription factor binding motifs (part I) 10/17/07

Gibbs Motif Sampler

• Sample a new segment to put the sequence back

[Figure: Segment Scores of Sequence 1. x-axis: Starting Position of Segment (1 to 20); y-axis: Segment Score (0 to 30).]
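The three steps (take one sequence out, score each of its segments against the motif built from the others, sample a new segment in proportion to the scores) can be sketched as follows. This is a simplified illustration that assumes exactly one site per sequence, a uniform background, and a pseudocount of 0.5; it is not the published sampler, and all names are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"

def build_pwm(sites, pseudo=0.5):
    """Frequency matrix from the current sites (with pseudocounts)."""
    pwm = []
    for i in range(len(sites[0])):
        col = {b: pseudo for b in ALPHABET}
        for s in sites:
            col[s[i]] += 1
        total = sum(col.values())
        pwm.append({b: v / total for b, v in col.items()})
    return pwm

def gibbs_motif_sampler(sequences, w, n_sweeps=200, seed=0):
    rng = random.Random(seed)
    bg = 0.25  # assumed uniform background probability per base
    # Start from a random site position in every sequence.
    pos = [rng.randrange(len(s) - w + 1) for s in sequences]
    for _ in range(n_sweeps):
        for k, seq in enumerate(sequences):
            # 1. Take out sequence k and rebuild the motif from the others.
            others = [s[p:p + w]
                      for j, (s, p) in enumerate(zip(sequences, pos)) if j != k]
            pwm = build_pwm(others)
            # 2. Score each possible segment of sequence k (likelihood ratio).
            weights = [math.prod(pwm[i][b] / bg for i, b in enumerate(seq[a:a + w]))
                       for a in range(len(seq) - w + 1)]
            # 3. Sample a new segment with probability proportional to its score.
            pos[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

# Toy sequences, for illustration only.
seqs = ["TTGCGAATCA", "ACGCGAATGT", "CATGCGAACT", "GCGAATTACC"]
print(gibbs_motif_sampler(seqs, w=5))
```

Because positions are sampled rather than chosen greedily, different seeds can give different trajectories, which is why the result is usually summarized over many sweeps or several runs.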

Page 29: Transcription factor binding motifs (part I) 10/17/07

Advantage of Gibbs sampling

• Stochastic sampling permits the algorithm to escape from local optima. More robust than the deterministic updates used in EM.

• Fast.

Page 30: Transcription factor binding motifs (part I) 10/17/07

Transcription level changes in glucose vs galactose

(Roth 1998)

Page 31: Transcription factor binding motifs (part I) 10/17/07

(Roth 1998)

Page 32: Transcription factor binding motifs (part I) 10/17/07

MDscan

(Liu et al. 2002)

• Basic idea
  – True targets are likely to be more differentially expressed than other genes.

• Procedure:
  – Rank genes according to p-values, gene expression levels, etc.
  – Search for TF motifs in the highest-ranking targets first (high signal/background ratio)
  – Refine candidate motifs with all targets

Page 33: Transcription factor binding motifs (part I) 10/17/07

Similarity defined by m-match

For a given w-mer and any other random w-mer

TGTAACGT 8-mer

TGTAACGT matched 8

AGTAACGT matched 7

TGCAACAT matched 6

TGACACGG matched 5

AATAACAG matched 4

m-matches for TGTAACGT

Pick a reasonable m to call two w-mers similar
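The m-match definition amounts to counting identical positions between two w-mers of equal length. A minimal sketch, with hypothetical function names, using the 8-mers listed above:

```python
def count_matches(a, b):
    """Number of positions at which two equal-length w-mers agree."""
    if len(a) != len(b):
        raise ValueError("w-mers must have the same length")
    return sum(x == y for x, y in zip(a, b))

def m_matches(seed, candidates, m):
    """Candidates that agree with the seed in at least m positions."""
    return [c for c in candidates if count_matches(seed, c) >= m]

seed = "TGTAACGT"
candidates = ["TGTAACGT", "AGTAACGT", "TGCAACAT", "TGACACGG", "AATAACAG"]
for c in candidates:
    print(c, count_matches(seed, c))
print(m_matches(seed, candidates, m=6))
```

With m = 6, only the first three candidates would be called similar to the seed.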

Page 34: Transcription factor binding motifs (part I) 10/17/07

MDscan Algorithm: Finding candidate motifs

[Figure: Seed1 m-matches. Axis: Significance of differential gene expression.]

Page 35: Transcription factor binding motifs (part I) 10/17/07

MDscan Algorithm: Finding candidate motifs

[Figure: Seed2 m-matches. Axis: Significance of differential gene expression.]

Page 36: Transcription factor binding motifs (part I) 10/17/07

MDscan Algorithm: Scoring candidate motifs

• Maximum a posteriori (MAP) score function:

  (Score annotations: motif signal; abundant positions; conserved; specific, i.e. unlikely in genome background)

• Prefer conserved motifs with many sites that are not often seen in the genome background

• Keep the best 30-50 candidate motifs

Page 37: Transcription factor binding motifs (part I) 10/17/07

MDscan Algorithm: Update motifs with remaining seqs

[Figure: Seed1 m-matches. Axis: Significance of differential gene expression.]

Page 38: Transcription factor binding motifs (part I) 10/17/07

MDscan Algorithm: Refine the motifs

[Figure: Seed1 m-matches. Axis: Significance of differential gene expression.]

Page 39: Transcription factor binding motifs (part I) 10/17/07

MDscan Algorithm

• Check high signal/background ratio sequences first; they are more likely to yield the correct motif

• Algorithm summary:
  – Seed with w-mers in the top sequences, find m-matches to make the matrix
  – Keep good motifs to be updated by the remaining sequences
  – Refine motifs by removing bad sites

• Can check motifs of any width very fast
  – Only consider existing w-mers, finite dataset
  – Seed in top sequences: O(n^2)
  – Update motifs with all sequences: O(n)

Page 40: Transcription factor binding motifs (part I) 10/17/07

Word enumeration

YMF (Sinha and Tompa 2002)

• Search in ALL possible w-mers. For each w-mer, calculate a z-score measuring whether it is over-represented in the selected sequences vs. the background.
• Rank the words by the z-score.
• Select the top ones.

Advantage:
• Global optimum

Drawback:
• Computational time grows exponentially with w, so it can only be used to search for short motifs (6- to 10-mers).
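The enumeration idea can be sketched by scoring all 4^w words against a simple null model. The version below is a deliberately simplified stand-in for YMF: it assumes an independent-base background and a binomial variance, whereas a real tool would use a Markov background and account for overlapping occurrences; the function name and toy sequences are hypothetical.

```python
import math
from itertools import product

ALPHABET = "ACGT"

def word_zscores(sequences, w, background):
    """z-score of every possible w-mer: (observed - expected) / sd,
    under an assumed independent-base background and binomial counts."""
    n_positions = sum(max(len(s) - w + 1, 0) for s in sequences)
    observed = {}
    for s in sequences:
        for i in range(len(s) - w + 1):
            word = s[i:i + w]
            observed[word] = observed.get(word, 0) + 1
    scores = {}
    for letters in product(ALPHABET, repeat=w):  # all 4**w words
        word = "".join(letters)
        p = math.prod(background[b] for b in word)
        expected = n_positions * p
        sd = math.sqrt(n_positions * p * (1 - p))
        scores[word] = (observed.get(word, 0) - expected) / sd
    # Rank the words by z-score, highest first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy sequences and uniform background, for illustration only.
seqs = ["TTGCGAATCA", "ACGCGAATGT", "CATGCGAACT", "GCGAATTACC"]
uniform = {b: 0.25 for b in ALPHABET}
ranked = word_zscores(seqs, w=5, background=uniform)
print(ranked[:3])
```

The loop over `product(ALPHABET, repeat=w)` is where the exponential cost in w shows up: w = 5 means 1,024 words, while w = 10 already means over a million.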