
Motif finding methods and algorithms


Page 1: Motif finding methods and algorithms

Motif finding methods and algorithms

Given a set of n promoters of n coregulated genes, find a motif common to the promoters.
Both the PWM and the motif sequences are unknown.

Common methods:

1. Enumeration:
   Simplest case: look at the frequency of all n-mers
   * Finds the global optimum, since the entire space can be searched

2. EM algorithms (MEME):
   Iteratively home in on the most likely motif model

3. Gibbs sampling methods (AlignAce, BioProspector):
   Iteratively replace (‘sample’) sites to retrain the matrix
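The enumeration approach can be sketched in a few lines. This is only an illustration of counting every n-mer in a set of sequences; the function name `count_kmers` and the toy promoter strings are made up for the example and do not come from the slides.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count every overlapping k-mer across all input sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

# Toy promoters (hypothetical), each containing GATGAC once.
promoters = ["ACGTGATGAC", "TTGATGACCA", "GGGATGACTT"]
counts = count_kmers(promoters, 6)
print(counts.most_common(3))
```

Because every n-mer is visited, the most frequent one is a true global optimum for this simple frequency criterion, but the cost grows quickly with motif width and with any allowance for mismatches.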


Page 2: Motif finding methods and algorithms

New MEME tools: http://meme.ebi.edu.au/meme/intro.html


http://meme.sdsc.edu/meme/doc/fasta-get-markov.html

(create your own Nth-order Markov background model)

Page 3: Motif finding methods and algorithms

Motif finding using the EM algorithm MEME (Bailey & Elkan 1995)

http://meme.sdsc.edu/meme/intro.html

EM algorithm: Expectation-Maximization
In one run, trains the matrix model and identifies examples of the matrix

MEME works by iteratively refining the matrix and identifying sites:

1. Estimate motif model
   a. Start with an n-mer seed (random or specified)
   b. Build a matrix by incorporating some of the background frequencies

2. Identify examples of the model
   a. For every n-mer in the input set, identify its probability given the matrix model

3. Re-estimate the motif model
   a. Calculate a new matrix, based on the weighted frequencies of all n-mers in the set

4. Iteratively refine the matrix and identify sites until convergence.
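The four steps above can be sketched as a small loop. This is a simplified illustration, not MEME's actual implementation: it assumes a uniform background of 0.25 per base, normalises the n-mer weights within each sequence (roughly a one-site-per-sequence assumption), and uses a seed, `seed_weight`, `pseudocount` and tolerance that are all choices made for this example.

```python
import math

BASES = "GATC"  # same row order as the matrices on the slides

def seed_matrix(seed, seed_weight=0.5):
    """Step 1: blend a seed n-mer with a uniform background (0.25 per base)."""
    matrix = []
    for base in seed:
        column = {b: (1 - seed_weight) * 0.25 for b in BASES}
        column[base] += seed_weight
        matrix.append(column)
    return matrix

def em_iteration(sequences, matrix, pseudocount=0.1):
    """Steps 2 and 3: weight every n-mer by its fit, then rebuild the matrix."""
    width = len(matrix)
    counts = [{b: pseudocount for b in BASES} for _ in range(width)]
    for seq in sequences:
        sites = [seq[i:i + width] for i in range(len(seq) - width + 1)]
        # Step 2: probability of each n-mer given the current matrix
        probs = [math.prod(col[b] for b, col in zip(site, matrix)) for site in sites]
        total = sum(probs)
        # Step 3: each n-mer contributes its bases in proportion to its weight
        for site, p in zip(sites, probs):
            for pos, base in enumerate(site):
                counts[pos][base] += p / total
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]

def run_em(sequences, seed, max_iter=200, tol=1e-6):
    """Step 4: repeat until the matrix stops changing."""
    matrix = seed_matrix(seed)
    for _ in range(max_iter):
        updated = em_iteration(sequences, matrix)
        change = max(abs(updated[i][b] - matrix[i][b])
                     for i in range(len(matrix)) for b in BASES)
        matrix = updated
        if change < tol:
            break
    return matrix

# Toy input; the seed "GATGAC" is an arbitrary choice for this sketch.
sequences = [
    "GGCTATTGCAGATGACGAGATGAGGCCCAGACC",
    "GGATGACTTATATAAAGGACGATAAGAGATGAC",
    "CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC",
    "GATGACGGAGTATTAAAGACTCGATGAGTTATACGA",
]
motif = run_em(sequences, seed="GATGAC")
for base in BASES:
    print(base, " ".join(f"{col[base]:.2f}" for col in motif))
```

The pseudocount keeps every matrix entry above zero, so no n-mer ever receives a probability of exactly zero and the per-sequence normalisation is always defined.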


Page 4: Motif finding methods and algorithms

S1: GGCTATTGCAGATGACGAGATGAGGCCCAGACC

S2: GGATGACTTATATAAAGGACGATAAGAGATGAC

S3: CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC

S4: GATGACGGAGTATTAAAGACTCGATGAGTTATACGA

1. MEME uses an initial EM heuristic to estimate the best starting-point matrix:

G  0.26  0.24  0.18  0.26  0.25  0.26
A  0.24  0.26  0.28  0.24  0.25  0.22
T  0.25  0.23  0.30  0.25  0.25  0.25
C  0.25  0.27  0.24  0.25  0.25  0.27

Problem: find a 6-mer motif in 4 sequences


Page 5: Motif finding methods and algorithms

GGCTATTGCATATGACGAGATGAGGCCCAGACC

GGATGACTTATATAAAGGACCGTGATAAGAGATTAC

CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC

GATGACGGAGTATTAAAGACTCGATGAGTTATACGA

2. MEME scores the match of all 6-mers to current matrix

Here, just consider the underlined 6-mers, although in reality all 6-mers are scored.
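As an illustration of this scoring step, the sketch below multiplies the matrix probabilities along every 6-mer of one sequence. The matrix values are the starting-point matrix from the previous slide; the function name `score_6mers` and the choice of a plain probability product as the match score are assumptions made for this example.

```python
# Starting-point matrix from the slide, one row per base, one entry per position.
START = {
    "G": [0.26, 0.24, 0.18, 0.26, 0.25, 0.26],
    "A": [0.24, 0.26, 0.28, 0.24, 0.25, 0.22],
    "T": [0.25, 0.23, 0.30, 0.25, 0.25, 0.25],
    "C": [0.25, 0.27, 0.24, 0.25, 0.25, 0.27],
}

def score_6mers(seq, matrix, width=6):
    """Return (start, 6-mer, P(6-mer | matrix)) for every window in seq."""
    scores = []
    for i in range(len(seq) - width + 1):
        site = seq[i:i + width]
        p = 1.0
        for pos, base in enumerate(site):
            p *= matrix[base][pos]
        scores.append((i, site, p))
    return scores

seq = "GGCTATTGCATATGACGAGATGAGGCCCAGACC"  # first sequence on this slide
scores = score_6mers(seq, START)
for start, site, p in sorted(scores, key=lambda s: s[2], reverse=True)[:3]:
    print(start, site, f"{p:.6f}")
```

With a nearly flat starting matrix the scores differ only slightly between 6-mers, which is why several rounds of re-estimation are needed before a motif stands out.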


Page 6: Motif finding methods and algorithms

GGCTATTGCATATGACGAGATGAGGCCCAGACC

GGATGACTTATATAAAGGACCGTGATAAGAGATTAC

CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC

GATGACGGAGTATTAAAGACTCGATGAGTTATACGA

2. MEME scores the match of all 6-mers to current matrix

3. Re-estimate the matrix based on the weighted contribution of all 6-mers

G  0.29  0.24  0.17  0.27  0.24  0.30
A  0.22  0.26  0.27  0.22  0.28  0.18
T  0.24  0.23  0.33  0.23  0.24  0.28
C  0.24  0.27  0.23  0.28  0.24  0.24

The height of the bases above corresponds to how much that 6-mer counts in calculating the new matrix


Page 7: Motif finding methods and algorithms

GGCTATTGCATATGACGAGATGAGGCCCAGACC

GGATGACTTATATAAAGGACCGTGATAAGAGATTAC

CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC

GATGACGGAGTATTAAAGACTCGATGAGTTATACGA

MEME scores the match of all 6-mers to current matrix


Page 8: Motif finding methods and algorithms

GGCTATTGCATATGACGAGATGAGGCCCAGACC

GGATGACTTATATAAAGGACCGTGATAAGAGATTAC

CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC

GATGACGGAGTATTAAAGACTCGATGAGTTATACGA

Re-estimate the matrix based on the weighted contribution of all 6-mers

G  0.40  0.20  0.15  0.42  0.24  0.30
A  0.30  0.30  0.20  0.24  0.46  0.18
T  0.15  0.30  0.45  0.16  0.15  0.28
C  0.15  0.20  0.20  0.16  0.15  0.24

The height of the bases above corresponds to how much that 6-mer counts in calculating the new matrix


Page 9: Motif finding methods and algorithms

GGCTATTGCATATGACGAGATGAGGCCCAGACC

GGATGACTTATATAAAGGACCGTGATAAGAGATTAC

CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC

GATGACGGAGTATTAAAGACTCGATGAGTTATACGA

MEME scores the match of all 6-mers to current matrix

Iterations continue until convergence (i.e., the numbers don’t change much between iterations)


Page 10: Motif finding methods and algorithms

Final motif

G  0.85  0.05  0.10  0.80  0.20  0.35
A  0.05  0.60  0.10  0.05  0.60  0.10
T  0.05  0.30  0.70  0.05  0.20  0.10
C  0.05  0.05  0.10  0.10  0.10  0.35


Page 11: Motif finding methods and algorithms

MEME uses the final matrix to identify examples of the motif by log-likelihood ratio (LLR)

S1: GGCTATTGCAGATGACGAGATGAGGCCCAGACC

S2: GGATGACTTATATAAAGGACGATAAGAGATGAC

S3: CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC

S4: GATGACGGAGTATTAAAGACTCGATGAGTTATACGA

Final motif

G  0.85  0.05  0.10  0.80  0.20  0.35
A  0.05  0.60  0.10  0.05  0.60  0.10
T  0.05  0.30  0.70  0.05  0.20  0.10
C  0.05  0.05  0.10  0.10  0.10  0.35
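A minimal sketch of LLR scanning with the final matrix follows. The matrix values are copied from the slide as they appear (columns 5 and 6 do not sum exactly to 1 there). The uniform background of 0.25 per base and the choice to report only the single best 6-mer per sequence are assumptions made for this example, not a description of MEME's own site-calling rule.

```python
import math

# Final motif from the slide, copied verbatim: one row per base.
FINAL = {
    "G": [0.85, 0.05, 0.10, 0.80, 0.20, 0.35],
    "A": [0.05, 0.60, 0.10, 0.05, 0.60, 0.10],
    "T": [0.05, 0.30, 0.70, 0.05, 0.20, 0.10],
    "C": [0.05, 0.05, 0.10, 0.10, 0.10, 0.35],
}
BACKGROUND = 0.25  # assumed uniform background probability per base

def llr(site, matrix=FINAL, background=BACKGROUND):
    """Sum over positions of log2(P(base | motif) / P(base | background))."""
    return sum(math.log2(matrix[base][pos] / background)
               for pos, base in enumerate(site))

def best_site(seq, width=6):
    """Return (LLR, start, 6-mer) for the highest-scoring window in seq."""
    candidates = [(llr(seq[i:i + width]), i, seq[i:i + width])
                  for i in range(len(seq) - width + 1)]
    return max(candidates, key=lambda c: c[0])

sequences = {
    "S1": "GGCTATTGCAGATGACGAGATGAGGCCCAGACC",
    "S2": "GGATGACTTATATAAAGGACGATAAGAGATGAC",
    "S3": "CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC",
    "S4": "GATGACGGAGTATTAAAGACTCGATGAGTTATACGA",
}
hits = {name: best_site(seq) for name, seq in sequences.items()}
for name, (score, start, site) in hits.items():
    print(name, start, site, f"{score:.2f}")
```

A positive LLR means the 6-mer is better explained by the motif than by the background; in practice a score threshold, rather than a single best hit, would decide which 6-mers count as sites.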


Page 12: Motif finding methods and algorithms

Choice of parameters significantly affects the algorithm:

-- motif width w
-- motif model:
   - “zoops” = zero or one motif occurrence per promoter sequence*
   - “oops” = exactly one motif occurrence per promoter sequence*
   - “ans” (“any number of sites”; MEME's own name for this mode is “anr”, any number of repetitions) = two-component mixture model (i.e., each w-mer sequence is either an example of the background model or the motif model)
-- background model:
   - simplest case: genomic nucleotide frequencies P(G), P(A), P(T), P(C)
   - nth-order Markov chain (e.g., a 1st-order Markov chain uses P(A_i | C_i-1), estimated from dinucleotide frequencies such as P(CA))

*These models keep track of which input sequence (promoter) the motif came from, whereas ‘ans’ throws all “w-mers” into a bag.
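To make the background-model options concrete, the sketch below estimates a 0th-order model (single-nucleotide frequencies) and a 1st-order Markov model (each base conditioned on the one before it, which is what dinucleotide counts provide). The function names, the pseudocount and the toy sequences are assumptions for this example; it is not the output format of `fasta-get-markov`.

```python
from collections import Counter, defaultdict

BASES = "ACGT"

def zero_order(sequences):
    """Single-nucleotide frequencies P(A), P(C), P(G), P(T)."""
    counts = Counter("".join(sequences))
    total = sum(counts[b] for b in BASES)
    return {b: counts[b] / total for b in BASES}

def first_order(sequences, pseudocount=1.0):
    """P(next base | previous base), estimated from dinucleotide counts."""
    pair_counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[prev][nxt] += 1
    model = {}
    for prev in BASES:
        total = sum(pair_counts[prev][b] + pseudocount for b in BASES)
        model[prev] = {b: (pair_counts[prev][b] + pseudocount) / total
                       for b in BASES}
    return model

toy = ["ACAC", "CA"]  # hypothetical sequences, just to exercise the functions
freqs = zero_order(toy)
markov = first_order(toy)
print(freqs)
print(markov["A"])
```

The pseudocount prevents zero probabilities for dinucleotides that happen not to occur in a small training set; higher-order models need correspondingly more sequence to estimate reliably.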


Page 13: Motif finding methods and algorithms

Assessing the biological relevance of identified motifs

Keep an eye on these features:

1. Bit score (or normalized bit score)
   Bit score = information content at each position

2. Information content profile
   Real TF binding sites typically show smooth IC profiles

3. Number of input sequences that contain the motif
   Overfitting: great-looking motif, but found in only a few of the input sequences

4. Nucleotide frequencies
   E.g., in yeast, AT-rich sequences are common … doesn’t necessarily mean they’re not real binding sites

5. Enrichment of the motif in the training set compared to the genomic background
   Our old friend, the hypergeometric distribution.

6. Finding the same consensus with different models or methods

7. Any other nonrandom observation can give you confidence
   (palindromic motif, nonrandom distribution of motifs in input sequences, etc.)
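Two of the checks above lend themselves to a short calculation: the information content of a matrix column (items 1 and 2) and the hypergeometric enrichment test (item 5). The sketch below assumes a uniform background for the information content, and the gene counts in the enrichment example are invented purely for illustration.

```python
import math
from math import comb

def information_content(column):
    """Bits of information in one column, assuming a uniform background.

    IC = 2 + sum_a p_a * log2(p_a); 2 bits for a fully conserved base,
    0 bits for a uniform column.
    """
    return 2.0 + sum(p * math.log2(p) for p in column if p > 0)

def enrichment_p_value(k, population, with_motif, sample):
    """Hypergeometric P(X >= k): chance of seeing at least k motif-containing
    sequences in a sample drawn from a population in which with_motif
    sequences contain the motif."""
    total = comb(population, sample)
    upper = min(with_motif, sample)
    favourable = sum(comb(with_motif, i) * comb(population - with_motif, sample - i)
                     for i in range(k, upper + 1))
    return favourable / total

# Column 1 of the final motif (G, A, T, C order).
print(f"{information_content([0.85, 0.05, 0.05, 0.05]):.2f} bits")

# Hypothetical numbers: 6000 promoters genome-wide, 300 contain the motif,
# and 8 of the 20 training promoters contain it.
p = enrichment_p_value(k=8, population=6000, with_motif=300, sample=20)
print(f"p = {p:.2e}")
```

Summing the per-column values over the motif gives a total information content, and plotting them position by position gives the IC profile mentioned in item 2.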


Page 14: Motif finding methods and algorithms

Comparing matrices and motifs

TomTom

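1. Pick a scoring function
2. Calculate the score for query matrix Q against ALL matrices in the database
3. Use those scores to estimate a distribution of scores, to turn each score into a p-value
4. Multiple-testing correction turns the p-value into an E-value (p-value × number of database matrices); FDR gives the related q-value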


Page 15: Motif finding methods and algorithms

Comparing matrices and motifs

Scoring functions: score each COLUMN being compared
Column X of Motif Q vs. Column Y of Motif T

     1    2    3
G  0.3  0.7  0.3
A  0.1  0.1  0.1
T  0.5  0.1  0.3
C  0.1  0.1  0.3

     1    2    3
G  0.1  0.6  0.4
A  0.1  0.1  0.1
T  0.7  0.2  0.4
C  0.1  0.1  0.1

X_a = P(base a) in column X of Q
Y_a = P(base a) in column Y of T


Page 16: Motif finding methods and algorithms

Comparing matrices and motifs

Scoring functions: score each COLUMN being compared
Column X of Motif Q vs. Column Y of Motif T

Sum of P(base a) over all a = 1
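The slide's own scoring formulas did not survive in this transcript, so the sketch below shows two commonly used column-comparison functions, Euclidean distance and Pearson correlation, as stand-ins. The columns are read off the two 3-column example matrices on the previous slide (G, A, T, C order); treating the first matrix as Q and the second as T is an assumption.

```python
import math

# Columns in G, A, T, C order. Which matrix is Q and which is T is assumed.
Q = [[0.3, 0.1, 0.5, 0.1], [0.7, 0.1, 0.1, 0.1], [0.3, 0.1, 0.3, 0.3]]
T = [[0.1, 0.1, 0.7, 0.1], [0.6, 0.1, 0.2, 0.1], [0.4, 0.1, 0.4, 0.1]]

def euclidean_distance(x, y):
    """Smaller is more similar; 0 means identical columns."""
    return math.sqrt(sum((xa - ya) ** 2 for xa, ya in zip(x, y)))

def pearson_correlation(x, y):
    """Larger is more similar; 1 means perfectly correlated columns.

    Undefined (division by zero) if either column is perfectly uniform.
    """
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xa - mean_x) * (ya - mean_y) for xa, ya in zip(x, y))
    var_x = sum((xa - mean_x) ** 2 for xa in x)
    var_y = sum((ya - mean_y) ** 2 for ya in y)
    return cov / math.sqrt(var_x * var_y)

for i, (x, y) in enumerate(zip(Q, T), start=1):
    print(f"column {i}: distance = {euclidean_distance(x, y):.3f}, "
          f"correlation = {pearson_correlation(x, y):.3f}")
```

Because each column sums to 1, the mean of every column is 0.25, which is why the correlation is undefined only for a completely uninformative column.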


Page 17: Motif finding methods and algorithms

Motif Q: Column X = 1 … X = 14

Motif T: Column Y = 1 … Y = 13


Alignment of two matrices
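One simple way to align two matrices is to slide one along the other, sum the column scores over the overlapping columns at each offset, and keep the best offset. The sketch below does that under stated assumptions: the column similarity (1 minus half the summed absolute difference), the `min_overlap` parameter and the toy matrices are all choices made for this example and are not Tomtom's actual procedure.

```python
def column_similarity(x, y):
    """1 for identical columns, 0 for columns with no probability in common."""
    return 1.0 - sum(abs(xa - ya) for xa, ya in zip(x, y)) / 2.0

def align(q_cols, t_cols, min_overlap=2):
    """Try every offset of Q against T; return (score, offset, overlap).

    Column i of Q is paired with column i + offset of T.
    """
    best = None
    first = -(len(q_cols) - min_overlap)
    last = len(t_cols) - min_overlap
    for offset in range(first, last + 1):
        pairs = [(q_cols[i], t_cols[i + offset])
                 for i in range(len(q_cols))
                 if 0 <= i + offset < len(t_cols)]
        score = sum(column_similarity(x, y) for x, y in pairs)
        if best is None or score > best[0]:
            best = (score, offset, len(pairs))
    return best

# Toy matrices (G, A, T, C order): T is Q with two uniform columns in front.
uniform = [0.25, 0.25, 0.25, 0.25]
q = [[0.3, 0.1, 0.5, 0.1], [0.7, 0.1, 0.1, 0.1], [0.3, 0.1, 0.3, 0.3]]
t = [uniform, uniform] + q
best = align(q, t)
print(best)
```

A raw sum of column scores favours longer overlaps, so scores from alignments of different lengths are not directly comparable; converting scores to p-values, as in the TomTom steps, is one way to address that.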