Bioinformatics Finding signals and motifs in DNA and proteins Expectation Maximization Algorithm MEME The Gibbs sampler Lecture 10

Bioinformatics

• Finding signals and motifs in DNA and proteins

• Expectation Maximization Algorithm

• MEME

• The Gibbs sampler

Lecture 10

• An alignment of sequences is intrinsically connected with another essential task: finding signals and motifs (highly conserved ungapped blocks) shared by some of the sequences.

• A motif is a sequence pattern that occurs repeatedly in a group of related protein or DNA sequences. Motifs are represented as position-dependent scoring matrices that describe the score of each possible letter at each position in the pattern.

• Another related task is searching biological databases for sequences that contain one or more known motifs.

• These objectives are critical in the analysis of genes and proteins, as any gene or protein contains a set of different motifs and signals. Complete knowledge of the locations and structure of such motifs and signals leads to a comprehensive description of a gene or protein and indicates a potential function.
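As an illustration of a position-dependent scoring matrix, the following minimal Python sketch builds a log-odds matrix from a few aligned, ungapped sites and slides it along a sequence. The example sites, the uniform background, the pseudocount of 1 and the function names (`build_pssm`, `score`, `best_match`) are assumptions made for illustration and do not come from the lecture.

```python
import math

def build_pssm(sites, alphabet="ACGT", background=None, pseudocount=1.0):
    """Build a log-odds scoring matrix from aligned, ungapped sites."""
    width = len(sites[0])
    if background is None:
        # Assumption: uniform background frequencies.
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    pssm = []
    for col in range(width):
        counts = {a: pseudocount for a in alphabet}
        for site in sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        # Score = log2(frequency in the motif column / background frequency).
        pssm.append({a: math.log2((counts[a] / total) / background[a])
                     for a in alphabet})
    return pssm

def score(pssm, segment):
    """Sum the per-position scores of a segment as wide as the motif."""
    return sum(col[letter] for col, letter in zip(pssm, segment))

def best_match(pssm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pssm)
    return max((score(pssm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))

# Hypothetical aligned sites, chosen only to make the example concrete.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
pssm = build_pssm(sites)
best_score, best_pos = best_match(pssm, "GGCTATAATGCC")
print(best_pos, round(best_score, 2))
```

Scanning a database for a known motif amounts to repeating `best_match` (or a thresholded variant of it) over every sequence in the database.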

Finding signals and motifs in DNA and proteins

• eMOTIF is a very useful method for identifying motifs in proteins.

• A multiple sequence alignment (MSA) of a particular set of proteins is submitted to eMOTIF, which essentially searches for consensus sequence(s) and identifies the conserved motifs.

• The probability of a motif is estimated from the frequencies of the individual amino acids in the Swiss-Prot DB, as the product of the probabilities of each position in the consensus.

• The result could be as follows: "This motif matches 25 out of the 30 sequences supplied. It will match 1 in 10^19 random sequences, or less than 1 sequence in the current SWISS-PROT database."

• The motif can then be searched for in the Swiss-Prot DB.
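The product rule for the motif probability can be sketched in Python as follows. This is not eMOTIF's own code: the residue frequencies are made-up illustrative values (not actual Swiss-Prot statistics), the four-position motif is hypothetical, and the helper names `motif_probability` and `expected_matches` are assumptions.

```python
from math import prod

# Illustrative residue frequencies (assumed values, NOT real Swiss-Prot data).
FREQ = {"A": 0.08, "C": 0.02, "G": 0.07, "H": 0.02,
        "I": 0.06, "L": 0.10, "V": 0.07, "W": 0.01}

def motif_probability(motif, freq):
    """Probability that a random segment matches the motif.

    Each motif position is a string of allowed residues; its probability is
    the summed frequency of those residues, and the positions are multiplied.
    """
    return prod(sum(freq[r] for r in allowed) for allowed in motif)

def expected_matches(probability, n_positions):
    """Expected number of chance matches among n_positions candidate windows."""
    return probability * n_positions

motif = ["C", "ILV", "H", "W"]          # hypothetical pattern C-[ILV]-H-W
p = motif_probability(motif, FREQ)
print(p, expected_matches(p, 1_000_000))
```

A statement such as "it will match 1 in 10^19 random sequences" corresponds to a very small value of `p`; multiplying by the number of candidate positions in the database gives the expected number of chance hits.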

The eMOTIF method of motif analysis

eMOTIF

True positives

eMOTIF: searching the DB for sequences that contain a given motif

• The EM algorithm is used to identify conserved regions in unaligned DNA and protein sequences.

• Assume that a set of sequences is expected to share a common sequence pattern.

• An initial guess is made as to the location and size of the site of interest in each of the sequences, and these parts are loosely aligned.

• This alignment provides an estimate of the base or amino acid (aa) composition of each column in the site.

• The EM algorithm consists of two steps, which are repeated in alternation.

• Step 1, the expectation step: the column-by-column composition of the site is used to estimate the probability of finding the site at any position in each of the sequences. These probabilities are then used to provide new information on the expected base or aa distribution for each column in the site.

• Step 2, the maximization step: the new counts of bases or aa for each position in the site, found in step 1, are substituted for the previous set.

Expectation Maximization (EM) Algorithm


[Figure: each sequence is drawn as a row of O (positions outside the site) and X (positions in the site); the X blocks are aligned to form the preliminary site columns.]

Columns defined by a preliminary alignment of the sequences provide initial estimates of frequencies of aa in each motif column

Bases   Background   Site column 1   Site column 2   ……
G       0.27         0.4             0.1             ……
C       0.25         0.4             0.1             ……
A       0.25         0.2             0.1             ……
T       0.23         0.2             0.7             ……
Total   1.00         1.00            1.00            ……

Columns not in motif provide background frequencies
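The way a preliminary alignment yields these initial estimates can be sketched in Python. The three short sequences, the guessed start positions and the function name `initial_estimates` are hypothetical and serve only to show the counting; they are not the data behind the table above.

```python
from collections import Counter

def initial_estimates(seqs, starts, width, alphabet="GCAT"):
    """Estimate background and per-column site frequencies from guessed sites.

    starts[i] is the guessed 0-based start of the site in seqs[i].
    Positions inside the site feed the site columns; all others feed
    the background.
    """
    site_counts = [Counter() for _ in range(width)]
    background_counts = Counter()
    for seq, start in zip(seqs, starts):
        for i, base in enumerate(seq):
            if start <= i < start + width:
                site_counts[i - start][base] += 1
            else:
                background_counts[base] += 1

    def normalize(counter):
        total = sum(counter.values())
        return {a: counter[a] / total for a in alphabet}

    return normalize(background_counts), [normalize(c) for c in site_counts]

# Hypothetical input: a site of width 2 guessed at position 2 in each sequence.
seqs = ["GGATTC", "CCATGA", "TTACGG"]
background, site_cols = initial_estimates(seqs, starts=[2, 2, 2], width=2)
print(background)
print(site_cols)
```

Each returned dictionary sums to 1, as a column of frequencies should.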


The resulting score gives the likelihood that the motif matches position A, B or another position in seq 1. Repeat for all other positions and find the most likely location. Then repeat for the remaining sequences.

[Figure: the site window (XXXX) placed at position A, the first position of the sequence, and at position B, shifted one position to the right.]

Use the previous estimates of aa or nucleotide frequencies for each column in the motif to calculate the probability of the motif in this position, and multiply by the background frequencies in the remaining positions.

• Assume that seq1 is 100 bases long and the length of the site is 20 bases.

• Suppose that the site starts in column 1 and the first two positions are A and T.

• The site will end at position 20, and positions 21 and 22 do not belong to the site. Assume that these two positions are also A and T.

• The probability of this location of the site in seq1 is given by

Psite1,seq1 = 0.2 (for A in position 1) × 0.7 (for T in position 2) × Ps (for the next 18 positions in the site) × 0.25 (for A in the first flanking position) × 0.23 (for T in the second flanking position) × Ps (for the next 78 flanking positions).

• The same procedure is applied to calculate the probabilities Psite2,seq1 to Psite81,seq1 (a 20-base site can start at any of 100 - 20 + 1 = 81 positions), thus providing a comparative set of probabilities for the site location.

• The probability of the best location in seq1, say at site k, is the ratio of the site probability at k to the sum of all the site probabilities.

• Then the procedure is repeated for all the other sequences.
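The expectation-step calculation for one sequence can be sketched in Python as below. The column frequencies are illustrative assumptions for a site of width 2 (they reuse 0.2 for A in column 1 and 0.7 for T in column 2 from the example, with the remaining values chosen so that each column sums to 1), and the name `site_posteriors` is hypothetical.

```python
def site_posteriors(seq, site_cols, background):
    """Return P(site starts at each position) for one sequence.

    For every possible start, multiply the site-column frequencies inside
    the window by the background frequencies outside it, then divide each
    product by the sum over all starts.
    """
    width = len(site_cols)
    likelihoods = []
    for start in range(len(seq) - width + 1):
        p = 1.0
        for i, base in enumerate(seq):
            if start <= i < start + width:
                p *= site_cols[i - start][base]   # position inside the site
            else:
                p *= background[base]             # flanking position
        likelihoods.append(p)
    total = sum(likelihoods)
    return [p / total for p in likelihoods]

# Illustrative frequencies (assumed), for a site of width 2.
BACKGROUND = {"G": 0.27, "C": 0.25, "A": 0.25, "T": 0.23}
SITE_COLS = [
    {"G": 0.3, "C": 0.3, "A": 0.2, "T": 0.2},
    {"G": 0.1, "C": 0.1, "A": 0.1, "T": 0.7},
]
post = site_posteriors("GATG", SITE_COLS, BACKGROUND)
print([round(p, 3) for p in post])
```

For long sequences the raw products become very small, so a practical implementation would probably work with sums of logarithms instead; the plain product is kept here to mirror the calculation in the text.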

EM Algorithm 1st (expectation) step: calculations

• The site probabilities for each seq calculated at the 1st step are then used to create a new table of expected values for base counts for each of the site positions using the site probabilities as weights.

• Suppose that P (site 1 in seq 1) = Psite1,seq1 / (Psite1,seq1 + Psite2,seq1 + … + Psite81,seq1) = 0.01 and P (site 2 in seq 1) = 0.02.

• Then these values are added to the previous table as shown in the table below.

• This procedure is repeated for every other possible starting position in seq1, and then the process continues for all the other sequences, resulting in a new version of the table.

• The expectation and maximization steps are repeated until the estimates of base frequencies do not change.
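The weighted re-estimation can be sketched in Python as follows. This is a minimal version that builds the new site columns from the weighted counts alone (following the description of the maximization step as substituting the new counts for the previous set) and leaves out the background update. The function name `reestimate_site_columns`, the optional pseudocount and the toy input are assumptions.

```python
def reestimate_site_columns(seqs, posteriors, width, alphabet="GCAT",
                            pseudocount=0.0):
    """Re-estimate site-column frequencies from weighted base counts.

    posteriors[i][s] is the probability that the site in seqs[i] starts at
    position s. Every possible site contributes its bases to the column
    counts, weighted by that probability.
    """
    counts = [{a: pseudocount for a in alphabet} for _ in range(width)]
    for seq, post in zip(seqs, posteriors):
        for start, weight in enumerate(post):
            for j in range(width):
                counts[j][seq[start + j]] += weight
    columns = []
    for col in counts:
        total = sum(col.values())
        columns.append({a: col[a] / total for a in alphabet})
    return columns

# Toy input: one sequence with two possible starts for a site of width 2.
new_cols = reestimate_site_columns(["ATG"], [[0.6, 0.4]], width=2)
print(new_cols)
```

Alternating this function with a computation of the site probabilities, until the column frequencies stop changing, gives the full iteration described above.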

EM Algorithm 2nd (maximization) step: calculations

Bases            Background   Site column 1   Site column 2   ……
G                0.27 + …     0.4 + …         0.1 + …         ……
C                0.25 + …     0.4 + …         0.1 + …         ……
A                0.25 + …     0.2 + 0.01      0.1 + …         ……
T                0.23 + …     0.2 + …         0.7 + 0.02      ……
Total/weighted   1.00         1.00            1.00            ……

Multiple EM for Motif Elicitation - MEME

MEME: Summary Line

• This line gives the width (‘width’), number of occurrences in the training set (‘sites’), log likelihood ratio (‘llr’) and E-value of the motif. Each motif describes a pattern of a fixed width and no gaps are allowed in MEME motifs. MEME numbers the motifs consecutively from one as it finds them. MEME usually finds the most statistically significant (low E-value) motifs first.

• The statistical significance of a motif is based on its log likelihood ratio, its width and number of occurrences, the background letter frequencies (given in the command line summary), and the size of the training set.

• The E-value is an estimate of the expected number of motifs with the given log likelihood ratio (or higher), and with the same width and number of occurrences, that one would find in a similarly sized set of random sequences. (In random sequences each position is independent with letters chosen according to the background letter frequencies.)

• The log likelihood ratio is the logarithm of the ratio of the probability of the occurrences of the motif given the motif model (likelihood given the motif) versus their probability given the background model (likelihood given the null model). (Normally the background model is a 0-order Markov model using the background letter frequencies, but higher order Markov models may be specified via the -bfile option to MEME.)

• Clicking on the buttons to the left of the motif summary line takes you to the previous motif (P) or next motif (N).
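The log likelihood ratio under a 0-order background model can be sketched in Python as below. This is an illustrative calculation and not MEME's implementation: the base-2 logarithm, the toy motif columns, the uniform background and the name `log_likelihood_ratio` are all assumptions.

```python
import math

def log_likelihood_ratio(sites, motif_cols, background):
    """Sum log2(P(letter | motif column) / P(letter | background)) over all
    letters of all motif occurrences (0-order background model)."""
    llr = 0.0
    for site in sites:
        for col, letter in zip(motif_cols, site):
            llr += math.log2(col[letter] / background[letter])
    return llr

# Toy model (assumed values): a motif of width 2 with two occurrences.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
motif_cols = [
    {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
]
llr = log_likelihood_ratio(["AC", "AC"], motif_cols, background)
print(llr)
```

A letter with frequency 0 in a motif column would make the logarithm undefined for any occurrence containing it, which is one reason motif columns are usually estimated with pseudocounts.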


MEME

MOTIF 1 width = 26 sites = 5 llr = 244 E-value = 5.0e-006
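A summary line of this shape can be read programmatically. The Python sketch below is written only against the example line shown here; the regular expression and the function name `parse_summary_line` are assumptions, and other MEME versions may format the line differently.

```python
import re

# Pattern matched against the example summary line only (an assumption
# about the layout, not a specification of MEME's output format).
SUMMARY_RE = re.compile(
    r"MOTIF\s+(?P<motif>\S+)\s+"
    r"width\s*=\s*(?P<width>\d+)\s+"
    r"sites\s*=\s*(?P<sites>\d+)\s+"
    r"llr\s*=\s*(?P<llr>[\d.]+)\s+"
    r"E-value\s*=\s*(?P<e_value>\S+)"
)

def parse_summary_line(line):
    """Return the fields of a motif summary line as a dict, or None."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    return {
        "motif": match.group("motif"),
        "width": int(match.group("width")),
        "sites": int(match.group("sites")),
        "llr": float(match.group("llr")),
        "e_value": float(match.group("e_value")),
    }

summary = parse_summary_line(
    "MOTIF 1 width = 26 sites = 5 llr = 244 E-value = 5.0e-006"
)
print(summary)
```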


• The Gibbs sampler algorithm is slightly different from the EM approach. The method also searches for the statistically most probable motifs and can find the optimal width and the number of motifs in each sequence.

• The method iterates through two steps. In the first step, a random start position for the motif is chosen in every sequence except one. These sequences are then aligned and used to make an initial guess of the motif.

• The objective of the next step is to find the most probable pattern common to the left-out sequence (and, in subsequent iterations, to all of the sequences) by sliding the motif back and forth along the sequence until the ratio of the motif probability to the background probability is at a maximum.

• Then the next sequence is left out, and the process is repeated until the residue frequencies in each motif no longer change. The number of iterations may range from several hundred to several thousand.

• Several additional statistical procedures are used to improve the performance of the algorithm. The Gibbs sampler has been used to align sequences with very little sequence similarity.
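The leave-one-out iteration can be sketched in Python as follows. This is a simplified illustration under stated assumptions: a fixed motif width, exactly one motif per sequence, background frequencies taken from the overall composition of all sequences, a pseudocount of 1, and a fixed number of iterations instead of a convergence test. The function name `gibbs_motif_sampler` and the example sequences are hypothetical.

```python
import random

def gibbs_motif_sampler(seqs, width, iterations=500, alphabet="ACGT",
                        pseudocount=1.0, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    # Random start position for the motif in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]

    # Simplification (assumption): background from overall composition.
    letters = "".join(seqs)
    denom = len(letters) + pseudocount * len(alphabet)
    background = {a: (letters.count(a) + pseudocount) / denom for a in alphabet}

    for it in range(iterations):
        out = it % len(seqs)            # index of the left-out sequence

        # Estimate motif-column frequencies from all other sequences.
        cols = []
        for j in range(width):
            counts = {a: pseudocount for a in alphabet}
            for k, (s, st) in enumerate(zip(seqs, starts)):
                if k != out:
                    counts[s[st + j]] += 1
            total = sum(counts.values())
            cols.append({a: counts[a] / total for a in alphabet})

        # Weight of each position in the left-out sequence:
        # ratio of motif probability to background probability.
        seq = seqs[out]
        weights = []
        for st in range(len(seq) - width + 1):
            w = 1.0
            for j in range(width):
                letter = seq[st + j]
                w *= cols[j][letter] / background[letter]
            weights.append(w)

        # Pick a new start at random, biased by the weights.
        starts[out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Hypothetical input sequences, used only to exercise the function.
seqs = ["ACGTTGACAGCT", "TTGACATGCCGA", "GGCATTGACACC", "CATGTTGACAAG"]
starts = gibbs_motif_sampler(seqs, width=6, iterations=200)
print(starts)
```

Because the new position is sampled and not simply set to the best-scoring one, the search can escape a poor initial guess. That is the main practical difference from the deterministic EM update.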

The Gibbs Sampler

Steps of the Gibbs sampler algorithm

[Figure: five sequences, each drawn as a row of x with the motif M at a random start position. One sequence is the outlier (the left-out sequence); the others are all sequences except the outlier.]

Random start positions are chosen. The location of the motif in each sequence provides the first estimate of motif composition.

A. Estimate the aa or base frequencies in the motif columns of all but one sequence. Also obtain the background frequencies.

B. Use the estimates from A to calculate the ratio of the probability of the motif to the background score at each position in the left-out sequence. This ratio, for each possible location in the sequence, is the weight of the position.

C. Choose a new location for the motif in the left-out sequence by a random selection, using the weights to bias the choice. This gives the estimated location of the motif in the left-out sequence.

D. Repeat steps A to C many times.

Legend: x is equal to n seq. positions; M indicates the random location of the motif in each seq.; - indicates initially aligned motif positions.
