Using microarray for identifying regulatory motifs and analysing time series gene expression Outline: Using Bayesian statistics in classifying genes in

Using microarray for identifying regulatory motifs and analysing time series gene expression

Outline:

Using Bayesian statistics in classifying genes in microarray analysis

Combining gene expression data and sequence data for transcription site discovery

Time Series Gene Expression analysis

Pietro Lió – EMBL-EBI and Dept of Computer Science, Cambridge University - [email protected] [email protected]

mailto:[email protected]





Types of questions

What genes are expressed in a particular cell type of an organism, at a particular time, under particular conditions?

Comparison of gene expression between normal and diseased (e.g., cancerous) cells.

A typical dimension of such an array is about 1 inch or less.

The spot diameter is of the order of 0.1 mm, for some microarray types can be even smaller.

One of the most popular microarray applications allows to compare gene expression levels in two different samples, e.g., the same cell type in a healthy and diseased state.

Microarray statistics can be supervised or unsupervised.

In the unsupervised case, only the expression data are available and the goal is mainly to identify distinct sets of genes with similar expressions, suggesting that they may be biologically related.

In supervised problems a response measurement is also available for each sample and the goal of the experiment is to find sets of genes that, for example, relate to different kind of diseases, so that future tissue samples can be correctly classified.

Classification of genes in microarray analysis

Bayesian Statistics

Bayesian statistics models all source of uncertainty (physical randomness, subjective opinions, prior ignorance, etc) with probability distributions and then trying to find the a posteriori distribution of all unknown variable of interest after considering the data.

Bayes, T. 1763. An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London 53:370-418. Reprinted, E. S. Pearson and M. G. Kendall (eds.). 1970. Pages 131-153 in Studies in the History of Statistics and Probability. Charles Griffin, London.

D

DD

Pr

Pr|Pr|Pr

In Bayesian statistics, evidences in favour of certain parameter θ is considered :

Inference is based on posterior distribution, Pr(θ|D) which is the conditional distribution of the parameter given the data, D. It is a combination of the prior information and the data.

The term Pr(θ) represents the prior distribution on the parameter. Prior information can be based on previous experiments, or theoretical consideration. The term Pr(D|θ) can be a likelihood and the denominator Pr(D) is a normalizing factor.

First write down the likelihood function, i.e. the probability of the observed data given the unknowns, and multiply it by the a priori distribution.

Let D denote the observed data and θ the unobserved parameter.

Through the application of the Bayes theorem we can obtain the posterior distribution:

|Pr

Pr

|Pr

|Pr

|Pr|Pr D

D

D

dD

DD

When θ is discrete, the integral is replaced by summation.

Bayesian Learning

Flat (“uninformative prior”)

Posterior is prior modified by data

0 1θ

MCMC

The most useful method to approximate the posterior probability is Monte Carlo Markov chain (MCMC).

MCMC, ……. …consider a robot random walking in a landscape

with hills…(these hills will be the genes involved in a disease or a set of regulatory regions but you will see later)

Imagine a bivariate normal hills and the Markov chain represents the positions of steps taken by a robot.

RULE: The step would be rejected if the elevation at proposed new location is less than that of the current location, and the uniform random number drawn was greater than the ratio of new elevation to old elevation.

hills A proposed move is always accepted if it is “uphill” and is often accepted if it is not too much “downhill” from the current position.

The inner and the outer circles represent 50% and 95% probability contours for the bivariate normal density

The robot visits points in proportion to their height. Since the height represents the value of the probability density function, we can simply count the dots in order to obtain an approximation of the probability corresponding to a given area. In other words, the necessary integration of the density function is approximated by the MCMC run.

BURN IN: the initial steps (the “burn in”) are often discarded in a MCMC run.

A proposed move is always accepted if it is “uphill” and is often accepted if it is not too much “downhill” from the current position.

Formally, in the Metropolis-Hastings (Metrolis et al., 1953, Hastings, 1970) algorithm, a move from θ to θ* is accepted with probability min(r,1), where

|*Pr|Pr

*|Pr|*Pr

D

Dr

Longer runs of the chain result in better approximations to posterior probabilities, and stopping too soon can have serious consequences (chain may not have been visited large portions of the parameters space).

Below is a plot showing the natural logaritm of the posterior probability density (i.e. the elevation) for each point visited by the robot. These are often called “history plots” .

Plots like this are often made for MCMC runs to see how well the chain is “mixing”.

Mixing and MCMCMC

Setting the step size too small runs the risk of missing the big picture: the robot may explore one hill in the landscape very well but not take steps large enough to ever traverse the gap between hills.

MCMCMC (MC)3

The solution is to run several Markov chains simultaneously, letting one of them (the “cold”) chain to be the one that counts, and the others (called “the heated” chains) serve as scouts for the cold chain.

Data set on Golub et al. (Science 286, 531-537, 1999) on cancer classification that has 38 samples and expression profiles of thousands of genes. Samples are classified into acute lymphoblastic and acute myeloid leukemia. 34 further samples available for validation.

Data set of Alon et al. (PNAS 96, 6745-6750,1999) on colon cancer classification that has a total of 62 samples and expression profiles of 2000 genes.

The method is based on using a regression setting and a Bayesian priors to identify the important genes.

The model assigns a prior probability to all possible subsets of genes and then proceeds by searching for those subsets that have high posterior probability.

This method is able to locate small sets of genes that lead to a good classification results.

MCMC and microarray

The MCMC involves 2 steps: (i) A subset of genes is proposed by

stochastically adding or dropping one gene each time.

(ii) This set is then either accepted or rejected with a probability described by Metropolis and Hastings .

If a new set is accepted, then it is subjected to further perturbation.

Leukemia classification: posterior probabilities for single genes in 9 runs

genes

Output: set of genes that differ among the two types of leukemia

Leukemia classification: renormalized posterior probabilities for single genes

Cystatin A and C belong to cystein protease inhibitors. Cystein proteases induce apoptosis of leukemia cells.

IL-8 is a chemokine released in response to an inflammatory stimulus that attracts and activates several types of leukocytes. IL-8 gene is close to several other genes involved in cancer, such as the interferon gamma induced factor.

Adipsin (Complement factor D) is a serine protease secreted by adipocytes involved in antibacterial host defense. Note that IL-8 is also released from human adipose tissue.

Zyxin is a LIM-domain protein localized in focal cell- adhesion.

Some results from leukemia data set

Some results from Colon cancer data set CRP contains 2 copies of LIM domain (acting cell

growth factors); its expression is parallel to that of c-myc that has an abnormally high expression in a wide variety of human tumours. It interacts with cytoskeleton and cell- adhesion proteins.

Myosin regulatory light chain 2 binds calcium ions and regulate contractile activity of several cell types. It is present in several EST cDNA extracted from a large number of tumours.

Regulatory regions discovery using gene expression data

Related genes that have similar expression profiles are likely to have similar regulation mechanisms as there must be a reason why their expression is similar under a variety of conditions (not always true!).

If we cluster the genes by similarities in their expression profiles and take sets of promoter sequences from genes in such clusters, some of these sets of sequences may contain a ‘pattern’ which is relevant to regulation of these genes

Related to Reverse engineering of genetic networks

Combining sequence and gene expression data to find transcription sites

The search for transcription sites often starts from

the assessment of the statistical significance of over- or under-representation of oligomers in complete genomes that may indicate

phenomena of positive or negative selection.

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

…gene1

…gene2

…gene3

…gene4

…gene5

…gene6

…gene7

AAAAGAGTCA

AAATGACTCA

AAGTGAGTCA

AAAAGAGTCA

GGATGAGTCA

AAATGAGTCA

GAATGAGTCA

AAAAGAGTCA

**********

Finding the sequences (protein binding sites) that affect the gene expression

MotifGene expression

2,0~, NXY

Strategy diagram

Rank all genes by expression and obtain their upstream sequences

Find motifs from promoters of most induced and most repressed genes

Score each upstream sequence for matches to each considered motif

Perform variable selection on the significant motifs to find group of motifs acting together to affect expression

MCMC and microarray

The MCMC involves 2 steps: A new subset of motifs is proposed

by stochastically adding or dropping one motif each time.

This set is then either accepted or rejected with a probability described by Metropolis and Hastings .

Gene expression data, combined with sequence analysis, seem to represent a promising method to identify transcription sites.

We consider both yeast sequences and expression data and outputs statistically significant motifs: the regions upstream genes contribute to the log-expression level of a gene.

First we have first identified a number of sequence patterns upstream the yeast genes and then tested for candidate transcription sites.

0 2 4 6 8 10 12 14 16 18 200

0.5

1

1.5

2

2.5

Finding all cell-cycle genes from gene expression data

Time measure

Intensity

Periodically expressed genes

Non periodically expressed genes

Cell-cycle experiments

J.B. Fourier (1768-1830)

time frequency

0 2 4 6 8 10 12 14 16 18 200.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

Suppose Yi(t) denotes the expression of gene i at time t.

For all genes the vector Yi(t) was fit first with least squares to Yi(t)=aiS(t)+biC(t)+Ri(t), where S(t)=sin(2t/T) and C(t)=cos(2t/T) with T is the period of the cell cycle.

Yi(t) can be decomposed into a periodic component Zi(t)=aiSi(t)+biCi(t) and a component Ri(t) that is aperiodic.

Regression using Fourier terms

The proportion of variance explained by the Fourier basis (Fourier proportion of variance explained -PVE) is the ratiomi=var(Zi(t))/var(Yi(t)), which can range from 0 to 1.

Values closer to 1 indicate greater sinusoidal expression, whereas values closer to 0 indicate lack of periodicity or periodicity with a period that is substancially different.

The fitted waveform Z(t) resembles a sine wave. For each gene, the phase of Z(t) (equal to time of maximal expression) was determined.

0 2 4 6 8 10 12 14 16 18 200.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

0 2 4 6 8 10 12 14 16 18 20-0.5

0

0.5

1

1.5

2

2.5

3

3.5

0 2 4 6 8 10 12 14 16 18 200.7

0.8

0.9

1

1.1

1.2

1.3

1.4

1.5

1.6

1.7

0 2 4 6 8 10 12 14 16 18 200.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

0 2 4 6 8 10 12 14 16 18 20-0.5

0

0.5

1

1.5

2

2.5

3

3.5

Wavelets A wavelet is an oscillation that decays quickly

are related to fractals in that the same shapes repeat at different orders of magnitude.

outperforms a Fourier when the signal under consideration contains discontinuities and sharp spikes

differ from Fourier methods in that they allow the localization of a signal in both time and frequency

A wavelet is an oscillation that decays quicklyWavelet basis are related to fractals in that the same shapes repeat at different orders of magnitude

A function can be represented in wavelet domain by the an infinite series expansion of dilated and translated version of a function :

Each wavelet function jk is multiplied by a wavelet coefficient

J, scale parameterK, position parameter

When modeling time/frequency phenomena no precision in timeand frequency at the same time!

f(t)

t t

Impossible!

t

Instant

frequency

Fourier Local Fourier

t

t

Wavelet transform the time/frequency plane is patched with rectangles of different shape

t

Wavelet methods differ from Fourier methods in that they allow the localization of a signal in both time and frequency

The sum of the squaresof the wavelet coefficientsat each scale is termed a scalogram

Wavelet coefficients: the upper ones describe gross features of the signal, the bottom ones descrive little details

Denoising and extraction of components of figure a

Scale J

Position,k

0 0.5 1 1.5 2 2.5 3

-0.3

-0.2

-0.1

0

0.1

0.2

0.3

Wavelet scale (fine to coarse)

Var

ianc

e

Periodic

The wavelet variance of periodically expressed genes is 100 times larger than that of non periodic genes

Analysis of single time series

Replacing ‘global’ variability with variability over scales allows us to investigate the effects of constraints acting at different time or sequence-length scales

0 0.5 1 1.5 2 2.5 3

-5

-4

-3

-2

-1

0

1

2

3

4

5

x 10-3

Wavelet scale (fine to coarse)

Var

ianc

e

wavelet variance is a scale by scale decomposition of the variance of a signal.

Confidence intervals

Non Periodic

Cell cycle analysis

0 0.5 1-2

0

2

0 0.5 1-2

0

2

0 0.5 1-5

0

5

0 0.5 1-10

0

10

0 0.5 1-10

0

10

0 0.5 1-5

0

5

0 0.5 1-20

-10

0

10

0 0.5 1-20

-10

0

10

0 0.5 1-5

0

5

Errors for Anova in time domain

Errors for Anova in wavelet domain

Reconstruction by an Anova in time domain

Reconstruction by an Anova in wavelet domain

RESULTS:Residual errors

are smaller in wavelet

domain -> BETTER

ESTIMATE OF VARIANCE

USING WAVELETS

Bonferroni method belongs to a class of methods called family-wise type I error rate (FWE), which is the probability of committing at least one error in the family of hypotheses. The main problem with classical procedures, such the Bonferroni correction, which hinder many applications in biology, is that they tend to have substantially less power than uncorrected procedures.

Recent FEW methods have been used in microarray by Dudoit et al (2002)

The multiplicity problem in microarray data does not require a protection against even a single type I error, so that the severe loss of power involved in such protection is not justified. Instead, it may be more appropriate to emphasize the proportion of errors among the identified differentially expressed genes. The expectation of this proportion is the False Discovery Rate (FDR) of Benjamini and Hochberg (1995). The FDR is the expected proportion of true null hypotheses rejected out of the total number of null hypotheses rejected.

FDR false discovery rate

Adaptive - FDR procedure Given a q (for example q=0.05) Order the p-values

p(1) p(2) … p(m)

Let

r = max (i : p(i) i · q / m )

Reject

H0(1), H0

(2), … H0(r)

Benjamini Y, Hochberg Y (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57:289--300.

Benjamini, Y, Yekutieli D (2003). The control of the false discovery rate under dependence. Annals of Statistics. To appear.

Thanks toAlvis BrazmaMarina VannucciNick Goldman

Documents

Using microarray for identifying regulatory motifs and analysing time series gene expression Outline: Using Bayesian statistics in classifying genes in